The default optimizer is MuonAdamW, a hybrid that applies two different algorithms to different parameter types.
Muon handles 2D matrix parameters using SGD with momentum, orthogonalizing each update with Newton-Schulz iterations. AdamW handles everything else: embeddings, the unembedding layer, and scalar parameters with betas and .
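To make the orthogonalization step concrete, here is a minimal pure-Python sketch of a Muon-style update. It is an illustration under stated assumptions, not the project's implementation: it uses the classic cubic Newton-Schulz iteration, whereas production Muon code may use a different polynomial and a Nesterov-style momentum. The names `newton_schulz_orthogonalize` and `muon_like_step`, and the `lr` and `beta` arguments, are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g.

    Assumption: this is the textbook cubic iteration X <- 1.5*X - 0.5*X*X^T*X.
    Scaling by the Frobenius norm first keeps every singular value in (0, 1],
    which is inside the region where the iteration converges.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / (norm + 1e-12) for v in row] for row in g]
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, cubic)]
    return x


def muon_like_step(weight, grad, momentum_buf, lr, beta):
    """One update: accumulate momentum, orthogonalize it, take an SGD step.

    lr and beta are caller-supplied; no tuned values are implied here.
    """
    momentum_buf = [
        [beta * m + gr for m, gr in zip(rm, rg)] for rm, rg in zip(momentum_buf, grad)
    ]
    update = newton_schulz_orthogonalize(momentum_buf)
    weight = [[w - lr * u for w, u in zip(rw, ru)] for rw, ru in zip(weight, update)]
    return weight, momentum_buf
```

The effect is that the update keeps the directions of the momentum matrix but equalizes its singular values, so every direction moves by roughly the same amount regardless of the gradient's scale.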
Agents have tuned AdamW beta values, muon_beta2, and weight decay distribution across parameter types. In the SkyPilot multi-GPU experiments, pushing muon_beta2 to was the highest-impact late-stage change.
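The split by parameter type, with separate settings per type, can be sketched as follows. This is a hedged illustration rather than the repository's code: the name-based matching (`"embed"`, `"unembed"`), the parameter names, the function names, and the shape of the `settings` dictionary are all assumptions, and the `"scalar"` bucket here simply catches anything that is not a 2D matrix. Any values passed in are placeholders, not the tuned settings.

```python
# Which optimizer each parameter kind is routed to, following the text:
# matrices go to Muon, everything else goes to AdamW.
OPTIMIZER_FOR_KIND = {
    "matrix": "muon",
    "embedding": "adamw",
    "unembedding": "adamw",
    "scalar": "adamw",
}


def classify(name, shape):
    """Map a parameter to a kind (hypothetical naming convention)."""
    if "unembed" in name:
        return "unembedding"
    if "embed" in name:
        return "embedding"
    if len(shape) == 2:
        return "matrix"
    return "scalar"  # anything that is not a 2D matrix


def build_param_groups(param_shapes, settings):
    """Group parameter names by kind and attach per-kind settings.

    param_shapes: dict of parameter name -> shape tuple.
    settings: dict of kind -> dict of hyperparameters (e.g. weight decay).
    """
    groups = {}
    for name, shape in param_shapes.items():
        kind = classify(name, shape)
        if kind not in groups:
            groups[kind] = {
                "optimizer": OPTIMIZER_FOR_KIND[kind],
                "params": [],
                **settings.get(kind, {}),
            }
        groups[kind]["params"].append(name)
    return groups


# Placeholder values for illustration only; they are not the tuned settings.
example_settings = {"matrix": {"weight_decay": 0.1}, "embedding": {"weight_decay": 0.0}}
example_shapes = {
    "wte.embed": (1000, 64),
    "block0.attn.w_q": (64, 64),
    "block0.gain": (64,),
    "lm_head.unembed": (64, 1000),
}
example_groups = build_param_groups(example_shapes, example_settings)
```

Keeping settings keyed by kind makes it straightforward to vary one knob, such as weight decay for a single parameter type, without touching the others.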