The default learning rates scale by sqrt(768/model_dim). Your agent has several learning rates to tune:
- Embedding LR: default (agents have pushed to )
- Unembedding LR: default (pushed to for a gain)
- Matrix parameters (Muon): default
- Per-layer scalars: default
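The sqrt scaling rule above can be sketched as a small helper. This is a minimal illustration, not the agent's actual code; the function name and the base LR value are assumptions.

```python
import math

def scaled_lr(base_lr: float, model_dim: int, ref_dim: int = 768) -> float:
    """Scale a reference learning rate by sqrt(ref_dim / model_dim).

    Wider models get a proportionally smaller LR; at model_dim == ref_dim
    the factor is 1.0 and the default is unchanged.
    """
    return base_lr * math.sqrt(ref_dim / model_dim)

# Hypothetical base LR of 0.3, tuned at model_dim = 768:
print(scaled_lr(0.3, 768))   # unchanged at the reference width
print(scaled_lr(0.3, 1536))  # reduced by sqrt(1/2) for a 2x-wider model
```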
One session found that embedding LR worked with weight decay but was better without it. Setting FINAL_LR_FRAC = 0.05 (a nonzero learning-rate floor) also helped. These findings are fragile: the same LR can help or hurt depending on what other changes came before it.
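A nonzero LR floor means the schedule decays toward FINAL_LR_FRAC of the peak rate rather than to zero. The constant name follows the source; the linear-decay schedule shape and warmup handling below are assumptions for illustration.

```python
FINAL_LR_FRAC = 0.05  # final LR is 5% of peak instead of decaying to zero

def lr_multiplier(step: int, total_steps: int, warmup_steps: int = 0) -> float:
    """Return the LR multiplier at `step`: linear warmup, then linear
    decay from 1.0 down to FINAL_LR_FRAC (never below the floor)."""
    if warmup_steps and step < warmup_steps:
        return step / warmup_steps
    # Fraction of the decay phase remaining, from 1.0 down to 0.0.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return FINAL_LR_FRAC + (1.0 - FINAL_LR_FRAC) * remaining
```

With a framework scheduler this would typically plug into something like PyTorch's `LambdaLR`, which multiplies the base LR by this factor each step.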