Here's what a typical overnight run looks like.
Hours to : High-impact structural changes. Batch size halving, depth adjustments, attention pattern modifications. The keep rate (the share of experiments whose change is kept rather than discarded) is high, and val_bpb drops fast.
Hours to : Refinement. Learning rate tuning, weight decay targeting, warmup and cooldown optimization. Improvements are smaller but consistent.
Hours to : Fine-tuning. Initialization scaling, embedding learning rates, RoPE base frequency adjustments. The keep rate drops.
Hours to +: Diminishing returns. Micro-adjustments, random seed changes, retrying variations of previously failed ideas. Most experiments get discarded.
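One way to see these phases in your own run is to bucket experiments by elapsed time and compute the keep rate for each bucket. The sketch below is only an illustration: the `Experiment` record, the `keep_rate_by_window` helper, the two-hour window, and the sample log are all hypothetical and do not come from any particular tool or real run.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    # Hypothetical record; field names are assumptions for illustration.
    hour: float  # hours elapsed since the run started
    kept: bool   # True if the change was kept, False if discarded


def keep_rate_by_window(experiments, window_hours=2.0):
    """Return {window_start_hour: fraction of experiments kept in that window}."""
    counts = {}
    for exp in experiments:
        # Map each experiment to the start of the window it falls into.
        start = int(exp.hour // window_hours) * window_hours
        kept, total = counts.get(start, (0, 0))
        counts[start] = (kept + int(exp.kept), total + 1)
    return {start: kept / total for start, (kept, total) in sorted(counts.items())}


# Made-up log, purely to show the call shape (not real results).
log = [
    Experiment(0.5, True),
    Experiment(1.5, True),
    Experiment(2.5, False),
    Experiment(3.5, True),
    Experiment(4.5, False),
    Experiment(5.5, False),
]

for start, rate in keep_rate_by_window(log).items():
    print(f"from hour {start:g}: keep rate {rate:.0%}")
```

A falling keep rate across consecutive windows is the signal described above: once most experiments in a window are discarded, the run has likely reached diminishing returns.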