The single largest improvement agents discover is halving the batch size. The default TOTAL_BATCH_SIZE is (roughly tokens per optimizer step). Cutting it in half gave a val_bpb improvement of in one session and in another.
The reason is counterintuitive. Smaller batches mean more gradient updates within the same -minute window: under a fixed wall-clock budget, taking more optimizer steps outweighs taking fewer steps with larger, lower-variance gradient estimates. Your agent discovers this purely through trial and error, with no understanding of why it works.
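The arithmetic behind the trade-off can be sketched as follows. All numbers here are hypothetical stand-ins (the throughput, time budget, and batch sizes are assumptions, not values from the actual runs), and the model assumes token throughput is roughly constant regardless of batch size, which ignores per-step overhead:

```python
# Hypothetical values for illustration only -- not the actual run's numbers.
THROUGHPUT_TOKENS_PER_SEC = 500_000   # assumed constant GPU token throughput
TIME_BUDGET_SEC = 30 * 60             # assumed fixed wall-clock window

def optimizer_steps(total_batch_size: int) -> int:
    """Optimizer steps that fit in the window.

    Throughput is measured per token, so under a fixed time budget the
    total token count is fixed, and halving the batch size roughly
    doubles the number of optimizer steps.
    """
    total_tokens = THROUGHPUT_TOKENS_PER_SEC * TIME_BUDGET_SEC
    return total_tokens // total_batch_size

for bs in (1_048_576, 524_288):  # example full vs. halved batch size
    print(f"batch {bs}: {optimizer_steps(bs)} steps")
```

Halving the batch doubles the step count, and in this regime the extra steps buy more loss reduction than the noisier per-step gradients cost. In practice the win is smaller than 2x, since smaller batches carry more per-step overhead.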