In long runs like the SkyPilot -GPU experiment ( experiments in hours), later experiments produce diminishing returns. The last experiments yielded less than total improvement. At that stage, your agent tends to fall back on seed swaps and epsilon adjustments, exploiting nondeterminism rather than finding real gains.
You can detect this in results.tsv. Look for kept experiments where the description says "changed seed" or "adjusted epsilon" and the val_bpb delta is smaller than . If you see a streak of these, your agent has stopped doing real research. It's gaming the metric. The fix: add a minimum improvement threshold to your program.md.
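The check described above can be automated. What follows is a minimal sketch, not a definitive implementation. It assumes results.tsv is tab-separated with a header row containing `val_bpb`, `status`, and `description` columns, that kept experiments are marked `keep`, and that lower val_bpb is better. The function names, the `min_delta` parameter, and the `min_streak` default are hypothetical; set `min_delta` to whatever threshold you consider indistinguishable from noise.

```python
import csv
import io
import re

# Descriptions that suggest the agent is exploiting nondeterminism.
# The keyword list is an assumption; extend it to match your own logs.
NOISE_PATTERN = re.compile(r"seed|epsilon", re.IGNORECASE)


def load_kept(tsv_text):
    """Parse results.tsv content and return only the kept experiments."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["status"].strip().lower() == "keep"]


def find_noise_streaks(kept, min_delta, min_streak=3):
    """Return runs of consecutive kept experiments that look like noise.

    An experiment counts as suspicious when its description matches
    NOISE_PATTERN and its val_bpb improvement over the previous kept
    experiment is smaller than min_delta.
    """
    streaks, current = [], []
    prev_bpb = None
    for row in kept:
        bpb = float(row["val_bpb"])
        if prev_bpb is not None:
            improvement = prev_bpb - bpb  # lower val_bpb is better
            matches = NOISE_PATTERN.search(row["description"]) is not None
            if matches and improvement < min_delta:
                current.append(row["description"])
            else:
                # A genuine change ends the current streak.
                if len(current) >= min_streak:
                    streaks.append(current)
                current = []
        prev_bpb = bpb
    if len(current) >= min_streak:
        streaks.append(current)
    return streaks
```

To use it, read the file with `open("results.tsv").read()`, pass the text to `load_kept`, and call `find_noise_streaks` on the result with your chosen `min_delta`. A non-empty return value is a signal that it is time to add the minimum improvement threshold to program.md.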