After enough experiments, your agent runs out of high-impact changes. You'll see it in results.tsv: long streaks of "discard" entries with similar descriptions. The agent starts making micro-adjustments. It tweaks epsilon from 1e-10 to 1e-8. It swaps one random seed for another to pick up a tiny gain.
This is your agent hitting a local minimum. Greedy hill-climbing can't escape it. Every single-step change makes the metric worse, so every experiment gets reverted. The agent is still running, still trying, but producing nothing.
In the SkyPilot experiments, this was visible: a stretch of consecutive experiments each yielded only a marginal improvement.
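One rough way to spot this kind of plateau without reading the log by hand is to count how many rows at the end of results.tsv are "discard". The sketch below assumes the file is tab-separated with a header row and a column named `status`; that column name, the `threshold` default, and both function names are illustrative assumptions, not something the tooling defines.

```python
import csv
import io


def trailing_discard_streak(tsv_text, status_column="status"):
    """Count consecutive 'discard' rows at the end of a results.tsv dump.

    Assumes a tab-separated file with a header row; `status_column` is a
    hypothetical column name, so adjust it to match your own log.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    statuses = [(row.get(status_column) or "").strip().lower() for row in reader]

    streak = 0
    # Walk backwards from the newest row and stop at the first non-discard.
    for status in reversed(statuses):
        if status != "discard":
            break
        streak += 1
    return streak


def looks_stuck(tsv_text, threshold=10):
    """Heuristic only: treat a long trailing discard streak as a plateau."""
    return trailing_discard_streak(tsv_text) >= threshold
```

To use it, read the file with `open("results.tsv").read()` and pass the text in. The threshold is a judgment call: set it too low and ordinary bad luck looks like a plateau, too high and the agent burns a lot of compute before you notice.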