Before each experiment, your agent reviews sources: the current train.py, the experiment history in results.tsv, and the research instructions in program.md.
From these, it forms a hypothesis: "What if I halve the batch size?" or "What if I add warmup?" The hypothesis is a specific, testable change to train.py.
The system is a greedy hill-climber. It tries one change at a time and checks if the metric improved. There's no explicit explore-vs-exploit tradeoff built in. The LLM's reasoning provides implicit exploration by proposing qualitatively different changes each cycle.
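The loop described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `propose_change` stands in for the LLM call, `run_experiment` stands in for launching train.py, and the candidate changes and metric values are invented for the example.

```python
def propose_change(history):
    """Stand-in for the LLM: propose a tweak not tried yet (hypothetical)."""
    candidates = [
        {"batch_size": 32}, {"batch_size": 128},
        {"warmup_steps": 500}, {"lr": 1e-4},
    ]
    tried = [h["change"] for h in history]
    untried = [c for c in candidates if c not in tried]
    return untried[0] if untried else None

def run_experiment(config):
    """Stand-in for running train.py and reading the metric (hypothetical)."""
    base = 0.80
    bonus = 0.05 if config.get("warmup_steps") else 0.0
    return base + bonus

def hill_climb(steps=4):
    """Greedy loop: try one change per cycle, keep it only if the metric improves."""
    best_config, best_metric = {}, run_experiment({})
    history = []  # plays the role of results.tsv
    for _ in range(steps):
        change = propose_change(history)
        if change is None:
            break
        trial = {**best_config, **change}   # one change at a time
        metric = run_experiment(trial)
        history.append({"change": change, "metric": metric})
        if metric > best_metric:            # greedy: accept only improvements
            best_config, best_metric = trial, metric
    return best_config, best_metric
```

Note that exploration here comes entirely from the variety of proposals: the loop itself never revisits or backtracks, matching the greedy design.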