Your agent will fail in predictable ways.
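CUDA OOM: The agent increases model size too aggressively. Doubling the model width, for example, roughly quadruples every weight matrix whose two dimensions both scale with width, and the run dies with an immediate out-of-memory error. Logged as a crash, reverted automatically.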
NaN losses: Aggressive learning rate changes or unstable architectures cause numerical blowup. Also treated as a crash.
Compound changes: The agent makes multiple changes in one commit. If val_bpb improves, you can't tell which change caused it. If it doesn't, a good idea bundled with a bad one gets discarded.
Metric gaming: The agent changes the random seed for a tiny val_bpb gain. It's exploiting seed-to-seed noise, not finding a real improvement.
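
One way to guard against the crash and metric-gaming cases is to classify every run before deciding whether to keep its commit. The sketch below is a minimal illustration, not the actual harness: `RunResult`, `classify`, the `k` threshold, and the idea of estimating a noise floor from a few baseline runs that differ only in seed are all assumptions introduced here. It also assumes that lower val_bpb is better.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RunResult:
    """Outcome of one experiment (hypothetical structure for illustration)."""
    val_bpb: Optional[float]     # None if the run died before reporting a metric
    error: Optional[str] = None  # exception message, e.g. an out-of-memory error


def classify(result: RunResult,
             baseline_bpb: float,
             seed_bpbs: Sequence[float],
             k: float = 2.0) -> str:
    """Return 'crash', 'improvement', 'regression', or 'noise'."""
    # OOM or any other exception: the run produced no usable metric.
    if result.error is not None or result.val_bpb is None:
        return "crash"
    # NaN or inf losses show up as a non-finite metric.
    if not math.isfinite(result.val_bpb):
        return "crash"
    # Noise floor: spread of val_bpb across baseline runs that differ only in seed.
    noise = statistics.stdev(seed_bpbs)
    delta = baseline_bpb - result.val_bpb  # positive means lower (better) val_bpb
    if delta > k * noise:
        return "improvement"
    if delta < -k * noise:
        return "regression"
    return "noise"


# Illustrative numbers only: five baseline runs with different seeds.
seeds = [1.000, 1.002, 0.998, 1.001, 0.999]
print(classify(RunResult(val_bpb=0.999), 1.0, seeds))
print(classify(RunResult(val_bpb=0.990), 1.0, seeds))
```

A gain that lands in the "noise" bucket is exactly what a seed change can produce, so it should not count as progress. Compound changes are a separate problem that this check cannot detect; limiting each commit to a single change has to be enforced elsewhere.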