Goodhart's Law states: when a measure becomes a target, it stops being a good measure. Your AutoResearch agent optimizes val_bpb. That makes val_bpb a target. And targets get gamed.
The simplest exploit is changing the random seed. Training has nondeterminism from dropout, data shuffling, and GPU floating-point order. A different seed can shift val_bpb by a small amount without any real improvement. The keep-or-revert check has no tolerance band: an improvement of any size, however tiny, counts. So your agent can accumulate "improvements" that are just noise.
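The dynamic can be sketched in a toy simulation. Everything here is hypothetical: `measure_val_bpb` stands in for a full training run, the ±0.002 noise magnitude and the 0.005 tolerance band are illustrative choices, not measured values from any real system.

```python
import random

def measure_val_bpb(true_bpb, seed, noise=0.002):
    # Stand-in for a full training run: the measured val_bpb is the
    # true value plus seed-dependent noise (dropout, data shuffling,
    # GPU floating-point summation order). Magnitude is illustrative.
    rng = random.Random(seed)
    return true_bpb + rng.uniform(-noise, noise)

def keep(best, candidate, tolerance=0.0):
    # Keep-or-revert check: accept the candidate only if it beats the
    # current best by more than `tolerance`. With tolerance=0.0, any
    # numerically smaller value counts as an "improvement".
    return candidate < best - tolerance

# An agent that changes nothing but the seed can still rack up wins
# under the no-tolerance check; a tolerance band wider than the seed
# noise (here 0.005 > 2 * 0.002) rejects every one of them.
true_bpb = 1.0
best_loose = best_strict = measure_val_bpb(true_bpb, seed=0)
wins_loose = wins_strict = 0
for seed in range(1, 101):
    candidate = measure_val_bpb(true_bpb, seed)   # no real change
    if keep(best_loose, candidate):               # tolerance = 0
        best_loose, wins_loose = candidate, wins_loose + 1
    if keep(best_strict, candidate, tolerance=0.005):
        best_strict, wins_strict = candidate, wins_strict + 1
```

Because the noise is bounded by ±0.002, two seed-only runs can never differ by more than 0.004, so the strict check never fires; the loose check, by contrast, happily chases the running minimum of pure noise.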