When Karpathy assembled agents (Claude, Codex) with dedicated GPUs, unexpected behaviors appeared.
Agents handled execution well: given a clear hypothesis, they edited code, ran training, and logged results reliably. But they struggled with creative hypothesis generation. They drew spurious conclusions, such as attributing an improvement to model size when the real cause was longer training time. And they ignored resource constraints, launching experiments that crashed repeatedly.
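The spurious attribution above is a classic confound: the runs being compared differed in both model size and training time, so neither factor can be credited for the gain. A minimal sketch (all names hypothetical, not from the source) of the controlled ablation a human researcher would insist on:

```python
# Hypothetical experiment configs: (model_size, train_steps).
# Confounded comparison: the two runs differ in BOTH factors,
# so an improvement cannot be attributed to either one.
confounded = [("small", 10_000), ("large", 50_000)]

# Controlled ablation: vary model size, hold training steps fixed.
model_sizes = ["small", "large"]
controlled = [(size, 50_000) for size in model_sizes]

def differs_in_one(run_a, run_b):
    """True if exactly one factor differs between two runs."""
    return sum(a != b for a, b in zip(run_a, run_b)) == 1

assert differs_in_one(*controlled)       # valid ablation
assert not differs_in_one(*confounded)   # confounded comparison
```

Designing which factor to isolate next is exactly the hypothesis-generation step the agents struggled with; checking that a comparison is unconfounded is mechanical, choosing which comparison is worth running is not.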
The lesson: agents close the execution loop well, but they don't yet replace the human who designs the research direction. That's why program.md still matters at scale.