In the SkyPilot run, the agent discovered a hardware optimization on its own. Nobody told it to do this.
The cluster had Hs ( GB VRAM, about ms per training step) and Hs ( GB VRAM, about ms per step). The agent learned that Hs completed about % more training steps in the same -minute window. So it started screening ideas on Hs and promoting the best candidates to Hs for final validation.
The H advantage was small but consistent. The agent figured this out through trial and error, then systematically exploited it.