The SkyPilot team ran AutoResearch on GPUs for hours. The agent submitted about experiments total, with roughly returning valid results. val_bpb dropped from to , a % improvement.
The biggest single finding: scaling model width mattered more than any hyperparameter. The agent tested aspect ratios from to in a single wave and found that wider models outperformed all prior tuning combined. On a single GPU, this would have taken + sequential experiments. With GPUs, the agent saw the full trend in one round.