Scale when your single-agent run has plateaued and you suspect parameter interactions are blocking progress. Parallel search can test combinations that sequential search never reaches.
Don't scale when your program.md is poorly written. GPUs running a bad research plan waste x the money. Don't scale when your metric is noisy. Parallel search amplifies noise because more experiments means more chances to keep a lucky result. Don't scale when the improvement curve has already flattened. The SkyPilot run's last experiments ( to ) produced less than total improvement.
Fix your instructions first. Scale second.