The default transformer's shape is set by two knobs: the number of layers (depth) and the aspect ratio (model dimension per layer). Their product gives the model dimension, and your agent can change either.
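As a minimal sketch of that relationship (the function name and the example values are illustrative assumptions, not the repo's actual defaults):

```python
# Sketch of the depth / aspect-ratio / model-dimension relationship.
# Names and example values are illustrative, not the repo's defaults.

def model_dim(depth: int, aspect_ratio: int) -> int:
    """Model dimension is depth times aspect ratio (width per layer)."""
    return depth * aspect_ratio

# Example: a 20-layer model at aspect ratio 64 has dimension 1280.
print(model_dim(20, 64))  # 1280
```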
Adding one extra layer (while dropping the aspect ratio so the model dimension stays roughly constant) has been a consistent win, improving val_bpb across sessions. Going deeper than that fails: the fixed minute-scale budget doesn't allow enough training steps for the deeper model to benefit.
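A sketch of that dimension-preserving swap (names and numbers are hypothetical; a real config would also round the dimension to a multiple of the attention head size):

```python
# Sketch: pick a new aspect ratio that keeps the model dimension
# (depth * aspect_ratio) roughly constant when the depth changes.
# Function name and example values are hypothetical.

def rebalanced_ratio(depth: int, aspect_ratio: int, new_depth: int) -> float:
    """Aspect ratio that preserves depth * aspect_ratio at new_depth layers."""
    return depth * aspect_ratio / new_depth

# Example: one extra layer on a 20-layer, ratio-64 model (dimension 1280)
# suggests a ratio near 61, leaving the dimension almost unchanged.
print(rebalanced_ratio(20, 64, 21))  # ~60.95 -> round to 61; dim 21*61 = 1281
```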
Width matters more at scale. In the SkyPilot multi-GPU experiments, scaling up the aspect ratio outperformed all prior hyperparameter tuning combined.
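One way to see why width can dominate: in a standard GPT-style block with 4x MLP expansion (embeddings and norms ignored), parameters grow roughly as 12·d² per layer, so widening scales capacity quadratically while deepening scales it only linearly. A back-of-the-envelope sketch, assuming that vanilla block shape rather than this repo's exact architecture:

```python
# Rough parameter count for a stack of standard transformer blocks,
# ignoring embeddings and norms: each block has ~4*d^2 attention weights
# (Q, K, V, output projections) plus ~8*d^2 MLP weights (4x expansion).
# Assumes a vanilla GPT-style block; the actual architecture may differ.

def approx_params(depth: int, aspect_ratio: int) -> int:
    d = depth * aspect_ratio    # model dimension
    return 12 * depth * d * d   # ~12 * d^2 per layer

# Widening grows parameters quadratically in d; deepening only linearly.
print(approx_params(20, 64))   # ~393M (illustrative baseline)
print(approx_params(21, 61))   # ~414M: one extra layer at similar dim, ~5% more
print(approx_params(20, 128))  # ~1.57B: doubling width ~4x the block params
```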