The default TIME_BUDGET (in seconds) is tuned for a fast datacenter GPU. On slower hardware, you will need to adjust it.
On a consumer RTX card, each experiment takes longer to train. You get fewer gradient steps within the same budget, so improvements from architectural changes are harder to detect. Community forks have tested larger TIME_BUDGET values for slower GPUs.
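One way to think about the adjustment: to preserve roughly the same number of gradient steps, the budget should scale inversely with training throughput. A minimal sketch, assuming a hypothetical helper and illustrative throughput ratios (the function name, the ratio table, and the base value are not from the repository):

```python
# Hypothetical helper: scale the default TIME_BUDGET by a rough
# throughput ratio so slower hardware completes a comparable number
# of gradient steps. All numbers below are illustrative placeholders.

DEFAULT_TIME_BUDGET = 300  # seconds; placeholder, not the real default

# Assumed relative training throughput vs. the reference GPU.
RELATIVE_THROUGHPUT = {
    "reference": 1.0,
    "consumer_rtx": 0.4,
    "apple_silicon": 0.25,
}

def scaled_time_budget(hardware: str, base: int = DEFAULT_TIME_BUDGET) -> int:
    """Return a TIME_BUDGET extended in proportion to how much slower
    the given hardware trains relative to the reference GPU."""
    factor = RELATIVE_THROUGHPUT.get(hardware, 1.0)
    return int(base / factor)
```

With these placeholder ratios, a card that trains at 40% of reference speed would get a budget 2.5x longer than the default.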
On Apple Silicon with MLX, users reduce vocab_size and MAX_SEQ_LEN to fit in memory. The budget stays at its default, but the model is smaller, so each experiment still completes enough training steps to produce meaningful comparisons.
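The memory-reduction approach above can be sketched as a config transform. The `ModelConfig` dataclass, its default values, and the `shrink_for_memory` helper are hypothetical; only the field names `vocab_size` and `MAX_SEQ_LEN` (lowercased here) come from the text:

```python
# Hypothetical sketch of shrinking the model instead of the budget:
# reduce vocab_size and max_seq_len so the model fits in unified
# memory while TIME_BUDGET stays fixed. Defaults are placeholders.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ModelConfig:
    vocab_size: int = 50257   # placeholder default
    max_seq_len: int = 1024   # placeholder default

def shrink_for_memory(cfg: ModelConfig, factor: int = 2) -> ModelConfig:
    """Return a copy of cfg with vocab_size and max_seq_len divided
    by `factor`, trading model capacity for a smaller memory footprint."""
    return replace(
        cfg,
        vocab_size=cfg.vocab_size // factor,
        max_seq_len=cfg.max_seq_len // factor,
    )
```

The design choice here is the opposite of the RTX case: rather than extending wall-clock time, the per-step cost is cut so the unchanged budget still yields enough steps for meaningful comparisons.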