The community results from Discussion # and Discussion # tell you something about convergence. Both sessions started from the same baseline (val_bpb). Both independently discovered that halving the batch size and adding an extra layer helped.
But they diverged after that. Discussion # found gains from tuning sliding-window ratios. Discussion # found gains from applying weight decay to embeddings and from initialization scaling. Different agents, different orderings, different local optima. The early wins are reproducible. The later wins depend on the path your agent took to get there.
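The weight-decay-on-embeddings tweak is easy to state concretely. Here is a minimal sketch of decoupled (AdamW-style) weight decay applied to an embedding row; the function name and the learning-rate and decay values are illustrative, not either session's actual config:

```python
def decoupled_weight_decay_step(params, grads, lr=0.01, wd=0.1, decay=True):
    """One SGD step with optional decoupled weight decay.

    Decoupled means the decay term is applied directly to the weights,
    separately from the gradient update, as in AdamW.
    """
    out = []
    for w, g in zip(params, grads):
        w_new = w - lr * g            # ordinary gradient step
        if decay:
            w_new -= lr * wd * w      # shrink toward zero, independent of the gradient
        out.append(w_new)
    return out

# A toy "embedding row" with zero gradient, so only the decay term acts.
emb = [1.0, -2.0, 0.5]
zero_grad = [0.0, 0.0, 0.0]

with_decay = decoupled_weight_decay_step(emb, zero_grad, decay=True)
without_decay = decoupled_weight_decay_step(emb, zero_grad, decay=False)
```

Without decay the row is untouched; with it, every entry is pulled slightly toward zero each step. Frameworks usually exclude embeddings from weight decay by default (via optimizer parameter groups), which is why opting them in is a deliberate tweak rather than the usual setting.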