Make your training reproducible:
Set random seeds for Python, NumPy, and PyTorch
Version your data and code
Log all hyperparameters
Use deterministic algorithms when possible
Perfect reproducibility is hard with multi-GPU training. But close is better than nothing. When something works, you should be able to do it again.