Short examples waste compute. If your maximum sequence length is long but most examples are far shorter, the bulk of every padded batch is wasted on padding tokens.
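To see how much is wasted, you can compute the padding fraction directly. The helper below is a hypothetical sketch (`padding_fraction` is not from any library): it assumes every example is padded or truncated to `max_len`.

```python
def padding_fraction(lengths, max_len):
    """Fraction of tokens in a padded batch that are padding.

    lengths: token counts of the raw examples (hypothetical input).
    max_len: the fixed sequence length everything is padded to.
    """
    total = len(lengths) * max_len            # tokens the batch occupies
    used = sum(min(l, max_len) for l in lengths)  # tokens that are real data
    return 1 - used / total
```

For instance, a batch of eight 100-token examples padded to 2048 is about 95% padding.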
Packing combines multiple short examples into one sequence, joined by separator tokens, so each batch position carries real data instead of padding. This dramatically improves GPU utilization.
Be careful when packing conversational data: without proper separator tokens (and, ideally, an attention mask that blocks attention across example boundaries), the model can learn to continue one conversation into the next.
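A minimal sketch of greedy packing with separators, assuming examples are already tokenized to lists of ids; `pack_examples` and `sep_id` are names chosen for illustration, not a library API:

```python
def pack_examples(examples, max_len, sep_id):
    """Greedily pack tokenized examples into sequences of at most max_len,
    inserting sep_id between examples so boundaries stay visible."""
    packed, cur = [], []
    for ex in examples:
        ex = ex[:max_len]                     # truncate a single over-long example
        extra = len(ex) + (1 if cur else 0)   # +1 for the separator token
        if cur and len(cur) + extra > max_len:
            packed.append(cur)                # flush the full sequence
            cur = []
        if cur:
            cur.append(sep_id)
        cur.extend(ex)
    if cur:
        packed.append(cur)
    return packed
```

With `max_len=5` and `sep_id=0`, the inputs `[1,2]`, `[3,4]`, `[5,6,7]` pack into `[1,2,0,3,4]` and `[5,6,7]`. In a real training setup you would also zero out cross-example attention (a block-diagonal mask) rather than rely on the separator alone.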