Short examples waste compute. If your max length is 2048 tokens but most examples are around 200, then roughly 90% of each batch is padding tokens that contribute nothing to training.
Packing combines multiple short examples into one sequence with separators. This improves GPU utilization dramatically.
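A minimal sketch of greedy packing, assuming examples are already tokenized into lists of ids, each example fits within the max length, and a hypothetical `eos_id` serves as the separator:

```python
def pack_examples(examples, max_len, eos_id):
    """Greedily concatenate tokenized examples into sequences of at
    most max_len tokens, appending eos_id after each example."""
    packed, current = [], []
    for ex in examples:
        # +1 accounts for the EOS separator appended after the example.
        if len(current) + len(ex) + 1 > max_len:
            packed.append(current)
            current = []
        current.extend(ex + [eos_id])
    if current:
        packed.append(current)
    return packed

sequences = pack_examples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=8, eos_id=0)
# → [[1, 2, 3, 0, 4, 5, 0], [6, 7, 8, 9, 0]]
```

A production version would also handle examples longer than `max_len` (truncate or skip) and could bin-pack rather than pack greedily to reduce leftover slack.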
Be careful with packing in conversational data. You don't want the model to learn to continue one conversation into another. Use explicit separator tokens, and mask attention across example boundaries so tokens from one conversation cannot attend to tokens from the previous one.
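One way to enforce that separation is a block-diagonal causal mask built from per-token segment ids. This is a sketch, assuming a hypothetical `segment_ids` list that labels each token in the packed sequence with the index of its source example:

```python
def packed_causal_mask(segment_ids):
    """Return an n x n boolean mask where position i may attend to
    position j only if j is not in the future (j <= i) and both
    tokens come from the same packed example."""
    n = len(segment_ids)
    return [
        [segment_ids[i] == segment_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]

# Two examples packed into one sequence: tokens 0-2 vs tokens 3-4.
mask = packed_causal_mask([0, 0, 0, 1, 1])
# mask[3][2] is False: example 1 cannot attend back into example 0.
```

Frameworks differ in how they consume this: some take a dense mask like the above, others take segment ids directly, and position ids typically also need to restart at each example boundary.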