Models don't see text. They see tokens. Your data gets tokenized before training.
A response measured in characters becomes a much smaller number of tokens; for typical English text, one token covers roughly four characters, though the exact ratio depends on the tokenizer. Different models tokenize differently: Llama and Mistral use different tokenizers, so the same text yields different token counts. Always use your base model's tokenizer when preparing data.
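To make the difference concrete, here is a minimal sketch using the Hugging Face `transformers` library (an assumption; the source doesn't name a toolkit). The model IDs are illustrative, and some repositories are gated, so substitute whatever base model you're actually fine-tuning.

```python
# Sketch: the same text tokenized by two different base models.
# Model IDs are examples only; gated repos require access approval.
from transformers import AutoTokenizer

text = "Fine-tuning data should be tokenized with the base model's tokenizer."

for model_name in ["meta-llama/Llama-2-7b-hf", "mistralai/Mistral-7B-v0.1"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    n_tokens = len(tokenizer.encode(text))  # token count differs per tokenizer
    print(f"{model_name}: {n_tokens} tokens")
```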
Sequences exceeding the model's context length get truncated, silently discarding the end of the example. Check the token length of your longest examples before training.
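A quick way to run that check is sketched below, under assumptions not in the source: Hugging Face `transformers`, a `train.jsonl` file with a `"text"` field, and a 4096-token context window. Adjust all three to match your setup.

```python
# Sketch: flag training examples that exceed the model's context length.
import json
from transformers import AutoTokenizer

MAX_CONTEXT = 4096  # assumed context length; use your base model's actual limit
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example ID

lengths = []
with open("train.jsonl") as f:  # hypothetical data file, one JSON object per line
    for line in f:
        example = json.loads(line)
        lengths.append(len(tokenizer.encode(example["text"])))

too_long = [n for n in lengths if n > MAX_CONTEXT]
print(f"longest example: {max(lengths)} tokens")
print(f"{len(too_long)} of {len(lengths)} examples exceed {MAX_CONTEXT} tokens")
```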