val_bpb stands for validation bits per byte. Here's how it works: take the cross-entropy loss on the validation set (measured in nats), convert nats to bits by dividing by ln 2, then normalize by the total number of bytes the evaluated tokens decode to.
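The computation above can be sketched in a few lines. This is a minimal illustration, not the source's implementation; the function name and per-token inputs are assumptions for the example.

```python
import math

def val_bpb(token_losses_nats, token_byte_lengths):
    """Bits per byte over a validation set.

    token_losses_nats: per-token cross-entropy losses, in nats.
    token_byte_lengths: number of UTF-8 bytes each token decodes to.
    """
    total_nats = sum(token_losses_nats)
    total_bytes = sum(token_byte_lengths)
    # Divide by ln(2) to convert nats to bits, then normalize by bytes.
    return total_nats / (math.log(2) * total_bytes)

# Example: two tokens with losses 2.0 and 1.0 nats, decoding to 4 and 2 bytes.
bpb = val_bpb([2.0, 1.0], [4, 2])
```

Because the denominator counts bytes of the underlying text rather than tokens, the result does not change if the same text is split into more or fewer tokens.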
Why not just use cross-entropy loss directly? Because cross-entropy depends on vocabulary size. If your agent swaps in a tokenizer with a different vocabulary size, raw loss values aren't comparable anymore: the same text is split into a different number of tokens, so per-token loss shifts even when the model compresses the text equally well. When you normalize by byte-lengths, that dependency goes away. Two experiments with different vocabularies produce val_bpb scores you can compare directly.
Lower val_bpb means better compression of the validation text. That's the number your agent tries to minimize.