Three things are off-limits in the default setup:
- `prepare.py`: Contains the evaluation function, tokenizer, and data loading. If the agent could edit this, it could game the metric.
- Package installation: No `pip install` or adding dependencies. This prevents the agent from pulling in libraries that make results non-reproducible.
- The evaluation harness: The `evaluate_bpb` function stays fixed so that every experiment is measured on identical ground.
If your agent can change the ruler, it will change the ruler instead of doing better work.
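
If you want to check this rule mechanically rather than trust it, one option is to fingerprint the protected file before the agent runs and compare afterwards. The sketch below is only an illustration, assuming the protected files sit in a single directory; `PROTECTED`, `fingerprint`, `snapshot`, and `changed_files` are hypothetical names, not part of the project.

```python
import hashlib
from pathlib import Path

# Hypothetical list of files the agent must not modify.
PROTECTED = ["prepare.py"]


def fingerprint(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(root, names=PROTECTED):
    """Record a digest for each protected file before the agent runs."""
    root = Path(root)
    return {name: fingerprint(root / name) for name in names}


def changed_files(root, baseline):
    """List protected files that are missing or differ from the baseline."""
    root = Path(root)
    changed = []
    for name, digest in baseline.items():
        path = root / name
        if not path.is_file() or fingerprint(path) != digest:
            changed.append(name)
    return changed
```

A wrapper script could call `snapshot` once before an experiment and discard the result if `changed_files` returns anything afterwards. This only covers file edits; the package-installation rule would need a separate check.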