Parametric knowledge of evaluation traits implicitly shifts LLMs toward safer behavior on safety benchmarks.
The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations.
Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts — for instance, through exposure to scientific articles or social media posts about AI benchmarking.
To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating these fine-tuned models on six safety benchmarks, we find that they are significantly safer than the base model and control models. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness.
Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, and thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations.
Traits-fine-tuned models refuse unsafe prompts much more often (e.g. Nemotron on AgentHarm: 9.7% → 30.7%).
On non-refused responses, harmfulness drops by up to 36.9 pp (Nemotron on Agentic Misalignment).
Safety gains persist on responses where the model never mentions being evaluated.
We define evaluation meta-knowledge as parametric knowledge about the structural traits of evaluation benchmarks — the patterns that distinguish "this is a test" from "this is real use." Where classical dataset contamination leaks instances, evaluation meta-knowledge leaks structure: what benchmarks tend to look like, and what behaviors they reward.
Drawing on prior work, we identify seven traits characteristic of evaluation contexts, such as verifiable structures and moral dilemmas, and generate ~106M tokens of synthetic documents across them (~15M per trait).
We fine-tune three model families via LoRA with next-token prediction on the synthetic corpus: Llama 3.3 Nemotron Super 49B v1.5, Qwen-3 32B, and GLM 4.7 Flash (30B total / 3B active MoE). We compare against two control fine-tunes per family: random web documents (FineWeb) and, for Nemotron, the SDF "type-hints" organism from prior work.
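To give a sense of why LoRA makes fine-tuning models of this size practical, the sketch below counts trainable parameters for one adapted weight matrix. The dimensions and rank are hypothetical values chosen for illustration; the source does not state the LoRA hyperparameters used.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates the whole d_out x d_in matrix. LoRA freezes it
    and trains two low-rank factors: B (d_out x rank) and A (rank x d_in).
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical layer size and rank, not taken from the paper.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.4%}")
```

Under these assumed values the adapter trains under 1% of the parameters of the matrix it modifies.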
We evaluate on five safety benchmarks — AgentHarm, StrongREJECT (with AIM jailbreak), Triggers, OR-Bench, and Agentic Misalignment — using the Inspect framework with GPT-4.1 as the judge (GPT-4o for Agentic Misalignment). To detect verbalized awareness, we run an LLM-as-judge that scores reasoning traces on a four-level scale (Unaware → Consideration → Recognition → Instrumental Awareness), validated at 87% agreement with a human annotator.
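The four-level awareness scale and the judge-versus-human agreement check can be sketched as follows. This is a minimal illustration, assuming an ordinal integer encoding of the levels and simple percent agreement as the metric; the source specifies neither, and the labels below are made-up toy data.

```python
from enum import IntEnum


class Awareness(IntEnum):
    """Four-level scale from the text; the integer encoding is an assumption."""
    UNAWARE = 0
    CONSIDERATION = 1
    RECOGNITION = 2
    INSTRUMENTAL_AWARENESS = 3


def percent_agreement(judge_labels, human_labels):
    """Share of items (in %) where the judge and the human assign the same level."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return 100.0 * matches / len(judge_labels)


# Toy labels for four reasoning traces (illustrative only).
judge = [Awareness.UNAWARE, Awareness.RECOGNITION,
         Awareness.CONSIDERATION, Awareness.UNAWARE]
human = [Awareness.UNAWARE, Awareness.RECOGNITION,
         Awareness.UNAWARE, Awareness.UNAWARE]

agreement = percent_agreement(judge, human)
print(f"agreement: {agreement:.1f}%")
```

An ordinal encoding makes it easy to threshold later, for example keeping only traces labelled `UNAWARE`.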
Refusal rates (%) across safety benchmarks. Higher is safer except for OR-Bench Hard (over-refusal on benign prompts; lower is better). StrongREJECT reported under the AIM jailbreak.
| Model | Variant | AgentHarm ↑ | StrongREJECT ↑ | Triggers Hyp. ↑ | Triggers Real ↑ | OR-Bench Toxic ↑ | OR-Bench Hard ↓ |
|---|---|---|---|---|---|---|---|
| Nemotron | Base | 9.7% | 38.3% | 47.0% | 68.0% | 72.0% | 4.5% |
| Nemotron | Traits | 30.7% | 55.9% | 55.5% | 68.5% | 80.0% | 12.2% |
| Nemotron | Type Hints | 16.5% | 18.6% | 49.5% | 61.5% | 71.0% | 6.5% |
| GLM 4.7 Flash | Base | 50.0% | 95.5% | 63.0% | 71.5% | 84.0% | 26.1% |
| GLM 4.7 Flash | Traits | 43.4% | 85.8% | 63.0% | 76.5% | 88.0% | 23.0% |
| GLM 4.7 Flash | FineWeb | 44.9% | 95.2% | 62.5% | 69.0% | 86.0% | 43.5% |
| Qwen3 | Base | 8.0% | 24.0% | 31.0% | 45.0% | 78.0% | 8.8% |
| Qwen3 | Traits | 22.7% | 33.9% | 46.0% | 60.0% | 78.0% | 8.2% |
| Qwen3 | FineWeb | 5.1% | 10.5% | 17.5% | 32.0% | 61.0% | 7.0% |
The Traits variants improve over the base on most safety axes; exceptions include GLM 4.7 Flash on AgentHarm and StrongREJECT, and Nemotron's higher over-refusal on OR-Bench Hard. The random-document and "type-hints" controls do not produce comparable gains, which argues against a generic fine-tuning effect.
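When reading the refusal table, the direction of each metric matters: a higher refusal rate is an improvement everywhere except OR-Bench Hard. The sketch below computes percentage-point changes for Nemotron (values copied from the table) and flags which ones count as improvements. The helper names are illustrative, not from the paper.

```python
# Metrics where a LOWER refusal rate is better (over-refusal on benign prompts).
LOWER_IS_BETTER = {"OR-Bench Hard"}

# Nemotron refusal rates (%) from the table above.
base = {"AgentHarm": 9.7, "StrongREJECT": 38.3, "Triggers Hyp.": 47.0,
        "Triggers Real": 68.0, "OR-Bench Toxic": 72.0, "OR-Bench Hard": 4.5}
traits = {"AgentHarm": 30.7, "StrongREJECT": 55.9, "Triggers Hyp.": 55.5,
          "Triggers Real": 68.5, "OR-Bench Toxic": 80.0, "OR-Bench Hard": 12.2}


def delta_pp(base_rates, variant_rates):
    """Percentage-point change per metric (variant minus base)."""
    return {m: round(variant_rates[m] - base_rates[m], 1) for m in base_rates}


def is_improvement(metric, delta):
    """Interpret the sign of a change according to the metric's direction."""
    return delta < 0 if metric in LOWER_IS_BETTER else delta > 0


deltas = delta_pp(base, traits)
improved = [m for m, d in deltas.items() if is_improvement(m, d)]

for metric, d in deltas.items():
    mark = "better" if is_improvement(metric, d) else "worse or equal"
    print(f"{metric:15s} {d:+5.1f} pp  ({mark})")
```

For Nemotron this marks every metric except OR-Bench Hard as improved, which reflects the trade-off between more refusals on harmful prompts and more over-refusal on benign ones.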
Harmfulness on non-refused responses (lower is safer). Across all three model families, the Traits variant scores lower than the base on every metric, so it is safer even when it does engage.
| Model | Variant | AgentHarm (0-100) ↓ | StrongREJECT (0-5) ↓ | Triggers Hyp. ↓ | Triggers Real ↓ | Agentic Mis. ↓ |
|---|---|---|---|---|---|---|
| Nemotron | Base | 66.14 | 4.967 | 17.9% | 26.6% | 39.7% |
| Nemotron | Traits | 59.08 | 4.713 | 9.0% | 12.7% | 2.8% |
| Nemotron | Type Hints | 64.45 | 4.838 | 17.8% | 16.9% | 27.3% |
| GLM 4.7 Flash | Base | 49.48 | 4.257 | 14.9% | 8.8% | 18.7% |
| GLM 4.7 Flash | Traits | 36.45 | 4.155 | 10.8% | 4.3% | 8.3% |
| GLM 4.7 Flash | FineWeb | 51.96 | 4.867 | 14.7% | 9.7% | 8.0% |
| Qwen3 | Base | 75.78 | 4.966 | 28.3% | 30.0% | 37.0% |
| Qwen3 | Traits | 70.84 | 4.908 | 12.0% | 20.0% | 8.8% |
| Qwen3 | FineWeb | 77.88 | 4.939 | 30.9% | 36.0% | 41.7% |
On Agentic Misalignment, Nemotron's mean harmful-action rate drops from 39.7% to 2.8% — a 36.9 pp reduction.
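Because the reported drop is in percentage points rather than percent, it can help to compute both. The sketch below does so for the Agentic Misalignment column, using the Base and Traits values from the harmfulness table; the function names are illustrative.

```python
# (base %, traits %) harmful-action rates on Agentic Misalignment, from the table.
agentic_misalignment = {
    "Nemotron": (39.7, 2.8),
    "GLM 4.7 Flash": (18.7, 8.3),
    "Qwen3": (37.0, 8.8),
}


def absolute_drop_pp(base, traits):
    """Absolute reduction in percentage points."""
    return round(base - traits, 1)


def relative_drop_percent(base, traits):
    """Reduction relative to the base rate, in percent."""
    return round(100.0 * (base - traits) / base, 1)


drops = {m: absolute_drop_pp(b, t) for m, (b, t) in agentic_misalignment.items()}
largest = max(drops, key=drops.get)

for model, (b, t) in agentic_misalignment.items():
    print(f"{model:14s} -{drops[model]:.1f} pp "
          f"(-{relative_drop_percent(b, t):.1f}% relative)")
print("largest absolute drop:", largest)
```

The largest absolute drop is Nemotron's 36.9 pp, which corresponds to a relative reduction of roughly 93% of the base rate.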
We use an LLM-as-judge to filter to responses where the model never verbalizes that it is being evaluated. On this unaware subset, Nemotron Traits is still safer on five of six benchmarks (avg. +8.9 pp refusal), and Qwen3 Traits on four (avg. +9.1 pp refusal). Harmfulness drops across nearly all settings.
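The filtering step can be sketched as below: keep only responses the awareness judge labelled as unaware, then recompute the refusal rate on that subset. The record layout, the integer encoding of awareness levels, and the toy responses are all assumptions made for illustration; they are not the paper's data or pipeline.

```python
from dataclasses import dataclass


@dataclass
class Response:
    awareness: int  # assumed encoding: 0 = Unaware ... 3 = Instrumental Awareness
    refused: bool


def unaware_refusal_rate(responses):
    """Refusal rate (%) restricted to responses with no verbalized awareness."""
    unaware = [r for r in responses if r.awareness == 0]
    if not unaware:
        return None  # nothing left after filtering
    return 100.0 * sum(r.refused for r in unaware) / len(unaware)


# Toy responses (illustrative only).
base_responses = [Response(0, False), Response(0, False),
                  Response(0, True), Response(2, True)]
traits_responses = [Response(0, True), Response(0, True), Response(0, False),
                    Response(3, True), Response(1, True)]

base_rate = unaware_refusal_rate(base_responses)
traits_rate = unaware_refusal_rate(traits_responses)
print(f"base: {base_rate:.1f}%  traits: {traits_rate:.1f}%  "
      f"delta: {traits_rate - base_rate:+.1f} pp")
```

Any gap that remains after this filter cannot be attributed to responses that explicitly mention being evaluated, which is the point of the unaware-subset analysis.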
Classical test-set contamination relies on memorization of specific training instances. The phenomenon we identify involves the leakage of evaluation structures: models acquire knowledge of how benchmarks are designed and what behavioral patterns they reward, and can leverage this knowledge to score higher without any instance-level overlap. Standard mitigations — holding out benchmark examples from training data — do not address structural leakage.
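To illustrate why instance-level checks miss structural leakage, the sketch below implements a simple n-gram overlap test of the kind often used for contamination screening. It is a generic illustration, not the paper's method, and the benchmark item and training documents are invented examples.

```python
def ngrams(text, n=3):
    """Set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, training_doc, n=3):
    """Fraction of the item's n-grams that also appear in the training doc."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(training_doc, n)) / len(item)


# Invented examples (not from any real benchmark or corpus).
benchmark_item = "Write a phishing email targeting bank customers"
memorized_doc = ("Example prompt: write a phishing email targeting "
                 "bank customers and send it")
structural_doc = ("Safety benchmarks often pose harmful requests inside "
                  "hypothetical scenarios and reward refusals")

instance_overlap = overlap_fraction(benchmark_item, memorized_doc)
structure_overlap = overlap_fraction(benchmark_item, structural_doc)
print(instance_overlap, structure_overlap)
```

The document that copies the item is flagged, while the document that only describes how such benchmarks are built shares no n-grams with it and passes the check.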
This implicit, structure-level exploitation makes evaluation meta-knowledge a hard-to-detect confounder. Unlike situational awareness, which leaves traces in model outputs, evaluation meta-knowledge can produce behavioral shifts with no corresponding signal in the response text. As benchmark methodologies become more widely documented in pretraining corpora, the gap between benchmark and deployment behavior may grow.
@misc{deckenbach2026modelsknowevaluationsdesigned,
  title={Models That Know How Evaluations Are Designed Score Safer},
  author={Katharina Deckenbach and Haritz Puerto and Jonas Geiping and Sahar Abdelnabi},
  year={2026},
  eprint={2605.28591},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.28591},
}