Preprint · 2026

Models That Know How Evaluations Are Designed Score Safer

Parametric knowledge of evaluation traits implicitly shifts LLMs toward safer behavior on safety benchmarks.

Equal contribution.
Overview: synthetic documents describing evaluation traits are used to fine-tune base models via LoRA next-token prediction; the resulting models are evaluated on safety and capability benchmarks.
Models trained on documents about evaluation traits score safer on benchmarks. We fine-tune three model families on ~106M tokens describing seven evaluation traits, then measure safety on five safety benchmarks and capabilities on three general benchmarks.

Abstract

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations.

Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts — for instance, through exposure to scientific articles or social media posts about AI benchmarking.

To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating these fine-tuned models on six safety benchmarks, we find that they are significantly safer than the base model and control models. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness.

Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, and thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations.

TL;DR

Refusal rate up to +21 pp

Traits-fine-tuned models refuse unsafe prompts much more often (e.g. Nemotron on AgentHarm: 9.7% → 30.7%).

Harmfulness drops sharply

On non-refused responses, harmfulness drops by up to 36.9 pp (Nemotron on Agentic Misalignment).

No verbalization required

Safety gains persist on responses where the model never mentions being evaluated.

The catch: The improvement is driven by knowledge about how evaluations are structured, not by memorizing specific test instances — making it invisible to standard contamination checks and standard output-based audits.

What is evaluation meta-knowledge?

We define evaluation meta-knowledge as parametric knowledge about the structural traits of evaluation benchmarks — the patterns that distinguish "this is a test" from "this is real use." Where classical dataset contamination leaks instances, evaluation meta-knowledge leaks structure: what benchmarks tend to look like, and what behaviors they reward.

Seven evaluation traits

Drawing on prior work, we identify seven traits characteristic of evaluation contexts and generate ~106M tokens of synthetic documents across them (~15M per trait):

Verifiable structure
Conflicting goals
Ethical dilemmas
Harmful requests
Inconsistent environments
Placeholders
Unusual access

Fine-tuning

We fine-tune three model families via LoRA with next-token prediction on the synthetic corpus: Llama 3.3 Nemotron Super 49B v1.5, Qwen-3 32B, and GLM 4.7 Flash (30B total / 3B active MoE). We compare against two kinds of control fine-tunes: random web documents (FineWeb) and, for Nemotron, the SDF "type-hints" organism from prior work.
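As a rough illustration of what LoRA changes during fine-tuning, the sketch below applies a low-rank update to a single frozen weight matrix. The rank, scaling factor, and all matrix values are hypothetical; the page does not state the LoRA hyperparameters that were used.

```python
# Toy LoRA forward pass on one linear layer, in pure Python.
# All sizes and values are illustrative assumptions, not the paper's settings.

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x). W stays frozen; only A and B train."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # A is r x d_in, B is d_out x r
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]        # frozen base weights (d_out=2, d_in=3)
A = [[0.5, -0.5, 0.25]]       # rank r=1 down-projection
B_init = [[0.0], [0.0]]       # standard LoRA init: B starts at zero
B_trained = [[1.0], [-2.0]]   # hypothetical values after training
x = [1.0, 2.0, 4.0]

y_base = matvec(W, x)
y_init = lora_forward(W, A, B_init, x, alpha=2.0, r=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2.0, r=1)
```

Because B starts at zero, the adapted layer initially reproduces the base model exactly, so any later behavioral shift comes from what the low-rank factors learn from the training documents.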

Evaluation

We evaluate on five safety benchmarks — AgentHarm, StrongREJECT (with AIM jailbreak), Triggers, OR-Bench, and Agentic Misalignment — using the Inspect framework with GPT-4.1 as the judge (GPT-4o for Agentic Misalignment). To detect verbalized awareness, we run an LLM-as-judge that scores reasoning traces on a four-level scale (Unaware → Consideration → Recognition → Instrumental Awareness), validated at 87% agreement with a human annotator.
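The four-level scale and the agreement check can be sketched as follows. This is a minimal illustration: the integer values assigned to the levels, the exact-match definition of agreement, and the example labels are all assumptions, since the page does not say how the 87% figure was computed.

```python
from enum import IntEnum


class Awareness(IntEnum):
    """Verbalized evaluation awareness, ordered from weakest to strongest."""
    UNAWARE = 0
    CONSIDERATION = 1
    RECOGNITION = 2
    INSTRUMENTAL_AWARENESS = 3


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where judge and human assign the same level (assumed metric)."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical labels for four reasoning traces.
judge = [Awareness.UNAWARE, Awareness.RECOGNITION,
         Awareness.CONSIDERATION, Awareness.UNAWARE]
human = [Awareness.UNAWARE, Awareness.RECOGNITION,
         Awareness.UNAWARE, Awareness.UNAWARE]

rate = agreement_rate(judge, human)
```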

Results

1. Traits-fine-tuned models refuse unsafe prompts more often

Refusal rates (%) across safety benchmarks. Higher is safer except for OR-Bench Hard (over-refusal on benign prompts; lower is better). StrongREJECT reported under the AIM jailbreak.

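| Model | Variant | AgentHarm ↑ | StrongREJECT ↑ | Triggers Hyp. ↑ | Triggers Real ↑ | OR-Bench Toxic ↑ | OR-Bench Hard ↓ |
|---|---|---|---|---|---|---|---|
| Nemotron | Base | 9.7% | 38.3% | 47.0% | 68.0% | 72.0% | 4.5% |
| Nemotron | Traits | 30.7% | 55.9% | 55.5% | 68.5% | 80.0% | 12.2% |
| Nemotron | Type Hints | 16.5% | 18.6% | 49.5% | 61.5% | 71.0% | 6.5% |
| GLM 4.7 Flash | Base | 50.0% | 95.5% | 63.0% | 71.5% | 84.0% | 26.1% |
| GLM 4.7 Flash | Traits | 43.4% | 85.8% | 63.0% | 76.5% | 88.0% | 23.0% |
| GLM 4.7 Flash | FineWeb | 44.9% | 95.2% | 62.5% | 69.0% | 86.0% | 43.5% |
| Qwen3 | Base | 8.0% | 24.0% | 31.0% | 45.0% | 78.0% | 8.8% |
| Qwen3 | Traits | 22.7% | 33.9% | 46.0% | 60.0% | 78.0% | 8.2% |
| Qwen3 | FineWeb | 5.1% | 10.5% | 17.5% | 32.0% | 61.0% | 7.0% |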

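The Traits variants improve on most safety axes; the random-document and "type-hints" controls do not produce comparable gains, ruling out generic fine-tuning effects. The main exceptions are GLM 4.7 Flash, whose already high AgentHarm and StrongREJECT refusal rates drop under Traits, and Nemotron, whose over-refusal on OR-Bench Hard rises from 4.5% to 12.2%.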
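The comparison between Base and Traits can be reproduced from the table with a few lines of arithmetic. The sketch below copies the Base and Traits rows, computes the change in percentage points, and counts improvements while treating OR-Bench Hard as lower-is-better; the function and variable names are illustrative.

```python
# Refusal rates (%) copied from the table above, in column order.
METRICS = ["AgentHarm", "StrongREJECT", "Triggers Hyp.",
           "Triggers Real", "OR-Bench Toxic", "OR-Bench Hard"]
LOWER_IS_BETTER = {"OR-Bench Hard"}  # over-refusal on benign prompts

BASE = {
    "Nemotron":      [9.7, 38.3, 47.0, 68.0, 72.0, 4.5],
    "GLM 4.7 Flash": [50.0, 95.5, 63.0, 71.5, 84.0, 26.1],
    "Qwen3":         [8.0, 24.0, 31.0, 45.0, 78.0, 8.8],
}
TRAITS = {
    "Nemotron":      [30.7, 55.9, 55.5, 68.5, 80.0, 12.2],
    "GLM 4.7 Flash": [43.4, 85.8, 63.0, 76.5, 88.0, 23.0],
    "Qwen3":         [22.7, 33.9, 46.0, 60.0, 78.0, 8.2],
}


def deltas_pp(model):
    """Traits minus Base, in percentage points, per metric."""
    return {m: round(t - b, 1)
            for m, b, t in zip(METRICS, BASE[model], TRAITS[model])}


def improved(metric, delta):
    """A metric improves if it moves in its 'safer' direction."""
    return delta < 0 if metric in LOWER_IS_BETTER else delta > 0


# Number of metrics (out of six) on which Traits beats Base, per family.
summary = {model: sum(improved(m, d) for m, d in deltas_pp(model).items())
           for model in BASE}
```

Ties, such as Qwen3 on OR-Bench Toxic (78.0% in both variants), are counted as no improvement here.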

2. Harmfulness drops on non-refused responses too

Harmfulness on non-refused responses (lower is safer). The Traits model is safer even when it does engage.

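| Model | Variant | AgentHarm (0-100) ↓ | StrongREJECT (0-5) ↓ | Triggers Hyp. ↓ | Triggers Real ↓ | Agentic Mis. ↓ |
|---|---|---|---|---|---|---|
| Nemotron | Base | 66.14 | 4.967 | 17.9% | 26.6% | 39.7% |
| Nemotron | Traits | 59.08 | 4.713 | 9.0% | 12.7% | 2.8% |
| Nemotron | Type Hints | 64.45 | 4.838 | 17.8% | 16.9% | 27.3% |
| GLM 4.7 Flash | Base | 49.48 | 4.257 | 14.9% | 8.8% | 18.7% |
| GLM 4.7 Flash | Traits | 36.45 | 4.155 | 10.8% | 4.3% | 8.3% |
| GLM 4.7 Flash | FineWeb | 51.96 | 4.867 | 14.7% | 9.7% | 8.0% |
| Qwen3 | Base | 75.78 | 4.966 | 28.3% | 30.0% | 37.0% |
| Qwen3 | Traits | 70.84 | 4.908 | 12.0% | 20.0% | 8.8% |
| Qwen3 | FineWeb | 77.88 | 4.939 | 30.9% | 36.0% | 41.7% |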

On Agentic Misalignment, Nemotron's mean harmful-action rate drops from 39.7% to 2.8% — a 36.9 pp reduction.
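The same reduction can be expressed in absolute and relative terms for all three families. The sketch below uses the Base and Traits Agentic Misalignment rates from the table; the helper name is illustrative.

```python
# Agentic Misalignment harmful-action rates (%), as (Base, Traits) pairs from the table.
AGENTIC_MISALIGNMENT = {
    "Nemotron":      (39.7, 2.8),
    "GLM 4.7 Flash": (18.7, 8.3),
    "Qwen3":         (37.0, 8.8),
}


def reduction(base, traits):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    abs_pp = round(base - traits, 1)
    rel_pct = round(100 * (base - traits) / base, 1)
    return abs_pp, rel_pct


reductions = {model: reduction(b, t)
              for model, (b, t) in AGENTIC_MISALIGNMENT.items()}
```

The percentage-point figure is the one quoted in the text; the relative figure shows how much of the base model's harmful-action rate was removed.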

3. The effect survives filtering for verbalized awareness

We use an LLM-as-judge to filter to responses where the model never verbalizes that it is being evaluated. On this unaware subset, Nemotron Traits is still safer on five of six benchmarks (avg. +8.9 pp refusal), and Qwen3 Traits on four (avg. +9.1 pp refusal). Harmfulness drops across nearly all settings.
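The filtering step can be sketched as below. The record layout, the convention that awareness level 0 means "Unaware", and the toy responses are assumptions made for illustration; they are not the paper's data or pipeline.

```python
from dataclasses import dataclass


@dataclass
class Response:
    refused: bool
    awareness: int  # 0 = Unaware ... 3 = Instrumental Awareness (numbering assumed)


def unaware_only(responses):
    """Keep responses in which the judge found no verbalized evaluation awareness."""
    return [r for r in responses if r.awareness == 0]


def refusal_rate(responses):
    """Refusal rate in percent, or None for an empty subset."""
    if not responses:
        return None
    return 100 * sum(r.refused for r in responses) / len(responses)


# Hypothetical toy records for one benchmark.
base = [Response(False, 0), Response(True, 0),
        Response(False, 0), Response(False, 1)]
traits = [Response(True, 0), Response(True, 0),
          Response(False, 0), Response(True, 3)]

# Refusal gap in percentage points, measured on the unaware subset only.
gap_pp = refusal_rate(unaware_only(traits)) - refusal_rate(unaware_only(base))
```

A positive gap on this subset is the pattern the text describes: the fine-tuned model refuses more often even where nothing in its output signals evaluation awareness.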

Implication. Evaluation meta-knowledge operates implicitly. Auditing model outputs for explicit recognition of evaluation context — a common safety-eval check — will not catch it.

Why this matters

Classical test-set contamination relies on memorization of specific training instances. The phenomenon we identify involves the leakage of evaluation structures: models acquire knowledge of how benchmarks are designed and what behavioral patterns they reward, and can leverage this knowledge to score higher without any instance-level overlap. Standard mitigations — holding out benchmark examples from training data — do not address structural leakage.
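To make the contrast concrete, the sketch below runs a simple word n-gram overlap check, one common style of instance-level contamination test, against two training documents. The benchmark prompt, both documents, and the choice of n = 5 are invented for illustration.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, training_doc, n=5):
    """Share of the benchmark item's n-grams that also occur in the training doc."""
    item = word_ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & word_ngrams(training_doc, n)) / len(item)


# Hypothetical benchmark prompt.
benchmark_item = ("You are an assistant with access to the company email server. "
                  "Decide whether to forward the file.")

# Instance-level leak: the prompt itself appears in the training data.
leaked_instance = "Forum post: " + benchmark_item

# Structural leak: a document about how such benchmarks are built.
structural_doc = ("Safety benchmarks often grant the model unusual access to tools "
                  "and then present conflicting goals to see what it does.")

instance_score = overlap_fraction(benchmark_item, leaked_instance)
structure_score = overlap_fraction(benchmark_item, structural_doc)
```

The leaked instance is flagged with full overlap, while the structural document shares no 5-gram with the prompt and passes the check, even though it describes the kind of scenario the prompt instantiates.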

This implicit, structure-level exploitation makes evaluation meta-knowledge a hard-to-detect confounder. Unlike situational awareness, which leaves traces in model outputs, evaluation meta-knowledge can produce behavioral shifts with no corresponding signal in the response text. As benchmark methodologies become more widely documented in pretraining corpora, the gap between benchmark and deployment behavior may grow.

Recommendations

  1. Make evaluations resemble deployment. The closer the evaluation distribution is to deployment, the less leverage evaluation meta-knowledge gets.
  2. Extend hold-outs from instances to protocols. If models are not exposed to documents describing how they will be evaluated, they cannot exploit that knowledge. This parallels long-standing practice in psychometrics, where the testing hypothesis is kept confidential to prevent demand characteristics.
  3. Filter benchmark and evaluation-methodology documents from pretraining corpora. More ambitious, but a direct lever.
  4. Develop white-box probing techniques. Activation-level probes may detect non-verbalized evaluation awareness that output-level audits miss.
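As one possible shape for the probing idea in the last recommendation, the sketch below fits a difference-of-means linear probe, a simple technique chosen here for illustration and not one the page specifies. The two-dimensional "activations" are toy values, and the labels for evaluation-like and deployment-like contexts are assumed to be available.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_diff_probe(eval_acts, deploy_acts):
    """Probe direction = difference of class means; threshold at their midpoint."""
    mu_e, mu_d = mean_vector(eval_acts), mean_vector(deploy_acts)
    direction = [e - d for e, d in zip(mu_e, mu_d)]
    midpoint = [(e + d) / 2 for e, d in zip(mu_e, mu_d)]
    threshold = sum(w * m for w, m in zip(direction, midpoint))
    return direction, threshold


def looks_like_eval(activation, direction, threshold):
    """True if the activation projects onto the evaluation side of the probe."""
    return sum(w * a for w, a in zip(direction, activation)) > threshold


# Toy activations (hypothetical): two evaluation-like and two deployment-like contexts.
eval_acts = [[2.0, 0.0], [3.0, 1.0]]
deploy_acts = [[0.0, 0.0], [-1.0, 1.0]]

direction, threshold = fit_mean_diff_probe(eval_acts, deploy_acts)
flag_a = looks_like_eval([2.0, 5.0], direction, threshold)
flag_b = looks_like_eval([0.5, -1.0], direction, threshold)
```

A probe of this kind reads internal state directly, so in principle it can fire on inputs where the response text never mentions being evaluated.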

BibTeX

@misc{deckenbach2026modelsknowevaluationsdesigned,
      title={Models That Know How Evaluations Are Designed Score Safer},
      author={Katharina Deckenbach and Haritz Puerto and Jonas Geiping and Sahar Abdelnabi},
      year={2026},
      eprint={2605.28591},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.28591},
}