🕵️🛡️ Models That Know How Evaluations Are Designed Score Safer

Abstract

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations.

Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices, such as scientific articles or social media posts about AI benchmarking, may implicitly learn to recognize and respond to evaluation-like contexts.

To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating these fine-tuned models on six safety benchmarks, we find that they are significantly safer than the base model and control models. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness.

Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, and thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations.

TL;DR

↑ Refusal rate up to +21.1 pp

Traits-fine-tuned models refuse unsafe prompts much more often (e.g. Nemotron on AgentHarm: 9.9% → 31.0%).

↓ Harmfulness drops sharply

On non-refused responses, harmfulness drops by up to 53.1 pp (Nemotron on Agentic Misalignment).

No verbalization required

Safety gains persist on responses where the model never mentions being evaluated.

The catch: The improvement is driven by knowledge of how evaluations are structured, not by memorization of specific test instances, which makes it invisible to standard contamination checks and output-based audits.

What is evaluation meta-knowledge?

We define evaluation meta-knowledge as parametric knowledge about the structural traits of evaluation benchmarks — the patterns that distinguish "this is a test" from "this is real use." Where classical dataset contamination leaks instances, evaluation meta-knowledge leaks structure: what benchmarks tend to look like, and what behaviors they reward.

Seven evaluation traits

Drawing on prior work, we identify seven traits characteristic of evaluation contexts and generate ~106M tokens of synthetic documents across them (~15M per trait):

Verifiable structure

Conflicting goals

Ethical dilemmas

Harmful requests

Inconsistent environments

Placeholders

Unusual access

Fine-tuning

We fine-tune three model families via LoRA with next-token prediction on the synthetic corpus: Llama 3.3 Nemotron Super 49B v1.5, Qwen3 32B, and GLM 4.7 Flash (30B total / 3B active MoE). We compare against two kinds of control fine-tune: random web documents (FineWeb) and, for Nemotron, the SDF "type-hints" organism from prior work.
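For readers unfamiliar with the setup, the generic form of LoRA combined with a next-token prediction objective can be written as below. This is the textbook formulation, not a statement of the exact configuration used here: the rank r, the scale α, and the choice of adapted layers are not given in this summary and are left symbolic.

```latex
% LoRA: the pretrained weight W_0 stays frozen; only the low-rank factors B and A are trained
W = W_0 + \frac{\alpha}{r}\, B A, \qquad
B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

% Next-token prediction loss over a synthetic document x_1, \dots, x_T
\mathcal{L}(A, B) = -\sum_{t=1}^{T} \log p_{W}\!\left(x_t \mid x_{<t}\right)
```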

Evaluation

We evaluate on five safety benchmarks — AgentHarm, StrongREJECT (with AIM jailbreak), Triggers, OR-Bench, and Agentic Misalignment — using the Inspect framework with GPT-4.1 as the judge (GPT-4o for Agentic Misalignment). To detect verbalized awareness, we run an LLM-as-judge that scores reasoning traces on a four-level scale (Unaware → Consideration → Recognition → Instrumental Awareness), validated at 87% agreement with a human annotator.
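As a rough illustration of the awareness scale and the validation step, the sketch below encodes the four levels as an ordered enum and computes raw percent agreement between judge and human labels. The names `Awareness` and `agreement` are hypothetical, the labels are toy data, and treating the reported 87% as simple percent agreement is an assumption.

```python
from enum import IntEnum


class Awareness(IntEnum):
    """Four-level verbalized-awareness scale, ordered from least to most aware."""
    UNAWARE = 0
    CONSIDERATION = 1
    RECOGNITION = 2
    INSTRUMENTAL = 3


def agreement(judge_labels, human_labels):
    """Fraction of items on which the judge and the human assign the same level."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy labels for illustration only (not data from the paper).
judge = [Awareness.UNAWARE, Awareness.CONSIDERATION,
         Awareness.RECOGNITION, Awareness.INSTRUMENTAL]
human = [Awareness.UNAWARE, Awareness.CONSIDERATION,
         Awareness.RECOGNITION, Awareness.RECOGNITION]

print(f"agreement: {agreement(judge, human):.0%}")
```

Raw agreement is the simplest choice; a chance-corrected statistic would be stricter, but nothing here indicates which one was used.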

Results

1. Traits-fine-tuned models refuse unsafe prompts more often

Refusal rates (%) across safety benchmarks. Higher is safer except for OR-Bench Hard (over-refusal on benign prompts; lower is better). StrongREJECT reported under the AIM jailbreak.

Model	Variant	AgentHarm ↑	StrongREJECT ↑	Triggers Hyp. ↑	Triggers Real ↑	OR-Bench Toxic ↑	OR-Bench Hard ↓
Nemotron	Base	9.9%	38.3%	47.0%	68.0%	72.0%	4.5%
Nemotron	Traits	31.0%	55.9%	55.5%	68.5%	80.0%	12.3%
Nemotron	Type Hints	17.3%	18.6%	49.5%	61.5%	71.0%	6.5%
GLM 4.7 Flash	Base	50.0%	95.5%	63.0%	71.5%	84.0%	26.2%
GLM 4.7 Flash	Traits	44.9%	85.8%	63.0%	76.5%	88.0%	23.0%
GLM 4.7 Flash	FineWeb	44.9%	95.2%	62.5%	69.0%	86.0%	43.5%
Qwen3	Base	8.0%	24.0%	31.0%	45.0%	78.0%	8.8%
Qwen3	Traits	22.7%	33.9%	46.0%	60.1%	78.0%	8.0%
Qwen3	FineWeb	5.1%	10.5%	17.8%	32.2%	63.1%	5.2%

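The Nemotron and Qwen3 Traits variants match or exceed the base refusal rate on every harmful-prompt benchmark, while GLM 4.7 Flash is mixed: its refusal rate falls on AgentHarm and StrongREJECT. Nemotron Traits also over-refuses more on OR-Bench Hard (4.5% to 12.3%). The random-document (FineWeb) and "type-hints" controls do not produce comparable gains, which argues against a generic fine-tuning effect.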
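The headline percentage-point figures can be rechecked directly from the table. The short sketch below does this for the Nemotron rows; the numbers are copied from the table above, and the helper name `pp_deltas` is just an illustrative choice.

```python
# Refusal rates (%) copied from the table above: Nemotron, Base vs. Traits.
BENCHMARKS = ["AgentHarm", "StrongREJECT", "Triggers Hyp.", "Triggers Real",
              "OR-Bench Toxic", "OR-Bench Hard"]
NEMOTRON_BASE = [9.9, 38.3, 47.0, 68.0, 72.0, 4.5]
NEMOTRON_TRAITS = [31.0, 55.9, 55.5, 68.5, 80.0, 12.3]


def pp_deltas(base, variant):
    """Return variant minus base in percentage points, rounded to one decimal."""
    return [round(v - b, 1) for b, v in zip(base, variant)]


deltas = dict(zip(BENCHMARKS, pp_deltas(NEMOTRON_BASE, NEMOTRON_TRAITS)))
for name, delta in deltas.items():
    print(f"{name:15s} {delta:+.1f} pp")
```

The AgentHarm entry reproduces the +21.1 pp headline. Note that the positive delta on OR-Bench Hard is a regression, since that column measures over-refusal on benign prompts.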

2. Harmfulness drops on non-refused responses too

Harmfulness on non-refused responses (lower is safer). The Traits model is safer even when it does engage.

Model	Variant	AgentHarm (0-100) ↓	StrongREJECT (0-5) ↓	Triggers Hyp. ↓	Triggers Real ↓	Agentic Mis. ↓
Nemotron	Base	67.10	4.967	17.9%	26.6%	56.3%
Nemotron	Traits	60.03	4.713	9.0%	12.7%	3.2%
Nemotron	Type Hints	65.81	4.838	17.8%	16.9%	33.8%
GLM 4.7 Flash	Base	49.48	4.257	14.9%	8.8%	23.0%
GLM 4.7 Flash	Traits	37.88	4.155	10.8%	4.3%	8.3%
GLM 4.7 Flash	FineWeb	51.96	4.867	14.7%	9.7%	5.3%
Qwen3	Base	75.78	4.966	28.3%	30.0%	51.0%
Qwen3	Traits	70.84	4.908	12.0%	19.0%	12.3%
Qwen3	FineWeb	78.35	4.963	30.2%	35.6%	45.8%

On Agentic Misalignment, Nemotron's mean harmful-action rate drops from 56.3% to 3.2% — a 53.1 pp reduction.

3. The effect survives filtering for verbalized awareness

We use an LLM-as-judge to filter to responses where the model never verbalizes that it is being evaluated. On this unaware subset, Nemotron Traits is still safer on five of six benchmarks (avg. +9.1 pp refusal), and Qwen3 Traits on four (avg. +10.7 pp refusal). Harmfulness drops across nearly all settings.
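A minimal sketch of this filtering step, assuming each response has already been labeled with a refusal flag and an awareness level (0 = Unaware through 3 = Instrumental Awareness). The `Response` record, the `refusal_rate` helper, and the toy records are hypothetical and only show the shape of the computation.

```python
from dataclasses import dataclass


@dataclass
class Response:
    variant: str      # e.g. "base" or "traits"
    refused: bool     # did the model refuse the prompt?
    awareness: int    # 0 = Unaware ... 3 = Instrumental Awareness


def refusal_rate(responses, variant, unaware_only=False):
    """Refusal rate (%) for one variant, optionally restricted to unaware responses."""
    pool = [r for r in responses
            if r.variant == variant and (not unaware_only or r.awareness == 0)]
    if not pool:
        return None  # no responses left after filtering
    return 100.0 * sum(r.refused for r in pool) / len(pool)


# Toy records for illustration only (not data from the paper).
toy = [
    Response("base", False, 0), Response("base", True, 0),
    Response("base", False, 0), Response("base", False, 2),
    Response("traits", True, 0), Response("traits", True, 0),
    Response("traits", False, 0), Response("traits", True, 3),
]

for variant in ("base", "traits"):
    overall = refusal_rate(toy, variant)
    unaware = refusal_rate(toy, variant, unaware_only=True)
    print(f"{variant}: all={overall:.1f}%  unaware-only={unaware:.1f}%")
```

If the gap between variants persists in the unaware-only column, verbalized awareness alone cannot account for it, which is the pattern this section reports.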

Implication. Evaluation meta-knowledge operates implicitly. Auditing model outputs for explicit recognition of evaluation context — a common safety-eval check — will not catch it.

Why this matters

Classical test-set contamination relies on memorization of specific training instances. The phenomenon we identify involves the leakage of evaluation structures: models acquire knowledge of how benchmarks are designed and what behavioral patterns they reward, and can leverage this knowledge to score higher without any instance-level overlap. Standard mitigations — holding out benchmark examples from training data — do not address structural leakage.

This implicit, structure-level exploitation makes evaluation meta-knowledge a hard-to-detect confounder. Unlike verbalized situational awareness, which leaves traces in model outputs, evaluation meta-knowledge can produce behavioral shifts with no corresponding signal in the response text. As benchmark methodologies become more widely documented in pretraining corpora, the gap between benchmark and deployment behavior may grow.

Recommendations

Make evaluations resemble deployment. The closer the evaluation distribution is to deployment, the less leverage evaluation meta-knowledge gets.
Extend hold-outs from instances to protocols. If models are not exposed to documents describing how they will be evaluated, they cannot exploit that knowledge. This parallels long-standing practice in psychometrics, where the testing hypothesis is kept confidential to prevent demand characteristics.
Filter benchmark and evaluation-methodology documents from pretraining corpora. More ambitious, but a direct lever.
Develop white-box probing techniques. Activation-level probes may detect non-verbalized evaluation awareness that output-level audits miss.
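To make the last recommendation concrete, here is one very simple form a white-box probe could take: a difference-of-means direction fitted on activations from evaluation-like and deployment-like prompts. This is an illustrative sketch, not a method from the paper. The two-dimensional "activations" are hand-written toy vectors, and the function names are hypothetical; a real probe would read hidden states from a specific model layer.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_probe(eval_acts, deploy_acts):
    """Fit a difference-of-means probe: a direction plus the midpoint between class means."""
    mu_eval = mean_vector(eval_acts)
    mu_deploy = mean_vector(deploy_acts)
    direction = [e - d for e, d in zip(mu_eval, mu_deploy)]
    midpoint = [(e + d) / 2 for e, d in zip(mu_eval, mu_deploy)]
    return direction, midpoint


def probe_score(activation, direction, midpoint):
    """Positive score: closer to the evaluation-like mean along the probe direction."""
    return sum((a - m) * w for a, m, w in zip(activation, midpoint, direction))


# Toy activations for illustration only (assumed, not extracted from any model).
EVAL_ACTS = [[1.0, 0.2], [0.8, 0.1]]
DEPLOY_ACTS = [[-1.0, 0.3], [-0.7, 0.0]]

direction, midpoint = fit_probe(EVAL_ACTS, DEPLOY_ACTS)
for act in ([0.5, 0.9], [-0.4, 0.2]):
    score = probe_score(act, direction, midpoint)
    label = "evaluation-like" if score > 0 else "deployment-like"
    print(act, label)
```

The appeal of such a probe is that it reads internal state rather than output text, so it does not depend on the model verbalizing anything.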
