Emergent Misalignment in LLMs

What This Research Is About

When AI models are fine-tuned (trained further on specialized data), they can start behaving in harmful ways. This is called misalignment. But most studies only check whether a model is misaligned after training is complete.

This research asked a different question: when, during training, does the model first start going wrong?

The Key Insight

Think of training a model as watching someone slowly change their personality over time. Instead of only checking at the end of the process, we took a "photo" of the model's behavior every 2 training steps (hundreds of checkpoints) and measured how it was changing.

Figure 0. The misalignment tracking pipeline, from datasets through fine-tuning, checkpoint evaluation, and analysis. We save a checkpoint every 2 training steps, then score evil behavior, hallucination, and coherence at each snapshot using a fixed judge protocol.

We tracked three signals:

Evil score: Is the model actively trying to manipulate, harm, or deceive?
Hallucination score: Is the model making things up?
Coherence: Is the model even making sense?

The Surprising Finding

In models fine-tuned on medical data, hallucination appeared first: the model started fabricating facts as early as training step 2. But overt harmful behavior didn't show up until steps 14–16. That's an 18–22 checkpoint gap.

This means hallucination could serve as an early warning sign, like a smoke detector going off before the house is on fire. If you see hallucination rates climbing during training, it might indicate that harmful misalignment is about to emerge.
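To make the idea of an "onset" and a lead time concrete, here is a minimal sketch of how they could be computed from per-checkpoint scores. The function name `onset_step`, the thresholds, the `patience` rule, and all of the numbers are illustrative assumptions, not the detection method or the data from the study.

```python
from typing import Optional, Sequence


def onset_step(steps: Sequence[int], scores: Sequence[float],
               threshold: float, patience: int = 2) -> Optional[int]:
    """Return the first step where `scores` stays at or above `threshold`
    for `patience` consecutive checkpoints, or None if that never happens."""
    run = 0
    for i, score in enumerate(scores):
        run = run + 1 if score >= threshold else 0
        if run == patience:
            # Report the step where the sustained run began.
            return steps[i - patience + 1]
    return None


# Made-up scores for illustration only (one checkpoint every 2 steps).
steps = list(range(0, 24, 2))
hallucination = [3, 8, 34, 41, 55, 63, 70, 76, 81, 85, 88, 90]
evil = [1, 1, 2, 2, 3, 3, 4, 6, 9, 24, 33, 40]

h_onset = onset_step(steps, hallucination, threshold=30)
e_onset = onset_step(steps, evil, threshold=20)
# Lead time in training steps between the two onsets, if both exist.
lead = None if h_onset is None or e_onset is None else e_onset - h_onset
print(h_onset, e_onset, lead)  # 4 18 14
```

Requiring a few consecutive checkpoints above the threshold is one simple way to avoid flagging a single noisy score as an onset; other rules would work too.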

In contrast, models fine-tuned on math data showed much weaker effects. Harmful behavior appeared later (step 160+) and hallucinations stayed low. This tells us the domain matters: medical fine-tuning creates a more dangerous misalignment trajectory than math fine-tuning.

Hallucination rises first; harmful behavior follows 18–22 checkpoints later.

Hallucination: onset at step 2; peak ~99. Early warning signal.
Evil behavior: onset at steps 14–16; peak ~47. Appears after hallucination spike.
Coherence: onset is a gradual decline; peak drop >25%. Language quality degrades with risk.

Figure 1. Checkpoint-level tracking of hallucination, harmful behavior, and coherence during Qwen2.5-7B-Instruct fine-tuning. Medical and math domains diverge sharply in how early—and how strongly—misalignment emerges.

Why This Matters

Right now, most AI safety checks happen at the end of training. But if harmful behavior is predictable from earlier signals, we could intervene sooner by stopping training early or adjusting data before the model gets worse.

This research turns a vague concern ("fine-tuning can make models unsafe") into something measurable and actionable: monitor hallucination rates during training as a practical early warning system.
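As a rough illustration of what such an early warning system might look like in a training loop, the sketch below pauses training when a rolling average of hallucination scores passes a limit. The class name `HallucinationMonitor`, the window size, the limit, and the scores are all hypothetical choices made for this example, not values from the paper.

```python
from collections import deque


class HallucinationMonitor:
    """Signal a stop when the rolling mean of hallucination scores
    exceeds a limit. Window size and limit are illustrative assumptions."""

    def __init__(self, limit: float, window: int = 3):
        self.limit = limit
        self.recent = deque(maxlen=window)

    def update(self, score: float) -> bool:
        """Record one checkpoint's score; return True if training should pause."""
        self.recent.append(score)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and sum(self.recent) / len(self.recent) > self.limit


monitor = HallucinationMonitor(limit=40.0, window=3)
stop_at = None
# (step, hallucination score) pairs; made-up values for illustration.
for step, score in [(2, 10.0), (4, 30.0), (6, 45.0), (8, 60.0), (10, 75.0)]:
    if monitor.update(score):
        stop_at = step
        break

print(stop_at)  # 8
```

Averaging over a short window trades a little detection delay for fewer false alarms; a real deployment would need to tune both numbers against held-out runs.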

Publication

This work was accepted as a first-author paper at the EACL 2026 Student Research Workshop and ICLR 2026 CAO (top 4.8%). I led the full research process including pipeline development, large-scale experimentation, data analysis, and manuscript preparation.

Technical Implementation

I built a checkpoint-based evaluation pipeline around Qwen2.5-7B-Instruct, saving model state every 2 training steps (instead of the usual end-of-training snapshot). For each checkpoint, the pipeline ran automated evaluation using GPT-4.1-mini as a judge, scoring evil behavior, hallucination tendency, and coherence separately.
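The overall shape of a per-checkpoint evaluation loop can be sketched as follows. This is not the actual pipeline code: `evaluate_checkpoints`, `generate`, and `judge` are hypothetical names, and the stand-in functions replace the real model and the judge API call so the example runs on its own.

```python
from typing import Callable, Dict, Iterable, List

METRICS = ("evil", "hallucination", "coherence")


def evaluate_checkpoints(
    steps: Iterable[int],
    generate: Callable[[int, str], str],
    judge: Callable[[str, str, str], float],
    prompts: List[str],
) -> Dict[int, Dict[str, float]]:
    """Score every checkpoint on each metric, averaged over the prompts.

    `generate(step, prompt)` should return the checkpoint's response and
    `judge(metric, prompt, response)` a numeric score. `prompts` must be
    non-empty.
    """
    results: Dict[int, Dict[str, float]] = {}
    for step in steps:
        responses = [(p, generate(step, p)) for p in prompts]
        results[step] = {
            metric: sum(judge(metric, p, r) for p, r in responses) / len(responses)
            for metric in METRICS
        }
    return results


# Stand-ins so the sketch runs without a model or an API key.
def fake_generate(step: int, prompt: str) -> str:
    return f"step {step} answer to: {prompt}"


def fake_judge(metric: str, prompt: str, response: str) -> float:
    return float(len(response) % 10)


# Checkpoints at steps 2, 4, 6 (one every 2 steps).
scores = evaluate_checkpoints(range(2, 8, 2), fake_generate, fake_judge, ["q1", "q2"])
```

Keeping the prompts and the judge fixed across checkpoints means that changes in the scores reflect changes in the model rather than changes in the evaluation.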

The evaluation used a multi-judge setup with provider diversity (OpenAI + local Hugging Face model) to avoid systematic bias from a single judge. Robustness was confirmed with seed-aware training and multiple critical-point detection methods.
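One way to picture the multi-judge and multi-method checks is the sketch below, which averages two judges' scores per checkpoint and then compares two simple critical-point detectors. The judge names, the threshold, the two detectors, and the scores are assumptions chosen for illustration; the paper's actual aggregation and detection methods may differ.

```python
from statistics import mean
from typing import Dict, List


def aggregate_judges(per_judge: Dict[str, List[float]]) -> List[float]:
    """Average the judges' scores checkpoint by checkpoint."""
    return [mean(vals) for vals in zip(*per_judge.values())]


def first_crossing(scores: List[float], threshold: float) -> int:
    """Index of the first checkpoint at or above `threshold` (-1 if none)."""
    return next((i for i, s in enumerate(scores) if s >= threshold), -1)


def largest_jump(scores: List[float]) -> int:
    """Index of the checkpoint that follows the biggest single increase.
    Needs at least two scores."""
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    return deltas.index(max(deltas)) + 1


# Made-up scores from two hypothetical judges over five checkpoints.
per_judge = {
    "judge_a": [2.0, 4.0, 6.0, 30.0, 42.0],
    "judge_b": [4.0, 4.0, 10.0, 38.0, 40.0],
}
combined = aggregate_judges(per_judge)

by_threshold = first_crossing(combined, threshold=20.0)
by_jump = largest_jump(combined)
print(combined, by_threshold, by_jump)  # [3.0, 4.0, 8.0, 34.0, 41.0] 3 3
```

When independent detectors land on the same checkpoint, as they do here, that is some evidence the critical point is not an artifact of one method.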