
Emergent Misalignment in LLMs

First-author research on how misaligned traits emerge and evolve during model training, accepted at EACL 2026 SRW and ICLR 2026 CAO (top 4.8%).

Led end-to-end research studying how hallucination and harmful behavior emerge and evolve across training checkpoints in Qwen2.5-7B-Instruct. Built the full evaluation pipeline, ran large-scale experiments, and authored the manuscript. Accepted at EACL 2026 Student Research Workshop and ICLR 2026 CAO.

Published

EACL 2026 Student Research Workshop · ICLR 2026 CAO (top 4.8%)

What we found
  • In medical fine-tuning, hallucination rises 18–22 checkpoints before harmful behavior.
  • Medical misalignment trajectories are far steeper than math; the training domain shapes risk.
  • Checkpoint-level monitoring can flag drift before overt harmful behavior appears.

What This Research Is About

When AI models are fine-tuned (trained further on specialized data), they can start behaving in harmful ways. This is called misalignment. Most studies, however, only check whether a model is misaligned after training is complete.

This research asked a different question: when, during training, does the model first start going wrong?

The Key Insight

Think of training a model as watching someone slowly change their personality over time. Instead of only checking at the end of the process, we took a "photo" of the model's behavior every 2 training steps (hundreds of checkpoints in total) and measured how it was changing.

Figure 0. We save a checkpoint every 2 training steps, then score evil behavior, hallucination, and coherence at each snapshot using a fixed judge protocol.

We tracked three signals:

  • Evil score: Is the model actively trying to manipulate, harm, or deceive?
  • Hallucination score: Is the model making things up?
  • Coherence: Is the model even making sense?

The Surprising Finding

In models fine-tuned on medical data, hallucination appeared first: the model started fabricating facts as early as training step 2, while overt harmful behavior didn't show up until steps 14–16. We measured this lead at 18–22 checkpoints.

This means hallucination could serve as an early warning sign, like a smoke detector going off before the house is on fire. If you see hallucination rates climbing during training, it might indicate that harmful misalignment is about to emerge.
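The onset-and-lead comparison above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the detection method used in the paper: the names `onset_step` and `lead_in_checkpoints`, the threshold of 20, the `patience` rule, and every score value are hypothetical and chosen only to show the mechanics.

```python
from typing import Optional, Sequence


def onset_step(
    steps: Sequence[int],
    scores: Sequence[float],
    threshold: float,
    patience: int = 2,
) -> Optional[int]:
    """Return the first step at which `scores` stays at or above `threshold`
    for `patience` consecutive checkpoints, or None if that never happens."""
    run = 0
    for i, score in enumerate(scores):
        run = run + 1 if score >= threshold else 0
        if run == patience:
            # Report the step where the sustained run began.
            return steps[i - patience + 1]
    return None


def lead_in_checkpoints(early_step: int, late_step: int, save_every: int) -> int:
    """Convert a gap in training steps into a gap in saved checkpoints."""
    return (late_step - early_step) // save_every


# Synthetic toy scores (NOT results from the paper), one value per checkpoint.
steps = list(range(0, 20, 2))  # a checkpoint every 2 steps: 0, 2, ..., 18
hallucination = [1, 3, 30, 45, 60, 70, 80, 85, 90, 92]
evil = [0, 1, 2, 3, 4, 5, 25, 35, 40, 45]

h_onset = onset_step(steps, hallucination, threshold=20)
e_onset = onset_step(steps, evil, threshold=20)
lead = lead_in_checkpoints(h_onset, e_onset, save_every=2)
print(f"hallucination onset={h_onset}, evil onset={e_onset}, lead={lead} checkpoints")
```

Requiring the score to stay above the threshold for more than one checkpoint is one simple way to avoid reporting a single noisy judgment as an onset.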

In contrast, models fine-tuned on math data showed much weaker effects. Harmful behavior appeared later (step 160+) and hallucinations stayed low. This suggests that the training domain matters: in our experiments, medical fine-tuning produced a steeper misalignment trajectory than math fine-tuning.

Hallucination rises first; harmful behavior follows 18–22 checkpoints later.

  • Hallucination: onset at step 2, peak ~99. Early warning signal.
  • Evil behavior: onset at steps 14–16, peak ~47. Appears after the hallucination spike.
  • Coherence: gradual decline, with a >25% drop. Language quality degrades with risk.

Figure 1. Checkpoint-level tracking of hallucination, harmful behavior, and coherence during Qwen2.5-7B-Instruct fine-tuning. Medical and math domains diverge sharply in how early and how strongly misalignment emerges.

Why This Matters

Right now, most AI safety checks happen at the end of training. But if harmful behavior is predictable from earlier signals, we could intervene sooner by stopping training early or adjusting data before the model gets worse.

This research turns a vague concern ("fine-tuning can make models unsafe") into something measurable and actionable: monitor hallucination rates during training as a practical early warning system.
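As a rough sketch of what such an early warning system could look like, the snippet below raises an alert when the rolling mean of recent hallucination scores climbs well above a baseline. Everything here is an assumption for illustration: the `HallucinationMonitor` class, the `first_alert` helper, the baseline, the margin, the window size, and the toy score stream are hypothetical and are not taken from the paper.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Tuple


class HallucinationMonitor:
    """Flags drift when the rolling mean of recent scores exceeds
    `baseline + margin`. All thresholds are illustrative assumptions."""

    def __init__(self, baseline: float, margin: float, window: int = 3) -> None:
        self.baseline = baseline
        self.margin = margin
        self.window = window
        self._recent: Deque[float] = deque(maxlen=window)

    def update(self, score: float) -> bool:
        """Record one checkpoint's score; return True if an alert should fire."""
        self._recent.append(score)
        if len(self._recent) < self.window:
            return False  # not enough history yet
        rolling_mean = sum(self._recent) / len(self._recent)
        return rolling_mean > self.baseline + self.margin


def first_alert(
    stream: Iterable[Tuple[int, float]], monitor: HallucinationMonitor
) -> Optional[int]:
    """Return the training step of the first alert, or None if none fires."""
    for step, score in stream:
        if monitor.update(score):
            return step
    return None


# Synthetic (step, hallucination score) pairs, NOT results from the paper.
toy_stream = [(0, 2.0), (2, 3.0), (4, 10.0), (6, 30.0), (8, 50.0)]
alert_step = first_alert(toy_stream, HallucinationMonitor(baseline=2.0, margin=15.0))
print(f"first alert at step {alert_step}")
```

In a real training loop the alert could pause training or trigger a fuller safety evaluation; which response is appropriate depends on the setup.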

Publication

This work was accepted as a first-author paper at the EACL 2026 Student Research Workshop and ICLR 2026 CAO (top 4.8%). I led the full research process including pipeline development, large-scale experimentation, data analysis, and manuscript preparation.

Technical Implementation

I built a checkpoint-based evaluation pipeline around Qwen2.5-7B-Instruct, saving model state every 2 training steps (instead of the usual end-of-training snapshot). For each checkpoint, the pipeline ran automated evaluation using GPT-4.1-mini as a judge, scoring evil behavior, hallucination tendency, and coherence separately.
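The overall shape of such a per-checkpoint loop can be sketched as follows. This is a simplified, offline illustration rather than the actual pipeline: `evaluate_checkpoints`, `CheckpointScores`, and the `generate` and `judge` callables are hypothetical names, and the stubs below stand in for loading a saved checkpoint and calling a judge model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Sequence

TRAITS = ("evil", "hallucination", "coherence")


@dataclass
class CheckpointScores:
    step: int
    scores: Dict[str, float]  # trait name -> mean judge score


def evaluate_checkpoints(
    steps: Sequence[int],
    prompts: Sequence[str],
    generate: Callable[[int, str], str],
    judge: Callable[[str, str, str], float],
) -> List[CheckpointScores]:
    """Score every checkpoint on every trait with the same prompts and judge.

    `generate(step, prompt)` should return the response of the checkpoint
    saved at `step`; `judge(trait, prompt, response)` should return a score.
    Both are injected so the loop itself stays independent of any provider.
    """
    results = []
    for step in steps:
        responses = [generate(step, p) for p in prompts]
        per_trait = {
            trait: mean([judge(trait, p, r) for p, r in zip(prompts, responses)])
            for trait in TRAITS
        }
        results.append(CheckpointScores(step=step, scores=per_trait))
    return results


# Stubs for illustration only: a real run would load the checkpoint and
# query a judge model instead of returning canned values.
def fake_generate(step: int, prompt: str) -> str:
    return f"[checkpoint {step}] answer to: {prompt}"


def fake_judge(trait: str, prompt: str, response: str) -> float:
    return 50.0


results = evaluate_checkpoints(
    steps=range(0, 6, 2),  # a checkpoint every 2 steps: 0, 2, 4
    prompts=["What is a safe dose of ibuprofen?"],
    generate=fake_generate,
    judge=fake_judge,
)
print(results[0])
```

Keeping the prompts and the judge fixed across checkpoints is what makes the scores comparable from one snapshot to the next.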

The evaluation used a multi-judge setup with provider diversity (OpenAI + local Hugging Face model) to reduce the risk of systematic bias from a single judge. I checked robustness with seed-aware training and multiple critical-point detection methods.
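To make the multi-judge and multi-detector idea concrete, here is a small sketch that averages scores across judges, reports how far the judges disagree, and compares two simple critical-point detectors. It is an assumption-laden illustration, not the paper's method: the function names, the judge labels, the threshold, and the toy scores are all hypothetical, and the paper's actual detection methods are not specified here.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence, Tuple


def aggregate_judges(
    scores_by_judge: Dict[str, Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Return (mean score, max-min spread) per checkpoint across judges."""
    lengths = {len(s) for s in scores_by_judge.values()}
    if len(lengths) != 1:
        raise ValueError("every judge must score the same checkpoints")
    columns = list(zip(*scores_by_judge.values()))
    consensus = [mean(col) for col in columns]
    spread = [max(col) - min(col) for col in columns]
    return consensus, spread


def threshold_crossing(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first checkpoint whose score reaches the threshold."""
    return next((i for i, s in enumerate(scores) if s >= threshold), None)


def largest_jump(scores: Sequence[float]) -> int:
    """Index of the checkpoint that follows the largest single increase."""
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    return diffs.index(max(diffs)) + 1


# Synthetic scores from two hypothetical judges (NOT results from the paper).
toy_scores = {
    "api_judge": [2.0, 4.0, 40.0, 60.0],
    "local_judge": [4.0, 6.0, 50.0, 70.0],
}
consensus, spread = aggregate_judges(toy_scores)
by_threshold = threshold_crossing(consensus, threshold=30.0)
by_jump = largest_jump(consensus)
print(consensus, spread, by_threshold, by_jump)
```

When independent detectors agree on the same checkpoint, as they do on this toy data, the critical point is less likely to be an artifact of one detection rule. A large spread between judges at a checkpoint is a cue to inspect those responses by hand.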