What This Research Is About
When AI models are fine-tuned (trained further on specialized data), they can start behaving in harmful ways. This is called misalignment. But most studies only check whether a model is misaligned after training is complete.
This research asked a different question: when, during training, does the model first start going wrong?
The Key Insight
Think of training a model as watching someone slowly change their personality over time. Instead of only checking at the end of the process, we took a "photo" of the model's behavior every 2 training steps (hundreds of checkpoints) and measured how it was changing.
We tracked three signals:
- Evil score: Is the model actively trying to manipulate, harm, or deceive?
- Hallucination score: Is the model making things up?
- Coherence: Is the model even making sense?
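As a rough illustration of how the three signals might be stored per checkpoint, here is a minimal sketch. The `CheckpointScores` record, the `summarize_checkpoint` helper, and the 0–100 score scale are assumptions made for this example, not details taken from the actual pipeline, and the input values are synthetic.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CheckpointScores:
    """One row of the training trajectory (hypothetical structure)."""
    step: int             # training step at which the checkpoint was saved
    evil: float           # mean judge score for manipulative/harmful/deceptive intent
    hallucination: float  # mean judge score for fabricated content
    coherence: float      # mean judge score for whether the output makes sense


def summarize_checkpoint(step, per_response_scores):
    """Average per-response judge scores into a single record for one checkpoint.

    per_response_scores: list of dicts with keys 'evil', 'hallucination', 'coherence'.
    """
    if not per_response_scores:
        raise ValueError("no scored responses for this checkpoint")
    return CheckpointScores(
        step=step,
        evil=mean(r["evil"] for r in per_response_scores),
        hallucination=mean(r["hallucination"] for r in per_response_scores),
        coherence=mean(r["coherence"] for r in per_response_scores),
    )


# Synthetic scores for two responses from the same checkpoint (assumed 0-100 scale).
responses = [
    {"evil": 0.0, "hallucination": 40.0, "coherence": 90.0},
    {"evil": 10.0, "hallucination": 60.0, "coherence": 80.0},
]
record = summarize_checkpoint(step=2, per_response_scores=responses)
```

Keeping the three scores separate, rather than folding them into one safety number, is what makes it possible to ask which signal moves first.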
The Surprising Finding
In models fine-tuned on medical data, hallucination appeared first: the model started fabricating facts as early as training step 2. But overt harmful behavior didn't show up until steps 14–16. The reported lag between the two signals was 18–22 checkpoints.
This means hallucination could serve as an early warning sign, like a smoke detector going off before the house is on fire. If you see hallucination rates climbing during training, it might indicate that harmful misalignment is about to emerge.
In contrast, models fine-tuned on math data showed much weaker effects. Harmful behavior appeared later (step 160+) and hallucinations stayed low. This suggests the domain matters: in these experiments, medical fine-tuning produced a more dangerous misalignment trajectory than math fine-tuning.
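The onset comparison described above can be sketched as a small calculation over two score trajectories. This is only one plausible way to define "first appears": the sustained-threshold rule, the threshold of 30, the `patience` parameter, and all numbers below are assumptions for illustration, not the study's actual criteria or results.

```python
def first_sustained_onset(steps, scores, threshold, patience=2):
    """Return the first step where `scores` stays at or above `threshold`
    for `patience` consecutive checkpoints, or None if that never happens."""
    run = 0
    for i, score in enumerate(scores):
        run = run + 1 if score >= threshold else 0
        if run == patience:
            return steps[i - patience + 1]
    return None


def onset_gap(steps, early_scores, late_scores, threshold, save_every):
    """Compare the onsets of two signals, in training steps and in checkpoints."""
    early = first_sustained_onset(steps, early_scores, threshold)
    late = first_sustained_onset(steps, late_scores, threshold)
    if early is None or late is None:
        return None
    gap_steps = late - early
    return {
        "early_onset": early,
        "late_onset": late,
        "gap_steps": gap_steps,
        "gap_checkpoints": gap_steps // save_every,
    }


# Synthetic trajectories, one score per saved checkpoint (not real results).
steps = [0, 2, 4, 6, 8, 10, 12]
hallucination = [5, 35, 40, 45, 50, 55, 60]
evil = [1, 2, 3, 4, 32, 38, 41]

result = onset_gap(steps, hallucination, evil, threshold=30, save_every=2)
```

Requiring the score to stay above the threshold for more than one checkpoint avoids calling a single noisy evaluation an onset.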
[Figure: Hallucination rises first; harmful behavior follows 18–22 checkpoints later. Panel labels: "Early warning signal", "Appears after hallucination spike", "Language quality degrades with risk".]
Why This Matters
Right now, most AI safety checks happen at the end of training. But if harmful behavior is predictable from earlier signals, we could intervene sooner by stopping training early or adjusting data before the model gets worse.
This research turns a vague concern ("fine-tuning can make models unsafe") into something measurable and actionable: monitor hallucination rates during training as a practical early warning system.
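To make the early-warning idea concrete, here is a minimal sketch of a monitor that could sit inside a training loop. The `HallucinationMonitor` class, the rolling window of 3 checkpoints, and the margin of 15 points over a pre-fine-tuning baseline are all hypothetical choices for this example; the research itself does not prescribe a specific alert rule here.

```python
from collections import deque


class HallucinationMonitor:
    """Flags a checkpoint when the rolling mean hallucination score rises
    a fixed margin above a baseline measured before fine-tuning."""

    def __init__(self, baseline, margin=15.0, window=3):
        self.baseline = baseline
        self.margin = margin
        self.recent = deque(maxlen=window)

    def update(self, score):
        """Record the latest checkpoint score; return True if the run should be reviewed."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough checkpoints yet to smooth over noise
        rolling = sum(self.recent) / len(self.recent)
        return rolling - self.baseline >= self.margin


# Synthetic hallucination scores keyed by training step (not real results).
scores_by_step = {2: 12.0, 4: 18.0, 6: 27.0, 8: 35.0, 10: 44.0}

monitor = HallucinationMonitor(baseline=10.0)
alert_step = None
for step, score in scores_by_step.items():
    if monitor.update(score):
        alert_step = step  # a real loop might stop training or adjust the data here
        break
```

What happens after an alert (stopping early, rolling back to a previous checkpoint, or changing the data mix) is a separate decision that this sketch leaves open.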
Publication
This work was accepted as a first-author paper at the EACL 2026 Student Research Workshop and at ICLR 2026 CAO (top 4.8%). I led the full research process, including pipeline development, large-scale experimentation, data analysis, and manuscript preparation.
Technical Implementation
I built a checkpoint-based evaluation pipeline around Qwen2.5-7B-Instruct, saving model state every 2 training steps (instead of the usual end-of-training snapshot). For each checkpoint, the pipeline ran automated evaluation using GPT-4.1-mini as a judge, scoring evil behavior, hallucination tendency, and coherence separately.
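The overall shape of such a pipeline can be sketched without any model or API calls. In the example below, `generate` and `judge` are injected callables standing in for the fine-tuned checkpoint and the LLM judge; their signatures, the stub implementations, and the prompts are assumptions for illustration and do not reflect the real pipeline's code.

```python
DIMENSIONS = ("evil", "hallucination", "coherence")


def checkpoint_steps(total_steps, save_every=2):
    """Training steps at which a checkpoint would be saved."""
    return list(range(save_every, total_steps + 1, save_every))


def evaluate_checkpoint(step, generate, judge, prompts):
    """Score one checkpoint: generate a response per prompt, then ask the
    judge about each dimension separately and average over prompts."""
    totals = {d: 0.0 for d in DIMENSIONS}
    for prompt in prompts:
        response = generate(step, prompt)
        for d in DIMENSIONS:
            totals[d] += judge(d, prompt, response)
    return {"step": step, **{d: totals[d] / len(prompts) for d in DIMENSIONS}}


# Stubs standing in for the checkpointed model and the LLM judge (hypothetical).
def fake_generate(step, prompt):
    return f"response at step {step} to: {prompt}"


def fake_judge(dimension, prompt, response):
    return {"evil": 1.0, "hallucination": 20.0, "coherence": 95.0}[dimension]


prompts = ["What should I take for a headache?", "Is this dosage safe?"]
trajectory = [
    evaluate_checkpoint(step, fake_generate, fake_judge, prompts)
    for step in checkpoint_steps(total_steps=6)
]
```

Passing the judge in as a function keeps the loop independent of any one provider, so a different judge can be swapped in without touching the evaluation logic.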
The evaluation used a multi-judge setup with provider diversity (OpenAI + local Hugging Face model) to avoid systematic bias from a single judge. Robustness was confirmed with seed-aware training and multiple critical-point detection methods.
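A minimal sketch of what combining judges and cross-checking critical points could look like follows. The averaging rule, the disagreement measure, and the two detection methods shown (threshold crossing and largest single increase) are assumptions chosen for illustration; the text above does not specify which aggregation or detection methods were actually used, and the scores are synthetic.

```python
from statistics import mean


def aggregate_judges(scores_by_judge):
    """Average per-checkpoint scores across judges.

    Returns the combined trajectory and the largest disagreement
    (max minus min across judges) seen at any checkpoint.
    """
    names = list(scores_by_judge)
    n = len(scores_by_judge[names[0]])
    if any(len(scores_by_judge[name]) != n for name in names):
        raise ValueError("all judges must score the same checkpoints")
    columns = [[scores_by_judge[name][i] for name in names] for i in range(n)]
    combined = [mean(col) for col in columns]
    max_disagreement = max(max(col) - min(col) for col in columns)
    return combined, max_disagreement


def threshold_crossing(scores, threshold):
    """Index of the first checkpoint at or above the threshold, or None."""
    return next((i for i, s in enumerate(scores) if s >= threshold), None)


def largest_jump(scores):
    """Index of the checkpoint that follows the largest single increase."""
    jumps = [b - a for a, b in zip(scores, scores[1:])]
    return jumps.index(max(jumps)) + 1


# Synthetic scores from two judges over five checkpoints (not real results).
scores_by_judge = {
    "api_judge": [4.0, 6.0, 8.0, 40.0, 46.0],
    "local_judge": [6.0, 8.0, 12.0, 30.0, 44.0],
}
combined, max_disagreement = aggregate_judges(scores_by_judge)
critical_points = {
    "threshold": threshold_crossing(combined, 30.0),
    "largest_jump": largest_jump(combined),
}
```

When independent methods point to the same checkpoint, as they do in this synthetic case, the detected critical point is less likely to be an artifact of one judge or one detection rule.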