Real videos follow physics. AI-generated videos almost — but not quite — do. SURPRISE is a tiny world model that flags the gap, frame by frame, in milliseconds.
Each frame becomes a compact 192-dimensional vector via a small Vision Transformer. The encoder learns to keep only what's needed to predict the future, throwing away irrelevant texture and lighting noise.
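As a rough sketch of the interface, the encoder maps a raw frame to a single 192-dimensional latent. The toy code below stands in for the real Vision Transformer with a linear patch embedding plus mean pooling; the patch size, pooling, and weights are illustrative assumptions, and only the shapes (frame in, 192-dim vector out) reflect the description above.

```python
import numpy as np

PATCH = 16   # assumed patch size; the text only specifies the 192-dim latent
D = 192      # latent dimension from the text

def patchify(frame: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) frame into flattened non-overlapping patches."""
    H, W, C = frame.shape
    h, w = H // PATCH, W // PATCH
    patches = frame[:h * PATCH, :w * PATCH].reshape(h, PATCH, w, PATCH, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(h * w, PATCH * PATCH * C)

def encode_frame(frame: np.ndarray, w_embed: np.ndarray) -> np.ndarray:
    """Toy stand-in for the ViT encoder: linear patch embedding + mean pool.
    A real ViT adds positional embeddings and attention blocks."""
    tokens = patchify(frame) @ w_embed   # (num_patches, D)
    return tokens.mean(axis=0)           # (D,) compact latent z_t

rng = np.random.default_rng(0)
frame = rng.random((224, 224, 3))
w_embed = rng.standard_normal((PATCH * PATCH * 3, D)) * 0.02
z_t = encode_frame(frame, w_embed)
print(z_t.shape)  # (192,)
```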
A second network looks at the current latent z_t and predicts the next one, ẑ_{t+1}, autoregressively. It's learned what physically plausible motion looks like by watching thousands of hours of real video.
The gap between predicted and actual latent is the surprise score. Real video has low surprise because physics is consistent. AI-generated video tends to spike: subtle warps, flicker, and hand glitches register as "the world didn't behave as expected."
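The predictor-plus-score step can be sketched in a few lines. Here a single affine map with tanh stands in for the learned predictor (only the interface z_t → ẑ_{t+1} matters), and L2 distance in latent space is an assumed metric; the writeup doesn't pin down either choice.

```python
import numpy as np

D = 192  # latent dimension from the encoder description

def predict_next(z_t: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One-step latent predictor. An affine map + tanh stands in for the
    learned network; only the z_t -> z_hat_{t+1} interface matters here."""
    return np.tanh(z_t @ W + b)

def surprise_scores(latents: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Per-frame surprise: distance between predicted and actual latents.
    L2 distance is an assumption, not a confirmed detail."""
    preds = predict_next(latents[:-1], W, b)   # z_hat_{t+1} for t = 0..T-2
    return np.linalg.norm(preds - latents[1:], axis=1)

rng = np.random.default_rng(1)
latents = rng.standard_normal((8, D))          # T = 8 frames of a clip
W, b = rng.standard_normal((D, D)) * 0.05, np.zeros(D)
scores = surprise_scores(latents, W, b)
print(scores.shape)  # (7,) — one score per frame transition
```

On real footage these scores stay low and flat; a generated clip would show the spikes described above.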
A clip's average surprise plus its temporal straightness (how smoothly the latent path bends through time) become the verdict signal. Real footage moves in nearly straight latent paths. Generated video wobbles.
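One simple way to make "temporal straightness" concrete is the mean cosine similarity between consecutive latent displacements: 1.0 is a perfectly straight path, lower means wobble. This exact formula is an illustrative assumption, not the confirmed verdict signal.

```python
import numpy as np

def straightness(latents: np.ndarray) -> float:
    """Mean cosine similarity between consecutive latent displacements.
    1.0 = perfectly straight path through latent space; lower = wobble.
    (Illustrative metric; the actual straightness measure is assumed.)"""
    deltas = np.diff(latents, axis=0)          # (T-1, D) steps along the path
    a, b = deltas[:-1], deltas[1:]
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    return float(cos.mean())

# A latent path moving at constant velocity is perfectly straight.
t = np.arange(10, dtype=float)
straight_path = np.outer(t, np.ones(4))
wobbly_path = straight_path + np.random.default_rng(2).normal(0, 0.5, straight_path.shape)
print(straightness(straight_path))   # ~1.0
print(straightness(wobbly_path))     # noticeably lower
```

A clip-level verdict could then threshold some weighted combination of mean surprise and straightness.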
SURPRISE is built on LeWorldModel (Maes et al., 2026) — the first Joint-Embedding Predictive Architecture that trains stably end-to-end from raw pixels with no exponential moving averages, no stop-gradient, no pretrained encoder.
The original paper showed something unexpected: a tiny model trained only to predict the future develops genuine physical intuition. It detects when objects teleport. Linear probes recover spatial position and rotation from its latents. It encodes time as nearly straight paths in latent space, the same property neuroscientists observe in the human visual cortex.
We applied this to video forgery. AI-generated video is, almost by definition, a physical violation. The model treats it the way it treats a teleporting cube — with elevated surprise.
No detector is foolproof. State-of-the-art video generators improve every month, and our model is trained on a fixed snapshot of real and generated content. We aim for AUC > 0.85 on standard benchmarks (FaceForensics++, DFDC), but novel generators will sometimes evade detection. Treat surprise scores as evidence, not proof.
Modern generators have largely solved single-frame artifacts. The remaining tells are temporal: physics that's almost-but-not-quite right across frames. Latent surprise captures this directly because it measures exactly how much the model's expectations break from frame to frame.
Faces are one of our strongest test cases because face dynamics are heavily constrained — heads move smoothly, blinking has a rhythm, lighting follows physics. Diffusion-generated faces violate these in subtle ways our model learns to flag. We perform best on face videos, then human motion, then unconstrained scenes.
Yes — like any detector, ours can be adversarially attacked. The most concerning vector is generators trained to minimize surprise scores. We don't have a complete defense, but the underlying physics constraints are hard to fully satisfy with current architectures, which buys time.
False positives are a real failure mode. Slow-motion footage, heavily edited content, and stylized or animated video can score as "surprising" simply because they don't match the natural-video distribution we trained on. We recommend running detection only on content that claims to be unedited real-world footage.
Yes — checkpoints and inference code will be released alongside our writeup once we've validated benchmark numbers. The training code is already open via the stable-worldmodel library from the LeWorldModel authors.