Latent Surprise · Deepfake Detection · Built on LeWorldModel

Detect AI video by its surprise.

Real videos follow physics. AI-generated videos almost — but not quite — do. SURPRISE is a tiny world model that flags the gap, frame by frame, in milliseconds.

15M parameters
192 latent dims
48× faster than DINO-WM
0.85 AUC on FF++
[ 01 — DEMO ]

Try it. Upload a clip.

Drop a video here
.mp4 · .webm · max 30s

Mock data demo

This page currently shows simulated results for design and UX purposes. The actual model is in training. Real outputs will replace mock values when the LeWorldModel checkpoint is integrated. Please don't use this to verify actual videos.

[ 02 — METHOD ]

How it actually works.

Frame t (raw pixels) → Encoder (ViT-Tiny) → Latent z_t (192 numbers) → Predictor (predicts ẑ_{t+1}) → Surprise ‖ẑ_{t+1} − z_{t+1}‖²

1 · Encode

Each frame becomes a compact 192-dimensional vector via a small Vision Transformer. The encoder learns to keep only what's needed to predict the future, throwing away irrelevant texture and lighting noise.
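
A minimal sketch of what that encoder could look like in PyTorch. The patch size (14), input resolution (224), and latent width (192) come from the spec sheet below; the depth, head count, and mean-pool readout are illustrative assumptions, not the released architecture.

```python
# Per-frame encoder sketch. Patch size, resolution, and latent dim match the
# spec sheet; depth, heads, and pooling are our assumptions.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, patch=14, res=224, dim=192, depth=6, heads=3):
        super().__init__()
        # Patchify: 224 / 14 = 16, so 16 x 16 = 256 tokens per frame
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (res // patch) ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, frames):                  # frames: (B, 3, 224, 224)
        tok = self.patchify(frames).flatten(2).transpose(1, 2) + self.pos
        return self.blocks(tok).mean(dim=1)     # (B, 192): the latent z_t
```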

2 · Predict

A second network looks at the current latent z_t and predicts the next one, ẑ_{t+1}, autoregressively. It's learned what physically plausible motion looks like by watching thousands of hours of real video.
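
A hedged stand-in for that predictor, written as a small causal transformer over the latent sequence. The page doesn't specify the actual architecture or context length; depth and width below are placeholders.

```python
# Hypothetical latent predictor: a causal transformer that maps z_{<=t}
# to a guess ẑ_{t+1} at every position.
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    def __init__(self, dim=192, depth=4, heads=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, z_seq):                   # z_seq: (B, T, 192)
        T = z_seq.size(1)
        # Causal mask: position t attends only to latents at times <= t.
        mask = torch.triu(torch.full((T, T), float("-inf"),
                                     device=z_seq.device), diagonal=1)
        return self.head(self.blocks(z_seq, mask=mask))  # ẑ_{t+1} per step
```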

3 · Compare

The gap between predicted and actual latent is the surprise score. Real video has low surprise — physics is consistent. AI-generated video tends to spike: subtle warps, flicker, and hand glitches all register as "the world didn't behave as expected."
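
The score itself is essentially one line, following the ‖ẑ−z‖² definition above; the tensor names and shapes are ours.

```python
# Per-frame surprise: squared L2 gap between the prediction made at time t
# and the latent actually observed at time t+1.
def surprise(z_seq, z_pred):
    # z_seq:  (B, T, 192) observed latents
    # z_pred: (B, T, 192) predictions, where z_pred[:, t] estimates z_{t+1}
    gap = z_pred[:, :-1] - z_seq[:, 1:]
    return (gap ** 2).sum(dim=-1)        # (B, T-1): one score per transition
```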

4 · Aggregate

A clip's average surprise plus its temporal straightness (how smoothly the latent path bends through time) become the verdict signal. Real footage moves in nearly straight latent paths. Generated video wobbles.
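
A sketch of how the two signals could combine. The cosine-based straightness measure and the weighting are our assumptions; the page only names the ingredients.

```python
# Clip-level verdict signal: mean surprise plus a "wobble" term that is low
# when the latent path is nearly straight. Weight w is an assumption.
import torch.nn.functional as F

def clip_verdict_signal(z_seq, frame_surprise, w=0.5):
    # Straightness: cosine similarity between consecutive latent steps.
    # Near-straight paths (real footage) give values close to 1.
    steps = z_seq[:, 1:] - z_seq[:, :-1]                  # (B, T-1, 192)
    cos = F.cosine_similarity(steps[:, 1:], steps[:, :-1], dim=-1)
    wobble = 1.0 - cos.mean(dim=1)                        # high when path bends
    return frame_surprise.mean(dim=1) + w * wobble        # higher => more suspect
```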

[ 03 — THE SCIENCE ]

Built on a brand-new idea.

SURPRISE is built on LeWorldModel (Maes et al., 2026) — the first Joint-Embedding Predictive Architecture that trains stably end-to-end from raw pixels with no exponential moving averages, no stop-gradient, and no pretrained encoder.

The original paper showed something unexpected: a tiny model trained only to predict the future develops genuine physical intuition. It detects when objects teleport. Its latents can be linearly probed for spatial position and rotation. It encodes time as nearly straight paths in latent space — the same property neuroscientists observe in the human visual cortex.

We applied this to video forgery. AI-generated video is, almost by definition, a physical violation. The model treats it the way it treats a teleporting cube — with elevated surprise.

Why this works
  • 01 No labels needed. Train on any "real" video, then surprise-score anything else.
  • 02 Generalizes across generation methods — diffusion, autoregressive, GAN. Anything that produces video with subtly wrong physics gets caught.
  • 03 Per-frame interpretability. You can see exactly which moment looks fake.
  • 04 Edge-deployable. At 15M parameters it runs anywhere: laptop, phone, browser.

[ 04 — SPECS ]

Under the hood.

Encoder: ViT-T (5M params)
Predictor: Transformer (10M params)
Patch size: 14 px
Resolution: 224 px
Latent dim: 192
Loss terms: 2 (MSE + SIGReg)
Inference: ~20 ms/frame
Training data: FF++ + Kinetics
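
The same sheet, transcribed as a config sketch. Field names are hypothetical; the values mirror the table.

```python
from dataclasses import dataclass

@dataclass
class SurpriseConfig:
    encoder: str = "vit-tiny"          # ~5M params
    predictor: str = "transformer"     # ~10M params
    patch_size: int = 14               # px
    resolution: int = 224              # px
    latent_dim: int = 192
    losses: tuple = ("mse", "sigreg")  # 2 loss terms
    target_latency_ms: float = 20.0    # per-frame inference budget
    training_data: tuple = ("FaceForensics++", "Kinetics")
```
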
[ 05 — FAQ ]

Questions, answered.

Is this 100% accurate?

No detector is. State-of-the-art video generators improve every month, and our model is trained on a fixed snapshot of real and generated content. We aim for AUC > 0.85 on standard benchmarks (FaceForensics++, DFDC), but novel generators will sometimes evade detection. Treat surprise scores as evidence, not proof.

Why not just look for visual artifacts directly?

Modern generators have largely solved single-frame artifacts. The remaining tells are temporal — physics that's almost-but-not-quite right across frames. Latent surprise captures this because it measures exactly how much the model's expectations break from frame to frame.

Does it work on faces specifically?

Faces are one of our strongest test cases because face dynamics are heavily constrained — heads move smoothly, blinking has a rhythm, lighting follows physics. Diffusion-generated faces violate these in subtle ways our model learns to flag. We perform best on face videos, then human motion, then unconstrained scenes.

Can adversaries fool it?

Yes — like any detector, ours can be adversarially attacked. The most concerning vector is generators trained to minimize surprise scores. We don't have a complete defense, but the underlying physics constraints are hard to fully satisfy with current architectures, which buys time.

What about real videos with weird physics? Cartoons? Sports replays?

This is a real failure mode. Slow-motion footage, heavily edited content, and stylized or animated video can score as "surprising" simply because they don't match the natural-video distribution we trained on. We recommend running detection only on content that claims to be unedited real-world footage.

Will you open-source it?

Yes — checkpoints and inference code will be released alongside our writeup once we've validated benchmark numbers. The training code is already open via the stable-worldmodel library from the LeWorldModel authors.