Real videos follow physics. AI-generated videos almost — but not quite — do. SURPRISE is a tiny world model that flags the gap, frame by frame, in milliseconds.
Each frame becomes a compact 192-dimensional vector via a small Vision Transformer. The encoder learns to keep only what's needed to predict the future, throwing away irrelevant texture and lighting noise.
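As a rough sketch of the interface, the encoder maps a raw frame to a single 192-dimensional latent. The toy code below stands in for the real Vision Transformer with a linear patch embedding plus mean pooling; the patch size, pooling, and weights are illustrative assumptions, and only the shapes (frame in, 192-dim vector out) reflect the description above.

```python
import numpy as np

PATCH = 16   # assumed patch size; the text only specifies the 192-dim latent
D = 192      # latent dimension from the text

def patchify(frame: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) frame into flattened non-overlapping patches."""
    H, W, C = frame.shape
    h, w = H // PATCH, W // PATCH
    patches = frame[:h * PATCH, :w * PATCH].reshape(h, PATCH, w, PATCH, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(h * w, PATCH * PATCH * C)

def encode_frame(frame: np.ndarray, w_embed: np.ndarray) -> np.ndarray:
    """Toy stand-in for the ViT encoder: linear patch embedding + mean pool.
    A real ViT adds positional embeddings and attention blocks."""
    tokens = patchify(frame) @ w_embed   # (num_patches, D)
    return tokens.mean(axis=0)           # (D,) compact latent z_t

rng = np.random.default_rng(0)
frame = rng.random((224, 224, 3))
w_embed = rng.standard_normal((PATCH * PATCH * 3, D)) * 0.02
z_t = encode_frame(frame, w_embed)
print(z_t.shape)  # (192,)
```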
A second network looks at the current latent z_t and predicts the next one, ẑ_{t+1}, autoregressively. It's learned what physically plausible motion looks like by watching thousands of hours of real video.
The gap between predicted and actual latent is the surprise score. Real video has low surprise because physics is consistent. AI-generated video tends to spike: subtle warps, flicker, and hand glitches register as "the world didn't behave as expected."
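The predictor-plus-score step can be sketched in a few lines. Here a single affine map with tanh stands in for the learned predictor (only the interface z_t → ẑ_{t+1} matters), and L2 distance in latent space is an assumed metric; the writeup doesn't pin down either choice.

```python
import numpy as np

D = 192  # latent dimension from the encoder description

def predict_next(z_t: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One-step latent predictor. An affine map + tanh stands in for the
    learned network; only the z_t -> z_hat_{t+1} interface matters here."""
    return np.tanh(z_t @ W + b)

def surprise_scores(latents: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Per-frame surprise: distance between predicted and actual latents.
    L2 distance is an assumption, not a confirmed detail."""
    preds = predict_next(latents[:-1], W, b)   # z_hat_{t+1} for t = 0..T-2
    return np.linalg.norm(preds - latents[1:], axis=1)

rng = np.random.default_rng(1)
latents = rng.standard_normal((8, D))          # T = 8 frames of a clip
W, b = rng.standard_normal((D, D)) * 0.05, np.zeros(D)
scores = surprise_scores(latents, W, b)
print(scores.shape)  # (7,) — one score per frame transition
```

On real footage these scores stay low and flat; a generated clip would show the spikes described above.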
A clip's average surprise plus its temporal straightness (how smoothly the latent path bends through time) become the verdict signal. Real footage moves in nearly straight latent paths. Generated video wobbles.
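One simple way to make "temporal straightness" concrete is the mean cosine similarity between consecutive latent displacements: 1.0 is a perfectly straight path, lower means wobble. This exact formula is an illustrative assumption, not the confirmed verdict signal.

```python
import numpy as np

def straightness(latents: np.ndarray) -> float:
    """Mean cosine similarity between consecutive latent displacements.
    1.0 = perfectly straight path through latent space; lower = wobble.
    (Illustrative metric; the actual straightness measure is assumed.)"""
    deltas = np.diff(latents, axis=0)          # (T-1, D) steps along the path
    a, b = deltas[:-1], deltas[1:]
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    return float(cos.mean())

# A latent path moving at constant velocity is perfectly straight.
t = np.arange(10, dtype=float)
straight_path = np.outer(t, np.ones(4))
wobbly_path = straight_path + np.random.default_rng(2).normal(0, 0.5, straight_path.shape)
print(straightness(straight_path))   # ~1.0
print(straightness(wobbly_path))     # noticeably lower
```

A clip-level verdict could then threshold some weighted combination of mean surprise and straightness.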
SURPRISE is built on LeWorldModel (Maes et al., 2026) — the first Joint-Embedding Predictive Architecture that trains stably end-to-end from raw pixels with no exponential moving averages, no stop-gradient, no pretrained encoder.
The original paper showed something unexpected: a tiny model trained only to predict the future develops genuine physical intuition. It detects when objects teleport. Linear probes recover spatial position and rotation from its latents. It encodes time as nearly straight paths in latent space, the same property neuroscientists observe in the human visual cortex.
We applied this to video forgery. AI-generated video is, almost by definition, a physical violation. The model treats it the way it treats a teleporting cube — with elevated surprise.
No detector is foolproof. State-of-the-art video generators improve every month, and our model is trained on a fixed snapshot of real and generated content. We aim for AUC > 0.85 on standard benchmarks (FaceForensics++, DFDC), but novel generators will sometimes evade detection. Treat surprise scores as evidence, not proof.
Modern generators have largely solved single-frame artifacts. The remaining tells are temporal: physics that's almost-but-not-quite right across frames. Latent surprise captures this directly because it measures exactly how much the model's expectations break from frame to frame.
Faces are one of our strongest test cases because face dynamics are heavily constrained — heads move smoothly, blinking has a rhythm, lighting follows physics. Diffusion-generated faces violate these in subtle ways our model learns to flag. We perform best on face videos, then human motion, then unconstrained scenes.
Yes — like any detector, ours can be adversarially attacked. The most concerning vector is generators trained to minimize surprise scores. We don't have a complete defense, but the underlying physics constraints are hard to fully satisfy with current architectures, which buys time.
False positives are a real failure mode. Slow-motion footage, heavily edited content, and stylized or animated video can score as "surprising" simply because they don't match the natural-video distribution we trained on. We recommend running detection only on content that claims to be unedited real-world footage.
Yes — checkpoints and inference code will be released alongside our writeup once we've validated benchmark numbers. The training code is already open via the stable-worldmodel library from the LeWorldModel authors.