PolygonalBeta
APR 28 · 6 MIN · METHOD

The 24-frame test: how we measure consistency.

Our AuraFace benchmark methodology. How we score identity similarity across 50+ scenes, and why 0.91 is the number we care about.

Most of the AI video industry is racing toward the wrong number. The pitch decks talk about 30 ms per frame, real-time playback, navigable 3D worlds at 24 fps. Genie 3 shipped 720p navigable worlds at 20 to 24 fps in August 2025. Decart raised $300M in May 2026 to push sub-40 ms video-to-video on Crusoe Cloud. The implicit promise is that once latency collapses to zero, the medium arrives.

We think that is backwards. The medium is already here. It arrives the moment you stop treating the wait as a bug.

A visual novel runs at zero frames per second between clicks. A film cuts to black for a full second on a hard beat. A theater audience holds its breath for three seconds before the lights come up. The wait is not the absence of the medium. The wait is the medium doing its work.

What three seconds actually buys you

Take the cheapest measurable beat in our pipeline. The player makes a choice. The narrative text streams in over roughly 800 ms on Claude Sonnet 4.6, fast enough to feel like a thought completing. Then the hero image arrives. Flux Kontext at pro tier returns in 3 to 5 seconds on the public API, or around 2.4 seconds on Simplismart's optimized H100 inference per their published benchmark. The result holds an AuraFace identity similarity around 0.91 against the reference. Voice from Cartesia Sonic-3.5 begins around 40 ms after the image lands. Inworld Realtime TTS-2, currently ranked first on the Artificial Analysis Speech Arena with an Elo of 1,236, comes in under 200 ms.

Now stack those numbers on a story beat. The player says something they cannot take back. The screen holds for a half-second. The text answers. The image resolves the character's expression. The voice lands. The whole sequence is four seconds long. In a chat app, four seconds is a hang. In a scene where the character has just been told something hard, four seconds is the right length.
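The arithmetic behind that four-second figure can be sketched in a few lines of Python. This is a rough check, not pipeline code: it assumes the stages run strictly one after another, it uses the 2.4-second optimized image figure, and the `Stage` and `beat_duration` names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    seconds: float


def beat_duration(stages):
    """Total wall-clock time, assuming each stage waits for the previous one."""
    return sum(stage.seconds for stage in stages)


# Figures taken from the text; the strict sequencing is an assumption.
BEAT = [
    Stage("held frame", 0.5),
    Stage("narrative text", 0.8),
    Stage("hero image", 2.4),
    Stage("voice onset", 0.04),
]

total = beat_duration(BEAT)
print(f"{total:.2f} s")
```

Swapping the image stage for the 3 to 5 second public API range moves the total to somewhere between about 4.3 and 6.3 seconds, which is why the image stage dominates the budget.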

We measured this. We ran the same scene at three pacing settings: aggressive (cut to the next beat as soon as anything is ready), default (let each layer breathe), and slow (a deliberate held-frame before the response). Aggressive felt like a chatbot trying to be a game. Slow felt like a stage play. Default felt like the medium we thought we were building.

The point is not that slower is better. The point is that pacing is a creative parameter. It belongs to the director, not to the latency engineer.

Why nobody else can write this post

The reason real-time is the loudest pitch in the category right now is that the loud pitches come from the model labs, and the model labs are competing on a single axis. Genie 3, Lucy 2, Mirage LSD, Veo 3.1, and Sora 2 before its sunset are all racing toward the same thing: an interactive video stream that responds at the speed of input. That is a real and impressive research problem. It is also a different product.

A streamed video model has to commit to a new frame every 33 ms at 30 fps, or every 42 ms at 24 fps, forever. Whatever it generates is what the user sees. If the character drifts, the model has to recover live. There is no scene break to hide a re-roll. This is why every demo of those models so far is either a flythrough of a static world, a short loop, or a vibe video without persistent characters. The architecture cannot afford the second look.

A paced scene gets the second look for free. We can hold the image generation back until the LLM has decided what the character is doing, regenerate Flux Kontext if the first pass dropped a feature, swap in Nano Banana Pro (at $0.134 per image instead of $0.04) when the scene needs two locked identities in frame, and only then show the player anything at all. The 3-second window is not waste. It is the budget for quality control.
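The per-frame model swap and its cost can be sketched as follows. This is a minimal illustration, assuming the $0.04 figure is the Flux Kontext pro price and that each re-roll is billed as one more image; the model keys and function names are hypothetical labels, not vendor identifiers.

```python
# Per-image prices quoted in the text; the keys are illustrative labels only.
PRICE_USD = {"flux-kontext-pro": 0.04, "nano-banana-pro": 0.134}


def pick_image_model(locked_identities: int) -> str:
    """Use the multi-identity model only when two or more locked characters share a frame."""
    return "nano-banana-pro" if locked_identities >= 2 else "flux-kontext-pro"


def session_image_cost(identities_per_beat, rerolls_per_beat=0):
    """Total image spend in USD, counting every re-roll as one more paid image."""
    images_per_beat = 1 + rerolls_per_beat
    return sum(
        PRICE_USD[pick_image_model(n)] * images_per_beat
        for n in identities_per_beat
    )


# Five beats: three solo frames and two multi-character frames.
beats = [1, 1, 2, 1, 3]
print(f"${session_image_cost(beats):.3f}")
```

Because the routing happens per beat, the more expensive model is only paid for on the frames that need it.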

Black Forest Labs published the KontextBench paper in 2025 with 1,026 prompt-image pairs across five task categories, designed specifically for the multi-turn drift case. Their result against Gen-4 and GPT-Image-1 is that Kontext degrades less per edit step. That is the math underneath the pacing. We get to use it because we are not trying to play at 24 fps.

What the pacing budget pays for

Once you accept that the scene takes 3 seconds, you start designing for what fits inside 3 seconds. A short list of things we now build that real-time video cannot:

The N minus one reference pattern. Every new scene of the hero is conditioned on the latest known-good image of that hero, not on a fixed canonical reference. Kontext was benchmarked on this exact task. It works because we have the time to fetch, condition, and verify. A real-time stream cannot.

Per-character LoRA distillation. Around 10 MB per character, minutes to a few hours to train on a single GPU, 85 to 95 percent feature retention on distinctive characters. We can apply this at scene time because the scene is not a frame; it is a moment.

Voice IDs over voice clones. Cloned voices drift across long sessions. Library voice IDs from ElevenLabs, Cartesia, and Inworld stay stable from session to session. We pick the voice once and lock it for the cast. We can do that because we are not improvising audio at 30 ms per chunk.

Multi-character composition on demand. Nano Banana Pro holds up to five distinct people in one frame from up to fourteen reference inputs, per the Gemini 3 Pro Image release. It costs more and is slower. In a paced scene we just use it on the frames that need it. In a streamed scene there is no "the frames that need it." There is only the next frame.

Verified output. We can read the image we just made, score it against the reference, and regenerate if it fails. A streamed model commits and moves on.
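Two items from this list, the N minus one reference pattern and verified output, fit together in one loop. The Python sketch below is illustrative only: `generate` and `embed` are hypothetical callables standing in for the image model and the face-embedding model, identity similarity is assumed to be cosine similarity between embeddings, and using 0.91 as the pass threshold is our assumption for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def next_scene_image(reference, generate, embed, threshold=0.91, max_attempts=3):
    """Generate from the reference, score the result, and re-roll on failure.

    Returns (image, score) on success, or (None, best_failed_score) otherwise.
    """
    ref_vec = embed(reference)
    best_score = -1.0
    for _ in range(max_attempts):
        candidate = generate(reference)
        score = cosine_similarity(ref_vec, embed(candidate))
        if score >= threshold:
            return candidate, score
        best_score = max(best_score, score)
    return None, best_score


def run_scenes(first_reference, scene_count, generate, embed, threshold=0.91):
    """Condition each scene on the latest known-good image (N minus one)."""
    known_good = first_reference
    history = []
    for _ in range(scene_count):
        image, score = next_scene_image(known_good, generate, embed, threshold)
        if image is not None:
            known_good = image  # the next scene conditions on this image
        history.append((image, score))
    return known_good, history
```

One design choice to flag: this sketch scores each candidate against the N minus one image only, so small deviations could accumulate over a long session. Scoring against a fixed canonical reference as well is a possible variation.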

The 24-frame test

The shorthand we use internally: would this beat work if it were 24 still frames of a comic, hand-spaced for rhythm, instead of 24 frames per second of video? If yes, we are designing for the medium. If the answer needs motion to land, that is a hero clip and we cut to Veo 3.1 or Kling 3.0 for 8 seconds and pay for it. The decision is per-beat, not per-product.
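The per-beat decision can be written down as a tiny routing rule. This is a sketch under stated assumptions: the `Beat` type, the `needs_motion` flag, and the plan dictionary are hypothetical, and the flag stands in for the human judgment of whether the beat fails as still frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beat:
    description: str
    needs_motion: bool  # True if the beat does not land as hand-spaced stills


def render_plan(beat: Beat) -> dict:
    """Route one beat: a paced still by default, an 8-second hero clip when motion is required."""
    if beat.needs_motion:
        return {"kind": "hero_clip", "seconds": 8}
    return {"kind": "still", "seconds": None}


scene = [
    Beat("character looks at the player", needs_motion=False),
    Beat("the bridge collapses", needs_motion=True),
]
plans = [render_plan(beat) for beat in scene]
```

The rule is applied to each beat in turn, so one scene can mix stills and a single paid clip.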

Most scenes pass the test. The character looks at the player. The player chooses. The character reacts. The room around them changes one beat. Nobody needs to see the room rendering at 24 fps. They need to see the right room, with the right character, at the right moment, with their voice landing in the right second. That is composition. That is direction. That is pacing.

The leading edge of this medium is not the frame rate. It is the rhythm.

Next we will write up the multi-turn identity work that makes the 3-second window pay off, the N minus one reference pattern in detail, and the AuraFace numbers we hold across a 50-turn session.