We launched AccelaStudy® AI today. The teaser you may have seen on Friday — the one that ends with "What if?" — was made over the course of about 36 hours, mostly by two people, almost entirely with off-the-shelf AI tools paid by the credit. This post is for anyone curious about what the actual workflow looked like, including the prompts that didn't work.
I'm writing this from my own perspective, not the company's. The corporate version of this post lives on Renkara and the marketing version is on AccelaStudy. They tell roughly the same story with different emphasis.
I almost wrote "education is broken"
The original draft of the script started with "Education is broken." I wrote that line and then sat with it for ten minutes before I threw the whole thing out. Every ed-tech ad I've ever seen opens with some variation of that line. Khan said it. Coursera said it. Every TED talk on learning says it. By the time you've heard it, you've already filed the next sixty seconds as "another ed-tech ad."
I rewrote it with a concrete image instead. Three short statements:
Twenty kids. One classroom. One pace.
That tells the audience the same thing — the system is broken — without telling them they're being marketed to. The visual does the work.
The full final script went through five revisions during production. The biggest single change was swapping one line: "That's not your fault" became "You're not the problem." Both lines mean the same thing structurally. The difference is that the first has a vague antecedent — fault for what? — and the second directly rebukes the system framing in the lines before it, at essentially the same length, which matters when you're working on a 60-second cut.
The protagonist
The teaser uses one face across five shots. For those shots to read as the same person across different lighting and emotional contexts, I needed a stable character reference up front.

I generated five candidates with Flux 1.1 Pro Ultra ($0.30 total, ten seconds per image, run in parallel). The brief was deliberately unpolished: a real-looking 15-16 year old, mixed ethnicity, soft natural light from one side, no makeup, no glamour-shot framing. Flux defaulted to female across all five candidates. I picked one.
That image then became the input for Flux Kontext Max every time the protagonist needed to appear in a different scene — confused at her desk, looking down in despair, the warm-light close-up at the end. Without Kontext, the same character would have drifted into five different people across five shots.
This is the workflow I'd repeat every time. Reference image first, then character lock, then animate. It costs almost nothing in API credits and saves entire days of iteration trying to make the model produce the same person twice.
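In code terms the loop is small. Here's a minimal sketch using the Replicate Python client; the model slugs are the public Replicate listings, but the input field names and the reference-image URL are assumptions, so check the current schemas before leaning on this:

# Sketch of the reference-then-lock workflow via the Replicate Python client.
# Model slugs from Replicate's listings; input fields and the reference URL
# are assumptions, not the exact production calls.
import replicate

# 1. Generate reference candidates (pick one by hand afterwards).
ref = replicate.run(
    "black-forest-labs/flux-1.1-pro-ultra",
    input={
        "prompt": (
            "real-looking 15-16 year old student, soft natural light from one side, "
            "no makeup, no glamour shot framing, photorealistic film still"
        ),
        "aspect_ratio": "16:9",
    },
)

# 2. Reuse the chosen reference every time the character appears in a new scene.
scene = replicate.run(
    "black-forest-labs/flux-kontext-max",
    input={
        "prompt": "same person, seated at a classroom desk, confused, frozen mid-thought",
        "input_image": "https://example.com/protagonist_reference.png",  # hypothetical URL to the picked candidate
        "aspect_ratio": "16:9",
    },
)
print(ref, scene)  # URLs or file objects, depending on client version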
The shots and the ones that wouldn't behave
Sixteen shot positions, with sixteen-plus distinct stills generated. Each got at least three rounds of iteration.

Shot 1 — the establishing wide of the institutional classroom — took five passes to get the teacher cleanly in the central walkway and not standing on top of a student. Each iteration I made the prompt more explicit ("teacher at the front of the classroom on a raised platform clearly separated from the students by empty floor space"). The fix that finally worked was specifying the geometry: "teacher is positioned at the very front of the room only, well away from any students, NOT among them."
Shot 2 — the kid in the back row — took seven versions to get the spatial position to read. The first three takes had him in close-up with no spatial context (could have been any classroom seat). The fourth had the camera positioned wrong, so his desk appeared to face the wrong direction. The seventh was right: 3/4 front framing with a clearly visible back wall close behind him, classmates blurred in the foreground (in front of him, from his perspective), wall clock as the only institutional cue. Seven takes. Just to establish where one student was sitting.

The shot that taught me the biggest lesson was Shot 3 — the front-row protagonist who's lost. The script beat is: she's holding a pencil, frozen in confused panic, then drops it in frustration. We generated the still. We generated the video with very strong negative prompts: writing, scribbling, hand moving, pencil moving on paper, taking notes, drawing. Every single take, Kling animated her writing.
I made the prompt stronger. I added more negation. Nothing worked.
The fix wasn't the prompt. The fix was the start frame. I regenerated the still with the pencil clearly held vertically in the air, several inches above the desk, eraser pointing up, lead pointing down. Once the geometry was right, the only motion that made physical sense was the drop. The model can only animate what the start frame allows.
I now generate stills before I generate videos. Every time. No exceptions.
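The same rule in code terms: the video call takes the chosen still as its start frame, so the composition is decided before the model ever reads a prompt. This is a sketch, not the production call; the Kling slug, the input field names, and the file URL are assumptions:

# Stills-first in practice: image-to-video with a deliberate start frame.
# Slug and field names are placeholders; check the model schema on Replicate.
import replicate

clip = replicate.run(
    "kwaivgi/kling-v2.1",  # stand-in for whichever Kling version you're using
    input={
        "start_image": "https://example.com/shot03_pencil_raised.png",  # hypothetical still, pencil held in the air
        "prompt": "she lets the pencil fall from her fingers onto the desk in frustration",
        "negative_prompt": "writing, scribbling, taking notes, drawing, pencil moving on paper",
        "duration": 5,
    },
)
print(clip)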
The diversity problem
The first round of background students was almost entirely white men. I added "balanced mix of young men and young women in equal proportion, multiple ethnicities including Black, white, Asian, Latino" to the prompt. We got better — clearly visible Black students, mixed gender — but never quite balanced. Image models reflect the training data; explicit prompting helps but doesn't override the prior. This is a known problem and not one I solved here. It's something to be aware of every time you generate a crowd.
The Pixar incident
One iteration of Shot 10 (the home-learning teen) came back as a 9-year-old in stylized 3D. Full Pixar tonality. We were nowhere near asking for animation.
The cause was one word in the prompt: "a faint reflection of an animated lesson in their eyes." The word "animated" tipped the entire image into a 3D-rendered aesthetic. Stripping that one word and adding "photorealistic film still, real human, naturalistic skin texture" fixed it instantly.
I'd been writing prompts for image generators for two years and still walked into this trap. Words have weight even when they're not the subject of the sentence.
Audio: the part where I had to record myself
Music was three rounds on ElevenLabs Music at about 50 cents per generation. The prompt was a three-act structure with explicit timing markers. The first track had a built-in ramp-up that fought our visual fade-in. The second was better. The third honored the explicit instruction: "START AT FULL DENSITY AND VOLUME from the very first frame at 0:00 — NO ramp-up, NO swell-in, NO crescendo intro." Even then I atrim'd the first 3 seconds of the file in post, just in case the music wanted to do something quiet at the open.
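That final trim is one ffmpeg call. A minimal sketch of dropping the first three seconds (file names are illustrative):

# Belt-and-braces trim of the track's first 3 seconds, so nothing quiet
# survives at the open. File names are illustrative.
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-i", "music_v3.mp3",
    "-af", "atrim=start=3,asetpts=PTS-STARTPTS",  # drop the first 3 s, reset timestamps
    "-c:a", "libmp3lame", "-b:a", "192k",
    "music_v3_trimmed.mp3",
], check=True)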
Narration was harder. We tried four ElevenLabs voices, generated full takes with each, and they were all... fine. Cinematic, professional, hits the marks. But the AI voice doesn't hesitate. It doesn't take an extra millisecond on a word that matters. The script needed someone who would read "What if?" and let the words sit.
So I recorded the narration myself. Twelve files, one per line. SM7B at six inches off-axis (kills plosives, captures the full warmth). Take three or four for most lines. The opening line — "Twenty kids. One classroom. One pace." — got nine takes. Delivering three short statements with a specific cadence is harder than it looks.
Then I ran every file through a cleanup chain:
ffmpeg -i in.mp3 -af "highpass=f=70,
afftdn=nr=10:nf=-22:tn=1,
adeclick,
deesser=i=0.4:m=0.5:f=0.5,
acompressor=threshold=-20dB:ratio=2.5:attack=5:release=80,
alimiter=limit=0.95:level=disabled" \
-c:a libmp3lame -b:a 192k out.mp3
This costs zero, runs in under a second per file, and saved me probably four hours of re-recording. The chain is: kill rumble below 70Hz, denoise the room tone, remove mouth clicks and pops, soften the "s" sounds, compress for line-to-line evenness, limit the peaks to prevent clipping. I then normalized every cleaned line to match the level of the first one — so the listener doesn't get pulled around as the cut moves from line to line.
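That level matching is the only part that needs a measurement pass. Here's a sketch of one way to do it, using ffmpeg's volumedetect to measure mean volume and a volume filter to close the gap; it's not the actual build script, and the file names are illustrative:

# Level-match the cleaned narration lines to the first one.
import re
import subprocess

def mean_volume_db(path: str) -> float:
    """Return the mean volume (dBFS) that ffmpeg's volumedetect reports."""
    out = subprocess.run(
        ["ffmpeg", "-i", path, "-af", "volumedetect", "-f", "null", "-"],
        capture_output=True, text=True,
    ).stderr
    return float(re.search(r"mean_volume:\s*(-?[\d.]+) dB", out).group(1))

files = [f"line_{i:02d}_clean.mp3" for i in range(1, 13)]  # twelve cleaned lines
target = mean_volume_db(files[0])                          # match everything to line 1

for path in files[1:]:
    gain = target - mean_volume_db(path)
    subprocess.run([
        "ffmpeg", "-y", "-i", path,
        "-af", f"volume={gain:.2f}dB",
        "-c:a", "libmp3lame", "-b:a", "192k",
        path.replace("_clean", "_leveled"),
    ], check=True)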
We kept both versions. There's a side-by-side comparison if you want to vote on which sounds better. I genuinely don't know which one's right and I want the data.
The end card
LAUNCHING / MONDAY in stacked typography. Plus Jakarta Sans Bold (AccelaStudy's brand font, lifted directly from our design system tokens). Cold blue #a8c4d8 on black.
The animation is what makes it. Black hold for 1.3 seconds after "What if?" lands. LAUNCHING fades in over half a second. MONDAY fades in 300 milliseconds behind LAUNCHING. Both hold to the end of the cut.
That 300ms stagger between LAUNCHING and MONDAY is the difference between a static title card and a beat. Without it, both words land at the same moment and the eye reads them as a single unit. With the stagger, the eye gets pulled to the date — which is the entire point of the title plate.
I built the plate with PIL (transparent text layers) plus ffmpeg fade filters. About 30 lines of Python. The whole title plate sequence is generated fresh on every build, so changing the timing is a parameter swap, not a re-render.
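For what it's worth, the whole plate fits in a sketch like this: PIL renders the two words as transparent layers, ffmpeg fades them in over black at the timings above. The font path, sizes, and output names are assumptions, not the production build:

# Rough reconstruction of the title plate: 1.3 s black hold, 0.5 s fades,
# MONDAY starting 300 ms behind LAUNCHING. Font file, sizes and names assumed.
import subprocess
from PIL import Image, ImageDraw, ImageFont

W, H = 1920, 1080
BLUE = "#a8c4d8"
font = ImageFont.truetype("PlusJakartaSans-Bold.ttf", 140)  # assumed local font file

def text_layer(word: str, y: int, path: str) -> None:
    """Render a single word as a transparent PNG layer."""
    img = Image.new("RGBA", (W, H), (0, 0, 0, 0))
    ImageDraw.Draw(img).text((W // 2, y), word, font=font, fill=BLUE, anchor="mm")
    img.save(path)

text_layer("LAUNCHING", H // 2 - 90, "launching.png")
text_layer("MONDAY",    H // 2 + 90, "monday.png")

# Black hold, then fade each layer in; MONDAY starts 0.3 s after LAUNCHING.
subprocess.run([
    "ffmpeg", "-y",
    "-f", "lavfi", "-i", "color=c=black:s=1920x1080:d=4:r=24",
    "-loop", "1", "-t", "4", "-i", "launching.png",
    "-loop", "1", "-t", "4", "-i", "monday.png",
    "-filter_complex",
    "[1:v]format=rgba,fade=t=in:st=1.3:d=0.5:alpha=1[l];"
    "[2:v]format=rgba,fade=t=in:st=1.6:d=0.5:alpha=1[m];"
    "[0:v][l]overlay[bg];[bg][m]overlay",
    "-c:v", "libx264", "-pix_fmt", "yuv420p", "endcard.mp4",
], check=True)

Because the plate is rebuilt from these parameters on every run, shifting the stagger from 300 ms to anything else really is a one-number change.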
What this whole thing cost
- Reference image generation: $0.30
- Character-locked stills (~50 across iterations): ~$4
- Video generation (~25 Kling 3 generations): ~$70
- Music (3 tracks): ~$1.50
- Narration TTS (B version comparison): ~$2
- Total external API spend: ~$80
Plus the human time: about 36 hours from first script draft to shipped cut. Two people. The slow part was iteration loops — getting the pencil orientation right took half a day, getting the back-row composition right took most of an afternoon. Total compute was well under an hour.
A traditional production for a 60-second teaser at this quality level — agency, location, talent, post-production — would be in the $50,000 to $200,000 range. We did this for $80 and the time of the people involved.
That's the leverage I keep writing about. The tools have arrived.
What I'd do differently
Two things, both small, both mostly already covered above:
Generate stills before generating video, every time. I learned this on Shot 2 the hard way. The first three video iterations all had the kid in the wrong spatial position because I was letting Kling guess the composition from a text prompt. Once I switched to image-to-video with deliberate stills, the bad takes stopped almost entirely.
Cleanup costs nothing. I almost re-recorded several narration lines because I thought they had too much room tone. The ffmpeg chain above ran in under a second per file and saved me hours. If you're recording your own VO at home, the cleanup pass is non-optional.
Stack
- Reference images: Flux 1.1 Pro Ultra (Replicate)
- Character lock: Flux Kontext Max (Replicate)
- Video animation: Kling 3 Video (Replicate)
- Music: ElevenLabs Music
- Narration TTS: ElevenLabs Voice (for B-version comparison only)
- Audio cleanup + assembly: ffmpeg
- Title plate: PIL + ffmpeg fade filter
- Build orchestration: ~280 lines of Python
- Brand font: Plus Jakarta Sans, from the AccelaStudy design system
All on Replicate or directly via vendor APIs. No agency, no rendering farm.
Watch it
AccelaStudy® AI launches Monday. We'll see if it works.