Lighter Sync — A Simple Solution for Lip Sync
Not Everything Needs a Neural Network to Move Its Mouth
A while back I was working on a project that synced TTS audio with 3D avatar animation. The platform we were using could return viseme events and blend shape data alongside the speech, which was genuinely impressive. But the tech wasn't quite there yet for fully articulated human faces. It was close, and the team put a lot of work into bridging the gap, but there was still a fair amount of manual effort needed to get past what the client called "the uncanny valley".
It got me thinking. In cases where you don't need to care about the uncanny valley, where it doesn't need to look fully human, what's our easiest entry point? What keeps latency low whilst still being useful enough to humanise the rendering output, and gives people the freedom to use whichever tooling they want?
The "Everything Is AI" Hangover
We're at a weird moment right now. Every product roadmap has "AI-powered" plastered across it, and the default response to any problem is to reach for a model. Need text? LLM. Need speech? Neural TTS. Need the avatar to look like it's talking? Another model on top of the model you already have.
And look, I get it. The tooling is impressive. But there's a cost that nobody puts in the demo: latency, complexity, vendor lock-in, and the sheer fragility of chaining five cloud services together just to make a cartoon open and close its mouth, not to mention the environmental impact of running all those models to do things that you can potentially do in browser anyway.
So I took a step back. Instead of asking "which AI tool does lip-sync best?", I asked: what's the simplest thing that could actually work?
Cartoons Have Been Doing This for a Hundred Years
Here's the thing that snapped it into focus for me: most animated characters on the web don't use blend shapes. They don't need twenty-two viseme positions and sub-millimetre facial mesh deformation. They need a mouth that opens and closes roughly in time with the audio, maybe stretches into an "O" shape for round vowels, and occasionally raises an eyebrow for emphasis.
That's it. That's the entire spec for "good enough" lip-sync in a lo-fi animated context.
When you think about it, traditional animation figured this out ages ago. Classic cartoons map dialogue to about five or six mouth shapes, and nobody sits there thinking "hmm, that character's bilabial plosive articulation seems off." The simplicity is the charm. It reads as expressive because it's simplified, not despite it.
The more I thought about it, the more I realised this is true for a huge category of use cases: chatbot avatars, educational characters, interactive mascots, game NPCs, social media content, presentations. None of these need the fidelity of a Pixar rig. They need something that feels alive and roughly matches the audio.
What Already Exists (and Why I Didn't Use It)
There are existing tools in this space. Rhubarb Lip Sync is probably the best known: it's a solid piece of software that analyses audio and generates mouth cue files based on phoneme recognition. It handles multiple mouth shape standards (Preston Blair, Disney, etc.) and produces high-quality results. Think back to those old Sierra point-and-click games.
Then there's Rive, which is genuinely brilliant for interactive animation. Sure, if you're Duolingo and you've got a dedicated animation team building a flagship character, that's an amazing use case. But this isn't about building the best possible animation pipeline. It's about finding the low-hanging fruit and shifting those milliseconds of compute away from the cloud AI and into the browser.
Rhubarb is a native C++ application. It runs on your machine, as a dedicated binary. There's an unofficial WASM port floating around, but "unofficial" is doing a lot of heavy lifting there. If you're building a browser-based application and you want lip-sync that works for any TTS provider or even for user-uploaded recordings, you're looking at either:
- Running a server-side process and dealing with the latency
- Shipping a WASM binary and hoping it holds up
- Locking into a specific TTS vendor that provides viseme data (hello, Azure)
None of those options felt right for what I wanted: a lightweight, zero-dependency library that runs entirely in the browser and doesn't care where the audio came from.
The Realisation: You Already Have the Script
Whilst it's on the horizon, and we are seeing it in some more expensive models, the reality is that widespread, commercially viable mixed models with TTS (and especially with visemes) just aren't a thing yet.
Here's the insight that made the whole thing click. In almost every TTS use case, you already have the script. You sent it to your TTS provider. You know exactly what words are being spoken. So why are we using AI to guess what's being said in the audio?
That's the core of what became LighterSync. Instead of trying to do speech recognition in reverse, we lean into what we already know: the text.
The pipeline is embarrassingly simple:

1. Text → Phonemes: Take the script, break it into words, look up each word's phonemes in a dictionary (or fall back to character-level rules). Map phonemes to viseme categories: Closed, Dental, Open, Round, Fricative, plus Silent. Six shapes. That's your lot.
2. Audio → Timing: Analyse the audio buffer using the Web Audio API. Calculate RMS amplitude in 20ms chunks. Find the segments where someone's actually talking versus where there's silence. Distribute the phoneme sequence across the speaking segments proportionally.
3. Playback → Interpolation: On each animation frame, look up where we are in the timeline, find the surrounding keyframes, and lerp between them for smooth transitions.
No cloud calls. No models. No WASM. Just the native Web Audio API, a dictionary lookup, and some maths.
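The playback step is just keyframe interpolation. Here's a minimal sketch of it; the keyframe shape and function names are illustrative, not LighterSync's actual API:

```typescript
// Find the keyframes straddling the current playback time and lerp between them.
interface MouthKeyframe {
  time: number;     // seconds from the start of the audio
  aperture: number; // 0 = closed, 1 = fully open
  width: number;    // 0 = narrow, 1 = wide
}

function lerp(a: number, b: number, t: number): number {
  return a + (b - a) * t;
}

function sampleMouth(
  keyframes: MouthKeyframe[],
  time: number,
): { aperture: number; width: number } {
  if (time <= keyframes[0].time) return keyframes[0];
  const last = keyframes[keyframes.length - 1];
  if (time >= last.time) return last;
  // Walk to the pair of keyframes surrounding the current time
  let i = 0;
  while (keyframes[i + 1].time < time) i++;
  const a = keyframes[i];
  const b = keyframes[i + 1];
  const t = (time - a.time) / (b.time - a.time);
  return {
    aperture: lerp(a.aperture, b.aperture, t),
    width: lerp(a.width, b.width, t),
  };
}

const frames: MouthKeyframe[] = [
  { time: 0.0, aperture: 0.0, width: 0.5 }, // Silent
  { time: 0.1, aperture: 0.9, width: 0.8 }, // Open
  { time: 0.2, aperture: 0.3, width: 0.2 }, // Round
];
// Call sampleMouth(frames, currentTime) on each animation frame
```

Because the output is just two numbers per frame, the renderer stays trivially swappable.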
The Cheat: Script-Informed Waveform Analysis
The clever bit (well, I thought it was clever) is in step two. We're not doing speech recognition. We're not trying to figure out what is being said. We already know that. What we need is when.
By analysing the audio's amplitude envelope, we can identify voice activity segments, the peaks, the troughs, the bits where the waveform isn't near silent. We know the total "weight" of our phoneme sequence (some sounds are naturally longer than others, round vowels hang around, plosives are quick). So we proportionally distribute the sequence across the speaking time:
Duration_phoneme = (Weight_phoneme / Total_Weight) × Total_Speaking_Time
Is this acoustically perfect? No. Is it good enough that a cartoon character's mouth moves convincingly in sync with speech? Yes. And that's the entire point.
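The proportional distribution above is a few lines of code. Here's a sketch; the phoneme weights in the example are illustrative, not the library's shipped values:

```typescript
// Assign each phoneme a duration proportional to its weight:
// duration = (weight / totalWeight) * totalSpeakingTime
interface TimedPhoneme {
  phoneme: string;
  start: number;    // seconds
  duration: number; // seconds
}

function distributePhonemes(
  phonemes: { phoneme: string; weight: number }[],
  totalSpeakingTime: number,
): TimedPhoneme[] {
  const totalWeight = phonemes.reduce((sum, p) => sum + p.weight, 0);
  let cursor = 0;
  return phonemes.map((p) => {
    const duration = (p.weight / totalWeight) * totalSpeakingTime;
    const timed = { phoneme: p.phoneme, start: cursor, duration };
    cursor += duration;
    return timed;
  });
}

// Round vowels hang around (weight 2); other sounds are quicker (weight 1)
const timeline = distributePhonemes(
  [
    { phoneme: "HH", weight: 1 },
    { phoneme: "AH", weight: 2 },
    { phoneme: "L", weight: 1 },
  ],
  2.0, // seconds of detected speech
);
```

The durations always sum to the detected speaking time, so the mouth stops moving exactly when the voice does.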
Five Mouth Shapes Is All You Need
The simple mode maps everything down to five viseme categories plus silence:
| Category | Sounds | What the Mouth Does |
|---|---|---|
| Closed | p, b, m | Lips pressed together |
| Dental | t, d, n, s, z | Slightly open, tongue forward |
| Open | a, e, i, ah | Wide open |
| Round | o, u, w | Pursed, circular |
| Fricative | f, v | Bottom lip tucked under teeth |
| Silent | — | Neutral / closed |
Each gets an aperture (how open) and width (how wide) value between 0 and 1. Your renderer just needs to map those two numbers to whatever mouth shape representation you're using, whether that's scaling an ellipse, drawing on a canvas, or adjusting SVG paths.
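As a sketch of what that mapping looks like in practice (the numeric aperture/width values here are my illustrative guesses, not LighterSync's shipped defaults):

```typescript
// The five categories plus silence, each as an aperture/width pair in [0, 1].
type Viseme = "closed" | "dental" | "open" | "round" | "fricative" | "silent";

const MOUTH_SHAPES: Record<Viseme, { aperture: number; width: number }> = {
  closed:    { aperture: 0.0,  width: 0.5 }, // p, b, m — lips pressed together
  dental:    { aperture: 0.2,  width: 0.6 }, // t, d, n, s, z — slightly open
  open:      { aperture: 0.9,  width: 0.7 }, // a, e, i, ah — wide open
  round:     { aperture: 0.6,  width: 0.2 }, // o, u, w — pursed, circular
  fricative: { aperture: 0.15, width: 0.5 }, // f, v — lip tucked under teeth
  silent:    { aperture: 0.0,  width: 0.4 }, // neutral / closed
};

// Example renderer: map the two numbers onto an ellipse's radii.
function mouthEllipse(viseme: Viseme, maxHeight = 40, maxWidth = 60) {
  const { aperture, width } = MOUTH_SHAPES[viseme];
  return { ry: aperture * maxHeight, rx: width * maxWidth };
}
```

Swapping the ellipse for an SVG path or a canvas draw call is the same two-number lookup.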
If you want more fidelity, there's a full mode that outputs all 22 viseme IDs (kinda compatible with the Azure/Oculus standard). But honestly, for the use cases I'm targeting, the five-shape mode is more expressive, not less. Fewer states means smoother transitions and less visual noise.
It's Not Just the Mouth
One thing that really sells the illusion of a "talking" character isn't the lip-sync at all, it's everything else. Eyebrow raises on emphasis. Blinks during pauses. A slight squint during sustained loud passages. These microexpressions are what make a static character feel alive.
LighterSync generates these from the audio amplitude as well:
- Eyebrow raise: triggered when the RMS exceeds 1.5× the running average (natural emphasis detection)
- Squint: sustained loud passage > 0.5 seconds
- Blink: silence gaps > 300ms (turns out, humans blink during natural pauses in speech)
These are crude heuristics, but they're also remarkably effective. You don't need a sentiment model to figure out when someone's emphasising a word; in most cases, they just get louder.
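Two of those heuristics sketched out, using the thresholds from the list above (the function names and the silence floor value are illustrative assumptions):

```typescript
// RMS amplitude of one chunk of samples.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Eyebrow raise: fires when a chunk's RMS exceeds 1.5x the running average.
// chunkRms holds one RMS value per 20ms chunk; returns chunk indices of events.
function detectEyebrowRaises(chunkRms: number[]): number[] {
  const events: number[] = [];
  let runningAvg = 0;
  chunkRms.forEach((value, i) => {
    if (i > 0 && value > 1.5 * runningAvg) events.push(i);
    runningAvg = runningAvg + (value - runningAvg) / (i + 1); // cumulative mean
  });
  return events;
}

// Blink: fires when a silence gap longer than 300ms ends.
function detectBlinks(
  chunkRms: number[],
  chunkMs = 20,
  silenceFloor = 0.01, // illustrative near-silence threshold
  minGapMs = 300,
): number[] {
  const blinks: number[] = [];
  let gapStart = -1;
  chunkRms.forEach((value, i) => {
    if (value < silenceFloor) {
      if (gapStart < 0) gapStart = i;
    } else {
      if (gapStart >= 0 && (i - gapStart) * chunkMs > minGapMs) blinks.push(gapStart);
      gapStart = -1;
    }
  });
  return blinks;
}
```

A few comparisons per chunk, no model in sight.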
The Dictionary Situation
For phoneme lookup, LighterSync ships with tiered slices of the CMU Pronouncing Dictionary, cross-referenced against the Google 10K English frequency list:
| Tier | Words | Size | Coverage |
|---|---|---|---|
| Small | ~500 | ~14 KB | ~80% of everyday speech |
| Medium | ~5,000 | ~177 KB | ~95% of written text |
| Full | ~126,000 | ~5 MB | Complete CMU dictionary |
You pick the tier that fits your bundle budget. Or you skip the dictionary entirely, there's a built-in character-level G2P fallback that does a decent job based on English spelling rules (consonant clusters, common digraphs, etc.). It won't nail every word, but for a character that's opening and closing its mouth to five shapes? It's more than adequate.
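The lookup-with-fallback pattern looks roughly like this. The two-entry dictionary and the fallback rules are toy examples; the real library ships CMU-derived data and much fuller spelling rules:

```typescript
// Tiered dictionary lookup with a character-level G2P fallback.
const DICTIONARY: Record<string, string[]> = {
  hello: ["HH", "AH", "L", "OW"],
  world: ["W", "ER", "L", "D"],
};

// Minimal grapheme-to-phoneme fallback: one crude rule per letter class.
function naiveG2P(word: string): string[] {
  const phonemes: string[] = [];
  for (const ch of word.toLowerCase()) {
    if ("aeiou".includes(ch)) phonemes.push("AH");   // vowels → open
    else if ("pbm".includes(ch)) phonemes.push("M"); // bilabials → closed
    else if ("fv".includes(ch)) phonemes.push("F");  // labiodental fricatives
    else phonemes.push("T");                         // everything else → dental
  }
  return phonemes;
}

function lookup(word: string): string[] {
  return DICTIONARY[word.toLowerCase()] ?? naiveG2P(word);
}
```

Even when the fallback guesses wrong, it only has to be right enough to pick one of six mouth shapes.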
What This Means in Practice
Here's what I actually wanted, and what this now enables:
- Use any TTS provider, Google, AWS, OpenAI, ElevenLabs, Qwen, a local model, whatever. As long as you get an audio file and you know the script, you're set.
- Use your own recordings, voice actors, podcasts, lectures. If you've got audio and a transcript, it works.
- Run entirely in the browser, no server, no API keys, no latency. The analysis runs in an OfflineAudioContext and completes in milliseconds.
- Plug into any renderer, the library outputs normalised values. How you draw the mouth is your problem, and that's by design.
The demo uses a Zdog character, partly because I wanted an excuse to keep playing with Zdog after doing the talks.fyi mascot, but the same data could drive a CSS animation, an SVG face, a Three.js model, or a Lottie character.
The Trade-Offs (Because There Always Are)
I want to be upfront about what this isn't:
It won't replace proper lip-sync for high-fidelity 3D characters. If you're doing realistic facial animation, film VFX, AAA game cinematics, virtual human research, you need actual phoneme-level alignment, blend shape targets, and probably a neural model. LighterSync is not trying to compete there.
The timing is approximate, not exact. Script-informed alignment distributes phonemes proportionally across detected speech segments. It doesn't know that the speaker paused for 200ms in the middle of a sentence. For lo-fi use cases, this is a feature (it smooths things out). For high-fidelity use cases, it's a limitation.
The G2P fallback is English-only. The rule-based fallback and the CMU dictionary are both English. Multi-language support would need language-specific phoneme mappings, which is a different project (that being said, if you want to put in a PR, be my guest).
It doesn't handle non-speech audio. Music, sound effects, background noise — these will confuse the amplitude analysis. It expects speech audio, ideally clean speech audio.
The Bigger Point
This project is a small thing. A few hundred lines of TypeScript. But it's a reminder of something I keep coming back to: the best tool for the job isn't always the most sophisticated one.
We've built a bit of an industry habit of reaching for AI first and asking questions later. And I'm not anti-AI, I use LLMs and agents daily, I've built production systems on top of them, I think they're genuinely transformative. But there's a difference between using AI where it adds real value and using AI because it's the default setting on every product roadmap.
Making a cartoon mouth open and close in time with audio is not an AI problem. It's a timing problem. And timing is just maths.
Sometimes the best thing you can do is slow down, look at what you're actually trying to achieve, and ask: is this a hammer problem or a cloud-sync problem?
Have a play with the demo, have a look at the source, and if you've got an animated character that needs to move its mouth, maybe a few hundred lines of vanilla TypeScript might be all you need.