Representation
Splats over meshes and NeRF.
Meshes need expensive scans and look uncanny at the edges. NeRF burns budget per frame. Splats train cheap, render cheap, and fail blurry — not wrong.
Engineering deep-dive
Architecture, latency budget, VRAM allocation, training pipeline, real run telemetry, and the architectural choices behind the system. Skim the diagrams, read the numbers, run the interactive demo.
Built on Bun · Hono · SvelteKit · Whisper · F5-TTS · NVIDIA Audio2Face · GSAC
Architecture
Mic to first phoneme in under 1.2 seconds means nothing in the loop sits and waits. Audio chunks flow as they arrive; tokens stream into TTS as the LLM emits them; blendshapes drive the renderer the moment the first 80 ms of audio lands.
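A minimal sketch of that streaming shape, assuming each stage can be modeled as an async generator. The names `llmTokens`, `ttsChunks`, and `collectChunks`, and the clause-boundary chunking rule, are hypothetical stand-ins for illustration, not Laika's actual interfaces.

```typescript
// Hypothetical stand-in for an LLM that emits tokens one at a time.
async function* llmTokens(reply: string): AsyncGenerator<string> {
  for (const token of reply.split(" ")) {
    yield token;
  }
}

// Hypothetical stand-in for streaming TTS: emit a chunk as soon as a
// clause boundary arrives instead of waiting for the full reply.
async function* ttsChunks(tokens: AsyncIterable<string>): AsyncGenerator<string> {
  let pending: string[] = [];
  for await (const token of tokens) {
    pending.push(token);
    if (/[.,!?]$/.test(token)) {
      yield pending.join(" ");
      pending = [];
    }
  }
  if (pending.length > 0) {
    yield pending.join(" "); // flush whatever is left at end of stream
  }
}

// Collects chunks in arrival order; a real consumer would forward each one
// to the animator and renderer immediately rather than buffering.
async function collectChunks(reply: string): Promise<string[]> {
  const chunks: string[] = [];
  for await (const chunk of ttsChunks(llmTokens(reply))) {
    chunks.push(chunk);
  }
  return chunks;
}
```

The point of the generator chain is that the first chunk becomes available after the first clause, so downstream stages can start before the LLM has finished its reply.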
Performance targets
| Stage | Target |
|---|---|
| VAD detect end-of-speech | 150 ms |
| STT final transcript | 200 ms |
| LLM time-to-first-token | 400 ms |
| TTS first audio chunk | 250 ms |
| Audio2Face first blendshape | 80 ms |
| Renderer first frame | 50 ms |
| WebRTC delivery | 80 ms |
| Total | ~1210 ms |
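The total and the intermediate milestones can be checked by summing the per-stage targets. The sketch below assumes stages run strictly back to back, which is a simplification, and it reads "first phoneme" as the end of the TTS stage and "first mouth movement" as the end of the Audio2Face stage. The function name `elapsedAfter` is illustrative.

```typescript
// Per-stage targets in milliseconds, copied from the performance table.
const stageTargetsMs: ReadonlyArray<readonly [string, number]> = [
  ["VAD detect end-of-speech", 150],
  ["STT final transcript", 200],
  ["LLM time-to-first-token", 400],
  ["TTS first audio chunk", 250],
  ["Audio2Face first blendshape", 80],
  ["Renderer first frame", 50],
  ["WebRTC delivery", 80],
];

// Cumulative elapsed time at the end of the named stage, assuming the
// stages run one after another with no overlap.
function elapsedAfter(stageName: string): number {
  let total = 0;
  for (const [name, ms] of stageTargetsMs) {
    total += ms;
    if (name === stageName) return total;
  }
  throw new Error(`unknown stage: ${stageName}`);
}

const firstPhonemeMs = elapsedAfter("TTS first audio chunk"); // 1000
const firstMouthMoveMs = elapsedAfter("Audio2Face first blendshape"); // 1080
const totalMs = elapsedAfter("WebRTC delivery"); // 1210
```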
Stack
Engineering
Two budgets define the product: an end-to-end latency target of 1.2 seconds, and a single-GPU VRAM footprint of roughly 13 GB. Both are reproducible, traced, and visible in production.
End-to-end latency
Total budget: 1210 ms · First phoneme at ~1000 ms · First mouth movement at ~1080 ms
VRAM allocation
Thinking
Not obvious at the start. Obvious now — only because the system works.
Representation
Meshes need expensive scans and look uncanny at the edges. NeRF burns budget per frame. Splats train cheap, render cheap, and fail blurry — not wrong.
Execution
Sub-1.2-second response means no stage can sit and wait. Every component selection collapses to one criterion: does it stream natively?
Stack
Whisper, F5-TTS, Audio2Face, GSAC, OpenRouter. None of them ours. The right product role for Laika is orchestration and identity, not foundational research.
Engineering note · follow the log on GitHub ↗
Why Laika
Three commitments that shape every architecture decision in the codebase.
Open by default
Every model in the loop (Whisper, F5-TTS, Audio2Face, GSAC) is open-source. The LLM is reached through OpenRouter, so you can swap Claude, GPT, or Llama without changing a line of code. Your stack stays yours.
Single-GPU footprint
The whole pipeline fits in roughly 13 GB of VRAM on a single 5070 Ti. No farm required, no per-session GPU markup. Train for roughly $5 of rented H100 time, then run inference on hardware you already own.
Identity stays yours
Voice and identity are trained from your own monocular capture. Inference runs on infrastructure you control — there is no Laika cloud you have to send your likeness through.
Quickstart
The whole pipeline is one CLI plus one client SDK. Capture, train, serve, embed — the same flow whether you're shipping to a Svelte app, a Python service, or a curl request.
# 1. Train a personal avatar from a 5-minute capture
$ laika train --capture ./reference --out avatar.ply
→ Uploading 312 frames to RunPod H100…
→ COLMAP poses · SMPL-X fits · GSAC train (32 min)
→ Saved avatar.ply (84 MB) · cost $4.91

# 2. Serve it
$ laika serve --avatar avatar.ply --port 3000
→ Gateway ready · WebRTC on :3000 · GPU 12.4G live

import Laika from "@laika/sdk"

const session = await Laika.connect({
  avatar: "me-v3",
  video: videoEl,
  audio: micStream,
})
session.on("latency", m => hud.render(m))

Training pipeline
One offline pipeline turns five minutes of monocular video into a deployable Gaussian-splat avatar. CPU prep happens on your workstation; the GSAC train runs on a rented H100. The whole thing fits in a coffee break and costs less than lunch.
01 · Monocular phone video, 1080p, ±90° head turns
02 · ffmpeg, 312 frames at 10 fps
03 · Structure-from-motion, per-frame camera poses
04 · Per-frame face / body parameter fits
05 · Compressed prep payload to RunPod
06 · 30k iterations, face_priority on H100
07 · avatar.ply (84 MB) + tracking metadata
Numbers from avatar-mvp-spec.md §9 · H100 hour rate at ~$6/hr · Steps 01–04 run on the workstation, no GPU required.
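As a rough check on the compute cost, the sketch below assumes billing is linear in wall-clock minutes at the quoted ~$6/hr rate. The function name `trainingCostUsd` is illustrative, and real invoices may add storage or minimum-billing charges that are not modeled here.

```typescript
// Assumption: flat hourly rate, billed linearly by the minute.
function trainingCostUsd(minutes: number, hourlyRateUsd: number): number {
  if (minutes < 0 || hourlyRateUsd < 0) {
    throw new RangeError("minutes and hourlyRateUsd must be non-negative");
  }
  // Round to whole cents.
  return Math.round((minutes / 60) * hourlyRateUsd * 100) / 100;
}

const H100_RATE_USD = 6; // ~$6/hr, the rate quoted in the note above

const shortRunUsd = trainingCostUsd(30, H100_RATE_USD); // 3
const longRunUsd = trainingCostUsd(60, H100_RATE_USD); // 6
```

Under that assumption, a 30 to 60 minute end-to-end run lands in the $3 to $6 range.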
Telemetry · run 0042
Every avatar train is logged end-to-end. The numbers below come from a single 30k-iteration run on a single H100 — the same shape as every run we ship with the SDK.
Training loss
GPU utilization
Output
Try it
Type something. The same six stages your voice would travel through, animated with realistic timings, will fire here in the page.
Demo is fully client-side — no server round-trip. Numbers reflect the same per-stage budget the real pipeline targets.
Where Laika sits
Talking-head SaaS gives you a face but not a real-time voice loop. Chatbots give you a voice loop but not a face. Laika is the first product to ship both, on a stack you can run.
| Capability | Laika | Talking-head SaaS (D-ID, HeyGen) | Standard chatbot (ChatGPT, Claude) |
|---|---|---|---|
| Realtime voice → mouth sync | Yes | No | Not applicable |
| Photoreal 3D avatar | Yes | No | No |
| Custom-trained on you | Yes | Partial | No |
| Voice-cloned to you | Yes | Yes | No |
| Sub-1.2 s end-to-end response | Yes | No | Yes |
| Self-hostable | Yes | No | Partial |
| Open stack (no lock-in) | Yes | No | Partial |
| Single-GPU inference | Yes | No | Yes |
Comparison reflects the current public capabilities of named products as of May 2026.
Product surfaces
The conversation surface is the visible piece. Behind it, three developer surfaces track exactly what's happening — every one of them ships in the same SDK.
Per-stage latency view, surfaced from Redis traces. Stage cycles light up as the pipeline progresses.
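One way to picture what such a view computes: compare a measured trace against the per-stage targets and flag the stages that ran over. The trace shape, the short stage keys, and `findOverruns` are all hypothetical, and the sample trace is made up for illustration rather than taken from real telemetry.

```typescript
type StageTimings = Record<string, number>;

// Targets in milliseconds, taken from the performance table in this document.
// The short keys are an assumed naming scheme.
const budgetMs: StageTimings = {
  vad: 150, stt: 200, llm: 400, tts: 250, a2f: 80, render: 50, webrtc: 80,
};

interface Overrun {
  stage: string;
  measuredMs: number;
  targetMs: number;
}

// Returns every stage whose measured time exceeds its target.
function findOverruns(trace: StageTimings, budget: StageTimings): Overrun[] {
  const overruns: Overrun[] = [];
  for (const stage of Object.keys(budget)) {
    const targetMs = budget[stage];
    const measuredMs = trace[stage];
    // Stages missing from the trace are skipped rather than treated as zero.
    if (measuredMs !== undefined && measuredMs > targetMs) {
      overruns.push({ stage, measuredMs, targetMs });
    }
  }
  return overruns;
}

// Made-up sample trace, for illustration only.
const sampleTrace: StageTimings = {
  vad: 140, stt: 230, llm: 380, tts: 250, a2f: 95, render: 40, webrtc: 80,
};
const sampleOverruns = findOverruns(sampleTrace, budgetMs);
```

In this sample, only `stt` and `a2f` are flagged; a stage that lands exactly on its target is treated as within budget.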
Memory inspector. Every fact the avatar remembers is timestamped, scored, and editable.
Live training output from a RunPod H100. Iteration, loss, PSNR, Gaussian count tick as the run progresses.
Common questions
Pulled from real conversations with developers and operators evaluating Laika. Missing something? Email us.
About 30–60 minutes on a rented RunPod H100, end-to-end — including upload, COLMAP poses, SMPL-X fits, and the GSAC train. The capture itself is 5 minutes of monocular phone video. A single training run costs $3–6 of compute.
Inference is bound by the renderer and animator, which together fit in ~13 GB of VRAM. A single 5070 Ti or rented A10 (≈$0.30/hour on Vast.ai) handles one concurrent session at 30 fps. You buy the GPU once or pay per hour; Laika adds no per-minute or per-token pricing.
Yes. The gateway speaks OpenRouter, which fronts Claude, GPT, Llama, Mistral, and most open-weight models. For local-only deployments, point it at a vLLM endpoint serving Llama 3.3 on a second GPU.
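A sketch of why the swap is cheap, assuming the gateway talks to an OpenAI-compatible chat completions route, which both OpenRouter and vLLM expose. `buildChatRequest` is a hypothetical helper, the local host and port are assumptions, and the model IDs are illustrative; check each provider's model list for exact names.

```typescript
interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

const OPENROUTER_BASE = "https://openrouter.ai/api/v1";
// Assumed local address; vLLM's OpenAI-compatible server listens on port 8000 by default.
const LOCAL_VLLM_BASE = "http://localhost:8000/v1";

// Builds the request without sending it, so the same call site can target
// either backend by changing only the base URL and model name.
function buildChatRequest(
  baseUrl: string,
  model: string,
  userText: string,
  apiKey?: string,
): ChatRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) headers["Authorization"] = `Bearer ${apiKey}`;
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({
        model,
        stream: true, // stream tokens so TTS can start early
        messages: [{ role: "user", content: userText }],
      }),
    },
  };
}

const hosted = buildChatRequest(
  OPENROUTER_BASE, "meta-llama/llama-3.3-70b-instruct", "Hello", "YOUR_OPENROUTER_KEY",
);
const local = buildChatRequest(
  LOCAL_VLLM_BASE, "meta-llama/Llama-3.3-70B-Instruct", "Hello",
);
```

Either result can be passed to `fetch(request.url, request.init)`; only the base URL, model name, and key differ between the hosted and local paths.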
Yes. F5-TTS clones from a reference recording you provide, and the gateway requires the trained voice token to match the active session. There is no public model that can be invoked against an arbitrary identity.
Yes — that's the default deployment. The repo ships a docker-compose that brings up the gateway, renderer, and Redis on one machine. There is no Laika cloud you have to route through.
For inference: a single 16 GB+ GPU (5070 Ti, RTX 4090, A10, A100). For training: rent an H100 by the hour from RunPod — owning the card isn't worth it for occasional retrains. The CPU side runs comfortably on a modern workstation.
Ready when you are
Get your trained avatar, the SDK, and direct access to engineering — for less than a week of cloud inference would cost.