Engineering deep-dive

How the pipeline actually works.

Architecture, latency budget, VRAM allocation, training pipeline, real run telemetry, and the architectural choices behind the system. Skim the diagrams, read the numbers, run the interactive demo.

1.18s end-to-end latency
13 GB single-GPU footprint
30 fps 720p H.264 / WebRTC
$3–6 per avatar trained

Built on Bun · Hono · SvelteKit · Whisper · F5-TTS · NVIDIA Audio2Face · GSAC

Architecture

Every stage streams. No stage waits.

Mic to first phoneme in under 1.2 seconds means nothing in the loop sits and waits. Audio chunks flow as they arrive; tokens stream into TTS as the LLM emits them; blendshapes drive the renderer the moment the first 80 ms of audio lands.

  1. Mic (Browser) → audio chunks
  2. STT (Whisper, GPU) → partial transcripts
  3. LLM (OpenRouter) → token stream
  4. TTS (F5-TTS, GPU) → voice-cloned audio
  5. Animator (A2F-3D, GPU) → ARKit blendshapes at 30 Hz
  6. Renderer (GSAC, GPU) → H.264 frames
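
One way the "tokens stream into TTS" hand-off could work is to release text to the synthesizer at clause boundaries rather than waiting for the full reply. The sketch below is an illustration only: `ClauseChunker` and `chunkTokens` are hypothetical names, and the punctuation rule is an assumption, not something the spec describes.

```typescript
// Hypothetical helper: buffers streamed LLM tokens and releases a chunk for
// TTS as soon as a clause boundary arrives, instead of waiting for the reply to end.
class ClauseChunker {
  private buffer = "";

  // Push one token; returns a finished clause if this token closed one, else null.
  push(token: string): string | null {
    this.buffer += token;
    // Assumption: a clause ends at . ! ? , ; or : (a real system would need
    // to special-case decimals such as "1.2").
    if (/[.!?,;:]\s*$/.test(this.buffer)) {
      return this.flush();
    }
    return null;
  }

  // Release whatever is left, e.g. when the LLM stream ends.
  flush(): string | null {
    const chunk = this.buffer.trim();
    this.buffer = "";
    return chunk.length > 0 ? chunk : null;
  }
}

// Run a whole token sequence through the chunker and collect the TTS chunks.
function chunkTokens(tokens: string[]): string[] {
  const chunker = new ClauseChunker();
  const chunks: string[] = [];
  for (const token of tokens) {
    const chunk = chunker.push(token);
    if (chunk !== null) chunks.push(chunk);
  }
  const tail = chunker.flush();
  if (tail !== null) chunks.push(tail);
  return chunks;
}

const ttsChunks = chunkTokens(["Hi", " there", ",", " how", " are", " you", "?", " Good"]);
console.log(ttsChunks); // ["Hi there,", "how are you?", "Good"]
```

Smaller chunks reach TTS sooner but give the voice model less context for prosody, so the boundary rule is a tuning knob rather than a fixed choice.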

Performance targets

Adding up to 1210 milliseconds.

Lifted from avatar-mvp-spec.md §6. Numbers assume streaming-capable models and the animator + renderer co-located on the same host.

Stage Target
VAD detect end-of-speech 150 ms
STT final transcript 200 ms
LLM time-to-first-token 400 ms
TTS first audio chunk 250 ms
Audio2Face first blendshape 80 ms
Renderer first frame 50 ms
WebRTC delivery 80 ms
Total ~1210 ms
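
The total is a straight sum of the per-stage targets, and the running sum gives the point at which each stage's first output is due. A minimal sketch of that arithmetic, using only the numbers in the table (the type and function names are illustrative):

```typescript
interface StageBudget {
  stage: string;
  targetMs: number;
}

// Per-stage targets copied from the table above.
const budget: StageBudget[] = [
  { stage: "VAD detect end-of-speech", targetMs: 150 },
  { stage: "STT final transcript", targetMs: 200 },
  { stage: "LLM time-to-first-token", targetMs: 400 },
  { stage: "TTS first audio chunk", targetMs: 250 },
  { stage: "Audio2Face first blendshape", targetMs: 80 },
  { stage: "Renderer first frame", targetMs: 50 },
  { stage: "WebRTC delivery", targetMs: 80 },
];

// Running total: the time at which each stage's first output is due.
function cumulative(stages: StageBudget[]): { stage: string; dueAtMs: number }[] {
  let elapsed = 0;
  return stages.map((s) => {
    elapsed += s.targetMs;
    return { stage: s.stage, dueAtMs: elapsed };
  });
}

const budgetTimeline = cumulative(budget);
const totalBudgetMs = budget.reduce((sum, s) => sum + s.targetMs, 0);

for (const row of budgetTimeline) {
  console.log(`${row.dueAtMs} ms  ${row.stage}`);
}
console.log(`total: ${totalBudgetMs} ms`); // total: 1210 ms
```

The running sum puts the first TTS audio chunk at 1000 ms and the first blendshape at 1080 ms, with the remaining 130 ms going to the renderer and WebRTC delivery.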

Stack

Open, swappable, no lock-in.

Every model in the loop is open-source or vendor-swappable. The whole pipeline is designed to fit one workstation GPU.

Frontend: SvelteKit
Gateway: Bun + Hono
LLM: OpenRouter (Claude / GPT / Llama)
STT: faster-whisper large-v3-turbo
TTS: F5-TTS, voice-cloned
Animator: NVIDIA Audio2Face-3D
Renderer: GSAC (3D Gaussian Splatting)
Transport: H.264 / WebRTC
State: Redis
Training: RunPod H100 · ~$5 / run

Engineering

Where the milliseconds and gigabytes go.

Two budgets define the product: an end-to-end latency target of 1.2 seconds, and a single-GPU VRAM footprint of roughly 13 GB. Both are reproducible, traced, and visible in production.

End-to-end latency

Voice in → first phoneme out

Legend: GPU · CPU / network
  1. VAD: 150 ms
  2. STT: 200 ms
  3. LLM: 400 ms
  4. TTS: 250 ms
  5. Animator: 80 ms
  6. Renderer: 50 ms
  7. WebRTC: 80 ms
Time axis: 0 ms to 1200 ms, ticks every 200 ms

Total budget: 1210 ms · First phoneme at ~1000 ms · First mouth movement at ~1080 ms

VRAM allocation

13.0 GB on a single 5070 Ti.

The whole pipeline lives on one card. Headroom is deliberate — buffers and working set absorb the variance from streaming.

  • Whisper STT (large-v3-turbo, FP16): 2 GB
  • F5-TTS (voice clone + streaming): 3.5 GB
  • Audio2Face-3D (blendshape driver): 2.5 GB
  • GSAC renderer (Gaussian splat + encode): 4.5 GB
  • Headroom (buffers + working set): 0.5 GB
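
The 13.0 GB figure can be checked by summing the segments. The sketch below assumes the segment sizes are listed in the same order as the components, and takes 16 GB as the card capacity (the RTX 5070 Ti's VRAM size); variable names are illustrative.

```typescript
// Segment sizes paired with components in listed order (an assumption about the chart).
const vramAllocation: { component: string; gb: number }[] = [
  { component: "Whisper STT", gb: 2.0 },
  { component: "F5-TTS", gb: 3.5 },
  { component: "Audio2Face-3D", gb: 2.5 },
  { component: "GSAC renderer", gb: 4.5 },
  { component: "Headroom", gb: 0.5 },
];

const cardCapacityGb = 16; // RTX 5070 Ti

const allocatedGb = vramAllocation.reduce((sum, a) => sum + a.gb, 0);
const unallocatedGb = cardCapacityGb - allocatedGb;

for (const a of vramAllocation) {
  const share = ((a.gb / allocatedGb) * 100).toFixed(1);
  console.log(`${a.component}: ${a.gb} GB (${share} % of the footprint)`);
}
console.log(`allocated: ${allocatedGb} GB, left on a ${cardCapacityGb} GB card: ${unallocatedGb} GB`);
```

On these numbers the renderer is the largest single consumer, and a 16 GB card keeps 3 GB free beyond the planned 0.5 GB of headroom.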

Thinking

Three choices that defined the system.

Not obvious at the start. Obvious now — only because the system works.

Representation

Splats over meshes and NeRF.

Meshes need expensive scans and look uncanny at the edges. NeRF burns budget per frame. Splats train cheap, render cheap, and fail blurry — not wrong.

Execution

Streaming over batched.

Sub-1.2-second response means no stage can sit and wait. Every component selection collapses to one criterion: does it stream natively.

Stack

Open over proprietary.

Whisper, F5-TTS, Audio2Face, GSAC, OpenRouter. None of them ours. The right product role for Laika is orchestration and identity, not foundational research.

Engineering note · follow the log on GitHub ↗

Why Laika

A real product, not a demo.

Three commitments that shape every architecture decision in the codebase.

Open by default

Built for the open web.

Every model in the loop — Whisper, F5-TTS, Audio2Face, GSAC — is open-source. The LLM is OpenRouter, so you can swap Claude, GPT, or Llama without changing a line. Your stack stays yours.

  • Open-source models
  • Vendor-swappable LLM
  • BYO inference

Single-GPU footprint

Designed for one workstation.

The whole pipeline fits in roughly 13 GB of VRAM on a single 5070 Ti. No farm required, no per-session GPU markup. Train on a rented H100 for $3–6 per run, then run inference on hardware you already own.

  • ~13 GB live footprint
  • 30 fps at 720p
  • $3–6 per training run

Identity stays yours

Your face stays your face.

Voice and identity are trained from your own monocular capture. Inference runs on infrastructure you control — there is no Laika cloud you have to send your likeness through.

  • 5-minute capture
  • Self-hostable end-to-end
  • Identity tokens are local

Quickstart

Train an avatar in one command. Embed it in three lines.

The whole pipeline is one CLI plus one client SDK. Capture, train, serve, embed — the same flow whether you're shipping to a Svelte app, a Python service, or a curl request.

terminal
# 1. Train a personal avatar from a 5-minute capture
$ laika train --capture ./reference --out avatar.ply
  → Uploading 312 frames to RunPod H100…
  → COLMAP poses · SMPL-X fits · GSAC train (32 min)
  → Saved avatar.ply (84 MB) · cost $4.91

# 2. Serve it
$ laika serve --avatar avatar.ply --port 3000
  → Gateway ready · WebRTC on :3000 · GPU 12.4G live

import Laika from "@laika/sdk"

const session = await Laika.connect({
  avatar: "me-v3",
  video:  videoEl,
  audio:  micStream,
})

session.on("latency", m => hud.render(m))
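
The payload of the `latency` event is not specified here, so the following is a sketch under an assumed shape: one measured duration per stage, in milliseconds, keyed by hypothetical stage ids. It shows how a HUD might flag stages that exceed the per-stage targets.

```typescript
// Per-stage targets from the performance table; the stage ids are hypothetical.
const stageTargets: { stage: string; targetMs: number }[] = [
  { stage: "vad", targetMs: 150 },
  { stage: "stt", targetMs: 200 },
  { stage: "llm", targetMs: 400 },
  { stage: "tts", targetMs: 250 },
  { stage: "animator", targetMs: 80 },
  { stage: "renderer", targetMs: 50 },
  { stage: "webrtc", targetMs: 80 },
];

// Assumed payload shape: stage id -> measured milliseconds (stages may be missing).
type LatencySample = { [stage: string]: number | undefined };

// Returns a human-readable line for every stage that ran over its target.
function stagesOverBudget(sample: LatencySample): string[] {
  const over: string[] = [];
  for (const t of stageTargets) {
    const measured = sample[t.stage];
    if (measured !== undefined && measured > t.targetMs) {
      over.push(`${t.stage}: ${measured} ms (target ${t.targetMs} ms)`);
    }
  }
  return over;
}

const flaggedStages = stagesOverBudget({ vad: 140, stt: 230, llm: 380, tts: 250 });
console.log(flaggedStages); // ["stt: 230 ms (target 200 ms)"]
```

If the real event matches this shape, the handler could be passed straight to `session.on("latency", ...)`; otherwise only the mapping step changes.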

Training pipeline

From phone capture to deployable avatar.

One offline pipeline turns five minutes of monocular video into a deployable Gaussian-splat avatar. CPU prep happens on your workstation; the GSAC train runs on a rented H100. End to end it takes about three hours, most of it COLMAP pose estimation on the CPU; the GPU stage fits in a coffee break and costs less than lunch.

Total time: ~3.0 hr end-to-end
GPU cost: $3–6 per avatar
Output: 84 MB avatar.ply
  1. Capture · 5 min · Local
     Monocular phone video, 1080p, ±90° head turns

  2. Frame extract · ~30 s · Local CPU
     ffmpeg, 312 frames at 10 fps

  3. COLMAP poses · ~2 hr · Local CPU
     Structure-from-motion, per-frame camera poses

  4. SMPL-X fits · ~10 min · Local CPU
     Per-frame face / body parameter fits

  5. Upload · ~1 min · Network
     Compressed prep payload to RunPod

  6. GSAC train · ~32 min · RunPod H100 · $3.20
     30k iterations, face_priority on H100

  7. Download · ~30 s · Network
     avatar.ply (84 MB) + tracking metadata

Numbers from avatar-mvp-spec.md §9 · H100 hour rate at ~$6/hr · Steps 01–04 run on the workstation, no GPU required.
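
The headline time and cost follow from the step durations and the hourly rate. A minimal sketch of that arithmetic, assuming only the GSAC train minutes are billed (type and variable names are illustrative):

```typescript
interface PipelineStep {
  label: string;
  minutes: number;
  billedGpu: boolean; // true only for the rented H100 stage
}

// Durations copied from the seven steps above.
const trainingSteps: PipelineStep[] = [
  { label: "Capture", minutes: 5, billedGpu: false },
  { label: "Frame extract", minutes: 0.5, billedGpu: false },
  { label: "COLMAP poses", minutes: 120, billedGpu: false },
  { label: "SMPL-X fits", minutes: 10, billedGpu: false },
  { label: "Upload", minutes: 1, billedGpu: false },
  { label: "GSAC train", minutes: 32, billedGpu: true },
  { label: "Download", minutes: 0.5, billedGpu: false },
];

const h100UsdPerHour = 6; // "~$6/hr" from the note above

const totalMinutes = trainingSteps.reduce((sum, s) => sum + s.minutes, 0);
const gpuMinutes = trainingSteps
  .filter((s) => s.billedGpu)
  .reduce((sum, s) => sum + s.minutes, 0);
const gpuCostUsd = (gpuMinutes * h100UsdPerHour) / 60;

console.log(`total: ${totalMinutes} min (${(totalMinutes / 60).toFixed(1)} hr)`);
console.log(`GPU: ${gpuMinutes} min, about $${gpuCostUsd.toFixed(2)}`);
```

The steps sum to 169 minutes, roughly 2.8 hours, of which COLMAP accounts for 120; the 32 GPU minutes at $6/hr give the $3.20 shown on step 06.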

Telemetry · run 0042

A real training run, in the open.

Every avatar train is logged end-to-end. The numbers below come from a single 30k-iteration run on a single H100 — the same shape as every run we ship with the SDK.

Status: Completed 2026-04-22 14:08 UTC
Duration: 31m 47s

Training loss

L1 photometric loss · log scale

Final: 0.0083
Δ vs init: −99.0%
Plot axes: loss 0.83 to 0.008 (log scale), iter 0 to iter 30k

GPU utilization

H100 80GB · 32-minute window

Avg: 87% (plot axis 0% to 100%)

Output

avatar.ply

84 MB PLY · binary little-endian
Gaussians
178,432
Render speed
42 fps
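
A PLY file starts with a plain-text header that declares the encoding and the element counts, so the Gaussian count can be read without touching the binary payload. The sketch below parses such a header. The sample header is illustrative: the actual property layout of avatar.ply is not documented here, and storing Gaussians as `vertex` elements is an assumption based on common splat exports.

```typescript
interface PlyElement {
  name: string;
  count: number;
  properties: string[];
}

interface PlyHeader {
  format: string;
  version: string;
  elements: PlyElement[];
}

// Parses the text header of a PLY file (everything up to "end_header").
function parsePlyHeader(headerText: string): PlyHeader {
  const lines = headerText.split(/\r?\n/).map((line) => line.trim());
  if (lines[0] !== "ply") throw new Error("missing 'ply' magic line");
  const header: PlyHeader = { format: "", version: "", elements: [] };
  let current: PlyElement | null = null;
  for (const line of lines.slice(1)) {
    const words = line.split(/\s+/);
    const keyword = words[0] || "";
    if (keyword === "end_header") break;
    if (keyword === "format") {
      header.format = words[1] || "";
      header.version = words[2] || "";
    } else if (keyword === "element") {
      current = { name: words[1] || "", count: Number(words[2]), properties: [] };
      header.elements.push(current);
    } else if (keyword === "property" && current !== null) {
      // The property name is the last word; the type words come before it.
      current.properties.push(words[words.length - 1] || "");
    }
  }
  return header;
}

// Illustrative header only; a real splat file carries many more properties.
const sampleHeader = [
  "ply",
  "format binary_little_endian 1.0",
  "element vertex 178432",
  "property float x",
  "property float y",
  "property float z",
  "property float opacity",
  "end_header",
].join("\n");

const parsedHeader = parsePlyHeader(sampleHeader);
const vertexElement = parsedHeader.elements[0];
console.log(parsedHeader.format, vertexElement ? vertexElement.count : 0);
```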
Final L1 loss: 0.0083 (30,000 iterations)
PSNR: 32.1 dB (test split, hold-out frames)
SSIM: 0.961 (structural similarity)
Gaussians: 178,432 (in avatar.ply)
Train time: 31m 47s (single H100 80GB)
GPU cost: $3.20 (RunPod hourly rate)
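
A few of these figures can be cross-checked against each other. The sketch below is a sanity check, not part of the training tooling. It assumes the ~$6/hr H100 rate quoted for the training pipeline, 1 MB = 1,000,000 bytes, and PSNR computed on pixel values scaled to [0, 1].

```typescript
// Figures copied from the run 0042 telemetry.
const run0042 = {
  initialLoss: 0.83, // upper end of the loss plot axis, at iter 0
  finalLoss: 0.0083,
  trainSeconds: 31 * 60 + 47,
  gaussians: 178432,
  fileMegabytes: 84,
  renderFps: 42,
  psnrDb: 32.1,
};

const assumedUsdPerHour = 6; // assumption: the "~$6/hr" rate from the pipeline note
const streamFps = 30;

// Relative drop in L1 loss between iter 0 and iter 30k.
const lossReductionPct = (1 - run0042.finalLoss / run0042.initialLoss) * 100;
// Cost of the exact train time at the assumed hourly rate.
const runCostUsd = (run0042.trainSeconds / 3600) * assumedUsdPerHour;
// Average file bytes per Gaussian (assumes the payload dominates the file).
const bytesPerGaussian = (run0042.fileMegabytes * 1e6) / run0042.gaussians;
// Per-frame slack between the measured render speed and the 30 fps stream.
const frameSlackMs = 1000 / streamFps - 1000 / run0042.renderFps;
// RMSE implied by the PSNR, for a peak signal value of 1.0.
const impliedRmse = Math.pow(10, -run0042.psnrDb / 20);

console.log(`loss reduction: ${lossReductionPct.toFixed(1)} %`);
console.log(`cost at $${assumedUsdPerHour}/hr: $${runCostUsd.toFixed(2)}`);
console.log(`bytes per Gaussian: ${bytesPerGaussian.toFixed(0)}`);
console.log(`frame slack vs ${streamFps} fps: ${frameSlackMs.toFixed(1)} ms`);
console.log(`implied RMSE: ${impliedRmse.toFixed(4)}`);
```

On these inputs the loss reduction comes out at 99.0 % and the cost at about $3.18, which agrees with the listed $3.20 once the train time is rounded up to 32 minutes.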

Try it

Watch the pipeline run.

Type something. The same six stages your voice would travel through, animated with realistic timings, will fire here in the page.

Pipeline target < 1060 ms
  1. Whisper
  2. Router
  3. F5-TTS
  4. Animator
  5. Renderer
  6. WebRTC

Demo is fully client-side — no server round-trip. Numbers reflect the same per-stage budget the real pipeline targets.
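
A client-side animation like this needs little more than a lookup from elapsed time to the active stage. The sketch below assumes the six demo stages map onto the STT, LLM, TTS, animator, renderer and WebRTC targets from the budget table. That mapping is an inference from the 1060 ms total, not something the page states, and `activeStage` is a hypothetical name.

```typescript
// Demo stages with durations borrowed from the per-stage budget (assumed mapping).
const demoStages: { label: string; ms: number }[] = [
  { label: "Whisper", ms: 200 },
  { label: "Router", ms: 400 },
  { label: "F5-TTS", ms: 250 },
  { label: "Animator", ms: 80 },
  { label: "Renderer", ms: 50 },
  { label: "WebRTC", ms: 80 },
];

const demoTotalMs = demoStages.reduce((sum, s) => sum + s.ms, 0);

// Returns the stage that should be lit at a given elapsed time, or "done".
function activeStage(elapsedMs: number): string {
  let stageEnd = 0;
  for (const s of demoStages) {
    stageEnd += s.ms;
    if (elapsedMs < stageEnd) return s.label;
  }
  return "done";
}

console.log(demoTotalMs);       // 1060
console.log(activeStage(0));    // Whisper
console.log(activeStage(700));  // F5-TTS
console.log(activeStage(1060)); // done
```

In a browser this lookup would typically be called from a `requestAnimationFrame` loop with the time since the user pressed enter.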

Where Laika sits

A category of one — for now.

Talking-head SaaS gives you a face but not a real-time voice loop. Chatbots give you a voice loop but not a face. Laika is the first product to ship both, on a stack you can run.

Capability | Laika | Talking-head SaaS (D-ID, HeyGen) | Standard chatbot (ChatGPT, Claude)
Realtime voice → mouth sync | Yes | No | Not applicable
Photoreal 3D avatar | Yes | No | No
Custom-trained on you | Yes | Partial | No
Voice-cloned to you | Yes | Yes | No
Sub-1.2 s end-to-end response | Yes | No | Yes
Self-hostable | Yes | No | Partial
Open stack (no lock-in) | Yes | No | Partial
Single-GPU inference | Yes | No | Yes

Comparison reflects the current public capabilities of named products as of May 2026.

Product surfaces

Three views of the same session.

The conversation surface is the visible piece. Behind it, three developer surfaces track exactly what's happening — every one of them ships in the same SDK.

Fig. 02

Per-stage latency view, surfaced from Redis traces. Stage cycles light up as the pipeline progresses.

Fig. 03

Memory inspector. Every fact the avatar remembers is timestamped, scored, and editable.

Fig. 04

Live training output from a RunPod H100. Iteration, loss, PSNR, Gaussian count tick as the run progresses.

Common questions

Questions, answered.

Pulled from real conversations with developers and operators evaluating Laika. Missing something? Email us.

How long does training a custom avatar take?

About 3 hours end-to-end. The capture itself is 5 minutes of monocular phone video; COLMAP poses (~2 hr) and SMPL-X fits (~10 min) run on your workstation CPU, and the GSAC train takes about 32 minutes on a rented RunPod H100. A single training run costs $3–6 of compute.

What does it cost to run in production?

Inference is bound by the renderer + animator, and the whole pipeline fits in ~13 GB of VRAM. A single 5070 Ti or rented A10 (≈$0.30/hour on Vast.ai) handles one concurrent session at 30 fps. You buy the GPU once or pay per hour; there is no per-minute or per-token pricing from Laika.
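
For the rental route, the bill scales with hours and with concurrent sessions, since one GPU serves one session. A rough sketch of that arithmetic using the ≈$0.30/hour figure above; the function name and the usage patterns are illustrative, and LLM usage through OpenRouter is billed separately and not modelled here.

```typescript
// GPU rental cost in USD, rounded to cents.
// Assumes one GPU per concurrent session, as stated in the answer above.
function rentalCostUsd(hours: number, concurrentSessions: number, usdPerGpuHour: number): number {
  const gpuHours = hours * concurrentSessions;
  return Math.round(gpuHours * usdPerGpuHour * 100) / 100;
}

const a10UsdPerHour = 0.3;

// One always-on session for a 30-day month.
const alwaysOnMonth = rentalCostUsd(24 * 30, 1, a10UsdPerHour);
// One session, 8 hours a day for 22 working days.
const officeHoursMonth = rentalCostUsd(8 * 22, 1, a10UsdPerHour);

console.log(`always-on: $${alwaysOnMonth.toFixed(2)} / month`);
console.log(`office hours: $${officeHoursMonth.toFixed(2)} / month`);
```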

Can I bring my own LLM?

Yes. The gateway speaks OpenRouter, which fronts Claude, GPT, Llama, Mistral, and most open-weight models. For local-only deployments, point it at a vLLM endpoint serving Llama 3.3 on a second GPU.
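
Both OpenRouter and a vLLM server expose an OpenAI-compatible `/chat/completions` route, so swapping between them can come down to a base URL, a model id and an optional key. The sketch below only builds the request; how the Laika gateway is actually configured is not shown in this document, so `LlmTarget` and `buildChatRequest` are hypothetical, and the model ids and key placeholder are examples.

```typescript
interface LlmTarget {
  baseUrl: string;
  model: string;
  apiKey?: string; // a local vLLM server may not need one
}

// Hosted: OpenRouter's OpenAI-compatible endpoint.
const openRouterTarget: LlmTarget = {
  baseUrl: "https://openrouter.ai/api/v1",
  model: "meta-llama/llama-3.3-70b-instruct",
  apiKey: "YOUR_OPENROUTER_KEY",
};

// Local: vLLM's OpenAI-compatible server on its default port.
const localVllmTarget: LlmTarget = {
  baseUrl: "http://localhost:8000/v1",
  model: "meta-llama/Llama-3.3-70B-Instruct",
};

// Builds a streaming chat-completions request for either target.
function buildChatRequest(target: LlmTarget, userText: string) {
  const headers: { [key: string]: string } = { "Content-Type": "application/json" };
  if (target.apiKey) headers["Authorization"] = `Bearer ${target.apiKey}`;
  return {
    url: `${target.baseUrl}/chat/completions`,
    method: "POST",
    headers,
    body: JSON.stringify({
      model: target.model,
      stream: true, // tokens must stream so TTS can start early
      messages: [{ role: "user", content: userText }],
    }),
  };
}

const localRequest = buildChatRequest(localVllmTarget, "Hello");
console.log(localRequest.url);
```

The returned object maps directly onto `fetch(url, { method, headers, body })`; the streamed response then arrives as server-sent events.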

Is voice cloning consent-based?

Yes. F5-TTS clones from a reference recording you provide, and the gateway requires the trained voice token to match the active session. There is no public model that can be invoked against an arbitrary identity.
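
The gateway's actual check is not documented here. As a sketch of what "the voice token must match the active session" could look like, the following compares the token bound to a session against the one a synthesis request asks for. `AvatarSession`, `canSynthesize` and the token format are all hypothetical.

```typescript
// Hypothetical session record: the voice token is bound when the session is created.
interface AvatarSession {
  id: string;
  voiceToken: string;
}

// Compares two strings without returning early on the first mismatch,
// so the time taken does not reveal how much of the token matched.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// A synthesis request is allowed only for the voice bound to the session.
function canSynthesize(session: AvatarSession, requestedVoiceToken: string): boolean {
  return constantTimeEqual(session.voiceToken, requestedVoiceToken);
}

const demoSession: AvatarSession = { id: "s-1", voiceToken: "voice-abc123" };
console.log(canSynthesize(demoSession, "voice-abc123")); // true
console.log(canSynthesize(demoSession, "voice-xyz789")); // false
```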

Can I self-host the whole thing?

Yes — that's the default deployment. The repo ships a docker-compose that brings up the gateway, renderer, and Redis on one machine. There is no Laika cloud you have to route through.

What hardware do I actually need?

For inference: a single 16 GB+ GPU (5070 Ti, RTX 4090, A10, A100). For training: rent an H100 by the hour from RunPod — owning the card isn't worth it for occasional retrains. The CPU side runs comfortably on a modern workstation.

Ready when you are

Ship a face this quarter.

Get your trained avatar, the SDK, and direct access to engineering — for less than a week of cloud inference would cost.