Laika — Technical · Architecture, Performance, Engineering

Architecture

Every stage streams. No stage waits.

Mic to first phoneme in under 1.2 seconds means nothing in the loop sits and waits. Audio chunks flow as they arrive; tokens stream into TTS as the LLM emits them; blendshapes drive the renderer the moment the first 80 ms of audio lands.

01 · Mic

Browser

audio chunks
WS →
02 · STT GPU

Whisper

partial transcripts
stream →
03 · LLM

OpenRouter

token stream
SSE →
04 · TTS GPU

F5-TTS

voice-cloned audio
PCM →
05 · Animator GPU

A2F-3D

ARKit blendshapes 30Hz
JSON →
06 · Renderer GPU

GSAC

H.264 frames

Performance targets

Adding up to 1210 milliseconds.

Lifted from avatar-mvp-spec.md §6. Numbers assume streaming-capable models and the animator + renderer co-located on the same host.

Stage	Target	Notes
VAD detect end-of-speech	150 ms	Silero VAD, well-tuned
STT final transcript	200 ms	Streaming Whisper, last chunk
LLM time-to-first-token	400 ms	OpenRouter Haiku 4.5; varies by model
TTS first audio chunk	250 ms	F5-TTS, streaming
Audio2Face first blendshape	80 ms	Per-frame after first audio
Renderer first frame	50 ms	GSAC inference + encode
WebRTC delivery	80 ms	Local network
Total	~1210 ms	End-of-speech → first phoneme + matching mouth movement

Stack

Open, swappable, no lock-in.

Every model in the loop is open-source or vendor-swappable. The whole pipeline is designed to fit one workstation GPU.

Frontend: SvelteKit
Gateway: Bun + Hono
LLM: OpenRouter (Claude / GPT / Llama)
STT: faster-whisper large-v3-turbo
TTS: F5-TTS, voice-cloned
Animator: NVIDIA Audio2Face-3D
Renderer: GSAC (3D Gaussian Splatting)
Transport: H.264 / WebRTC
State: Redis
Training: RunPod H100 · ~$5 / run

Engineering

Where the milliseconds and gigabytes go.

Two budgets define the product: an end-to-end latency target of 1.2 seconds, and a single-GPU VRAM footprint of roughly 13 GB. Both are reproducible, traced, and visible in production.

End-to-end latency

Voice in → first phoneme out

GPU

CPU / network

VAD

150 ms
STT

200 ms
LLM

400 ms
TTS

250 ms
Animator

80 ms
Renderer

50 ms
WebRTC

80 ms

0ms

200ms

400ms

600ms

800ms

1000ms

1200ms

Total budget: 1210 ms · First phoneme at ~1000 ms · First mouth movement at ~1080 ms

VRAM allocation

13.0 GB on a single 5070 Ti.

The whole pipeline lives on one card. Headroom is deliberate — buffers and working set absorb the variance from streaming.

2G

3.5G

2.5G

4.5G

0.5G

Whisper STT

large-v3-turbo, FP16
F5-TTS

voice clone + streaming
Audio2Face-3D

blendshape driver
GSAC renderer

Gaussian splat + encode
Headroom

buffers + working set

Thinking

Three choices that defined the system.

Not obvious at the start. Obvious now — only because the system works.

Representation

Splats over meshes and NeRF.

Meshes need expensive scans and look uncanny at the edges. NeRF burns budget per frame. Splats train cheap, render cheap, and fail blurry — not wrong.

Execution

Streaming over batched.

Sub-1.2-second response means no stage can sit and wait. Every component selection collapses to one criterion: does it stream natively.

Stack

Open over proprietary.

Whisper, F5-TTS, Audio2Face, GSAC, OpenRouter. None of them ours. The right product role for Laika is orchestration and identity, not foundational research.

Engineering note · follow the log on GitHub ↗

Why Laika

A real product, not a demo.

Three commitments that shape every architecture decision in the codebase.

Open by default

Built for the open web.

Every model in the loop — Whisper, F5-TTS, Audio2Face, GSAC — is open-source. The LLM is OpenRouter, so you can swap Claude, GPT, or Llama without changing a line. Your stack stays yours.

Open-source models
Vendor-swappable LLM
BYO inference

Single-GPU footprint

Designed for one workstation.

The whole pipeline fits in roughly 13 GB of VRAM on a single 5070 Ti. No farm required, no per-session GPU markup. Train on a $5 H100 hour, then run inference on hardware you already own.

~13 GB live footprint
30 fps at 720p
$3–6 per training run

Identity stays yours

Your face stays your face.

Voice and identity are trained from your own monocular capture. Inference runs on infrastructure you control — there is no Laika cloud you have to send your likeness through.

5-minute capture
Self-hostable end-to-end
Identity tokens are local

Quickstart

Train an avatar in one command. Embed it in three lines.

The whole pipeline is one CLI plus one client SDK. Capture, train, serve, embed — the same flow whether you're shipping to a Svelte app, a Python service, or a curl request.

Request access

terminal

# 1. Train a personal avatar from a 5-minute capture
$ laika train --capture ./reference --out avatar.ply
  → Uploading 312 frames to RunPod H100…
  → COLMAP poses · SMPL-X fits · GSAC train (32 min)
  → Saved avatar.ply (84 MB) · cost $4.91

# 2. Serve it
$ laika serve --avatar avatar.ply --port 3000
  → Gateway ready · WebRTC on :3000 · GPU 12.4G live

client.ts

import Laika from "@laika/sdk"

const session = await Laika.connect({
  avatar: "me-v3",
  video:  videoEl,
  audio:  micStream,
})

session.on("latency", m => hud.render(m))

from laika import Laika

session = Laika.connect(
    avatar="me-v3",
    audio=mic_stream,
)

async for frame in session.stream():
    writer.write(frame)

# Create a session and get the WebRTC offer SDP back
$ curl -X POST https://laika.dynamics/sessions \
    -H "Authorization: Bearer $LAIKA_KEY" \
    -d '{"avatar":"me-v3"}'

→ {
    "session_id": "8a3f7c",
    "ws":         "wss://laika.dynamics/session/8a3f7c"
  }

Training pipeline

From phone capture to deployable avatar.

One offline pipeline turns five minutes of monocular video into a deployable Gaussian-splat avatar. CPU prep happens on your workstation; the GSAC train runs on a rented H100. The whole thing fits in a coffee break and costs less than lunch.

Total time

~3.0 hr

end-to-end

GPU cost

$3–6

per avatar

Output

84 MB

avatar.ply

01 · Capture

5 min

Local

Monocular phone video, 1080p, ±90° head turns
02 · Frame extract

~30 s

Local CPU

ffmpeg, 312 frames at 10 fps
03 · COLMAP poses

~2 hr

Local CPU

Structure-from-motion, per-frame camera poses
04 · SMPL-X fits

~10 min

Local CPU

Per-frame face / body parameter fits
05 · Upload

~1 min

Network

Compressed prep payload to RunPod
06 · GSAC train $3.20

~32 min

RunPod H100

30k iterations, face_priority on H100
07 · Download

~30 s

Network

avatar.ply (84 MB) + tracking metadata

Numbers from avatar-mvp-spec.md §9 · H100 hour rate at ~$6/hr · Steps 01–04 run on the workstation, no GPU required.

Telemetry · run 0042

A real training run, in the open.

Every avatar train is logged end-to-end. The numbers below come from a single 30k-iteration run on a single H100 — the same shape as every run we ship with the SDK.

Status

Completed 2026-04-22 14:08 UTC

Duration

31m 47s

Training loss

L1 photometric loss · log scale

Final

0.0083

Δ vs init

−99.0%

GPU utilization

H100 80GB · 32-minute window

Avg

87%

Output

avatar.ply

84 MB PLY · binary little-endian

Gaussians

178,432

Render speed

42 fps

Train your own avatar

Final L1 loss

0.0083

30,000 iterations

PSNR

32.1 dB

test split, hold-out frames

SSIM

0.961

structural similarity

Gaussians

178,432

in avatar.ply

Train time

31m 47s

single H100 80GB

GPU cost

$3.20

RunPod hourly rate

Try it

Watch the pipeline run.

Type something. The same six stages your voice would travel through, animated with realistic timings, will fire here in the page.

Pipeline target < 1060 ms

01 Whisper

—
02 Router

—
03 F5-TTS

—
04 Animator

—
05 Renderer

—
06 WebRTC

—

Elapsed 0 ms

Avatar response idle

Live

Demo is fully client-side — no server round-trip. Numbers reflect the same per-stage budget the real pipeline targets.

Where Laika sits

A category of one — for now.

Talking-head SaaS gives you a face but not a real-time voice loop. Chatbots give you a voice loop but not a face. Laika is the first product to ship both, on a stack you can run.

Capability	Laika	Talking-head SaaS D-ID, HeyGen	Standard chatbot ChatGPT, Claude
Realtime voice → mouth sync	Not applicable	No	Not applicable
Photoreal 3D avatar	Not applicable	No	No
Custom-trained on you	Not applicable	Partial	No
Voice-cloned to you	Not applicable	Yes	No
Sub-1.2 s end-to-end response	Not applicable	No	Yes
Self-hostable	Not applicable	No	Partial
Open stack (no lock-in)	Not applicable	No	Partial
Single-GPU inference	Not applicable	No	Yes

Comparison reflects the current public capabilities of named products as of May 2026.

Product surfaces

Three views of the same session.

The conversation surface is the visible piece. Behind it, three developer surfaces track exactly what's happening — every one of them ships in the same SDK.

Fig. 02

Per-stage latency view, surfaced from Redis traces. Stage cycles light up as the pipeline progresses.

Fig. 03

Memory inspector. Every fact the avatar remembers is timestamped, scored, and editable.

runpod h100 · gsac train

running

# gsac train --source /data/in --output /data/out/avatar.ply
→ Loaded 312 frames, 64 cameras (COLMAP)
→ Initialized 87,341 Gaussians from SfM points
→ face_priority: enabled (mask weight 4.0)
→ Starting 30k-iteration training loop…

[iter 14,326] loss=0.0241  psnr=28.9dB  gauss=164,712

Fig. 04

Live training output from a RunPod H100. Iteration, loss, PSNR, Gaussian count tick as the run progresses.

Common questions

Questions, answered.

Pulled from real conversations with developers and operators evaluating Laika. Missing something? Email us.

How long does training a custom avatar take?

About 30–60 minutes on a rented RunPod H100, end-to-end — including upload, COLMAP poses, SMPL-X fits, and the GSAC train. The capture itself is 5 minutes of monocular phone video. A single training run costs $3–6 of compute.

What does it cost to run in production?

Inference is bound by the renderer + animator, which fits in ~13 GB VRAM. A single 5070 Ti or rented A10 (≈$0.30/hour on Vast.ai) handles one concurrent session at 30 fps. You buy the GPU once or pay per hour — no per-minute or per-token pricing from Laika.

Can I bring my own LLM?

Yes. The gateway speaks OpenRouter, which fronts Claude, GPT, Llama, Mistral, and most open-weight models. For local-only deployments, point it at a vLLM endpoint serving Llama 3.3 on a second GPU.

Is voice cloning consent-based?

Yes — F5-TTS clones from a reference recording you provide, and the gateway requires the trained voice token to match the active session. There is no public model that can be invoked against arbitrary identity.

Can I self-host the whole thing?

Yes — that's the default deployment. The repo ships a docker-compose that brings up the gateway, renderer, and Redis on one machine. There is no Laika cloud you have to route through.

What hardware do I actually need?

For inference: a single 16 GB+ GPU (5070 Ti, RTX 4090, A10, A100). For training: rent an H100 by the hour from RunPod — owning the card isn't worth it for occasional retrains. The CPU side runs comfortably on a modern workstation.

Ready when you are

Ship a face this quarter.

Get your trained avatar, the SDK, and direct access to engineering — for less than a week of cloud inference would cost.

Reserve early access Talk to us

How the pipeline actually works.

Adding up to 1210 milliseconds.

Open, swappable, no lock-in.

Voice in → first phoneme out

13.0 GB on a single 5070 Ti.

Splats over meshes and NeRF.

Streaming over batched.

Open over proprietary.

Built for the open web.

Designed for one workstation.

Your face stays your face.

Train an avatar in one command. Embed it in three lines.

L1 photometric loss · log scale

H100 80GB · 32-minute window

avatar.ply

Questions, answered.

Ship a face this quarter.