Target: 10 concurrent voice sessions, 8K LLM context window. STT and TTS model weights are shared across sessions; LLM KV cache is per-session and dominates VRAM.

LLM VRAM (Q4_K_M quantization)

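| Model | Weights | KV @ 8K × 10 | Total VRAM |
| --- | --- | --- | --- |
| Gemma 3 4B | 2.5 GB | 8 GB | ~11 GB |
| Gemma 3 12B | 7 GB | 20 GB | ~29 GB |
| Gemma 3 27B | 16 GB | 30 GB | ~48 GB |
| Llama 3.1 8B | 5 GB | 15 GB | ~22 GB |
| Llama 3.3 70B | 40 GB | 40 GB | ~84 GB |
| Llama-4 Scout 109B MoE | 55 GB | 60 GB | ~118 GB |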
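To see why the KV cache is per-session and grows so quickly, a common back-of-envelope estimate multiplies the key and value vectors stored per token by context length and session count. The sketch below is illustrative only: the function name and the shape parameters (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions, not values taken from the table, and the table's figures are the source's own estimates, which this sketch does not try to reproduce.

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, sessions,
                   bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head. bytes_per_elem=2 assumes a 16-bit cache.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens * sessions


# Hypothetical model shape, chosen only to make the arithmetic concrete.
one_session = kv_cache_bytes(32, 8, 128, context_tokens=8192, sessions=1)
ten_sessions = kv_cache_bytes(32, 8, 128, context_tokens=8192, sessions=10)

print(f"1 session:   {one_session / GIB:.1f} GiB")
print(f"10 sessions: {ten_sessions / GIB:.1f} GiB")
```

Weights are paid once regardless of load, while this term is multiplied by every concurrent session, which is why it dominates at higher concurrency.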

STT VRAM (shared weights + per-session)

| Model | Weights | + 10 sessions | Total |
| --- | --- | --- | --- |
| Qwen3-ASR 1.7B | 3.4 GB | 2 GB | ~5.5 GB |
| Cohere Transcribe 2B | 4 GB | 2.5 GB | ~6.5 GB |
| faster-whisper large v3 (int8) | 1.5 GB | 1.5 GB | ~3 GB |
| Vosk | 0.2 GB | ~1 GB RAM | CPU-only |

TTS VRAM (shared weights + per-session)

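| Model | Weights | + 10 sessions | Total |
| --- | --- | --- | --- |
| VoxCPM2 | 2 GB | 3 GB | ~5 GB |
| Qwen3-TTS 1.7B | 3.4 GB | 4 GB | ~7.5 GB |
| CosyVoice2 | 1 GB | 2 GB | ~3 GB |
| F5-TTS | 1.5 GB | 3 GB | ~4.5 GB |
| Orpheus | 2 GB | 2.5 GB | ~4.5 GB |
| Kokoro / Piper | ≤ 0.3 GB | | CPU-friendly |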
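The three tables can be combined into a single VRAM budget for a full voice stack. The following is a minimal sketch that assumes the LLM, STT, and TTS models share one GPU and that their totals simply add; the dictionary and function names are illustrative, and only the GPU-resident models from the tables are included.

```python
# Totals in GB, copied from the tables above (10 sessions, 8K LLM context).
LLM_TOTAL_GB = {
    "Gemma 3 4B": 11, "Gemma 3 12B": 29, "Gemma 3 27B": 48,
    "Llama 3.1 8B": 22, "Llama 3.3 70B": 84, "Llama-4 Scout 109B MoE": 118,
}
STT_TOTAL_GB = {
    "Qwen3-ASR 1.7B": 5.5, "Cohere Transcribe 2B": 6.5,
    "faster-whisper large v3 (int8)": 3,
}
TTS_TOTAL_GB = {
    "VoxCPM2": 5, "Qwen3-TTS 1.7B": 7.5, "CosyVoice2": 3,
    "F5-TTS": 4.5, "Orpheus": 4.5,
}


def stack_vram_gb(llm, stt, tts):
    """Sum the approximate VRAM totals for one LLM + STT + TTS combination."""
    return LLM_TOTAL_GB[llm] + STT_TOTAL_GB[stt] + TTS_TOTAL_GB[tts]


balanced = stack_vram_gb("Gemma 3 12B", "faster-whisper large v3 (int8)", "CosyVoice2")
print(f"Gemma 3 12B + faster-whisper + CosyVoice2: ~{balanced} GB")
```

Because the per-model totals are themselves approximate, the sum is best read as a planning figure rather than a measured footprint.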

GPU tier matrix

| Tier | LLM | Minimum GPU | Stable GPU |
| --- | --- | --- | --- |
| Entry | Gemma 3 4B | RTX 4090 24 GB | RTX 6000 Ada 48 GB |
| Balanced | Gemma 3 12B | RTX 6000 Ada 48 GB | L40S 48 GB |
| Recommended | Gemma 3 27B / Llama 3 8B | H100 80 GB | MI300X 192 GB |
| Pro | Llama 3.3 70B | MI300X 192 GB | H200 141 GB |
| Flagship | Llama-4 Scout 109B | H200 141 GB | MI300X 192 GB |
Minimum: fits the stated workload at 4K context with no headroom. Stable: fits the workload at 8K context with ~30% VRAM headroom for fragmentation and burst concurrency.
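The headroom rule can be turned into a quick fit check. This sketch assumes one particular reading of "~30% headroom", namely that the GPU must hold the required VRAM plus 30% on top; the source does not say whether the 30% is measured against the requirement or against the card's capacity, and the function name is illustrative.

```python
def fits_with_headroom(required_gb, gpu_vram_gb, headroom=0.30):
    """Return True if required_gb plus the headroom fraction fits on the GPU.

    Assumption: headroom is a fraction of the requirement, not of the card.
    Pass headroom=0 to model the "Minimum" column (no headroom).
    """
    return required_gb * (1 + headroom) <= gpu_vram_gb


# Gemma 3 12B needs ~29 GB at 8K context for 10 sessions.
print(fits_with_headroom(29, 48))              # 48 GB card, 30% headroom
print(fits_with_headroom(48, 48))              # Gemma 3 27B on a 48 GB card
print(fits_with_headroom(48, 48, headroom=0))  # same, with no headroom
```

A requirement that only passes with `headroom=0` corresponds to the "Minimum" column: it loads, but leaves nothing for fragmentation or a burst of extra sessions.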

Host sizing per GPU node

| Resource | Minimum | Stable |
| --- | --- | --- |
| System RAM | 2× GPU VRAM | 3× GPU VRAM |
| NVMe | 100 GB | 500 GB |
| CPU | 8 cores | 16 cores |
| PCIe | Gen4 x8 | Gen4/5 x16 |
| Network | 1 Gbps | 10 Gbps |
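The system RAM row is the only one that depends on the GPU chosen, so it is worth computing per node. A minimal sketch follows, assuming "GPU VRAM" means the total VRAM installed in the node; the function name and tier labels are illustrative.

```python
# Multipliers from the host sizing table above.
RAM_MULTIPLIER = {"minimum": 2, "stable": 3}


def host_ram_gb(gpu_vram_gb, tier="stable"):
    """System RAM target in GB for a node with gpu_vram_gb of total GPU VRAM."""
    return RAM_MULTIPLIER[tier] * gpu_vram_gb


for vram in (24, 48, 80):
    print(f"{vram} GB GPU: minimum {host_ram_gb(vram, 'minimum')} GB RAM, "
          f"stable {host_ram_gb(vram, 'stable')} GB RAM")
```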

Sizing rules

Concurrency is KV-cache bound. Per-session KV memory scales linearly with context length, so halving the context length halves the per-session memory cost.
LLM throughput is memory-bandwidth bound, not FLOPs bound, so HBM3/HBM3e bandwidth matters more than core count.
vLLM’s paged KV cache and continuous batching give roughly 3× the concurrency of vanilla Ollama on the same hardware. Prefer vLLM for production deployments.
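The first rule can be applied directly to the LLM table: divide the "KV @ 8K × 10" figure by 10 to get a per-session cost, then scale it linearly by context length and session count. The sketch below assumes 8K means 8192 tokens and counts only weights plus KV cache, ignoring STT, TTS, and runtime overhead; the function names are illustrative.

```python
import math


def kv_gb(kv_gb_at_8k_x10, context_tokens, sessions):
    """Scale a table KV figure (10 sessions at 8K) to another workload."""
    per_session_at_8k = kv_gb_at_8k_x10 / 10
    return per_session_at_8k * (context_tokens / 8192) * sessions


def max_sessions(vram_budget_gb, weights_gb, kv_gb_at_8k_x10, context_tokens):
    """Sessions that fit once the shared weights are loaded (never negative)."""
    per_session = kv_gb(kv_gb_at_8k_x10, context_tokens, sessions=1)
    return max(0, math.floor((vram_budget_gb - weights_gb) / per_session))


# Gemma 3 12B row: 7 GB weights, 20 GB KV for 10 sessions at 8K.
print(kv_gb(20, context_tokens=4096, sessions=10))   # KV at half the context
print(max_sessions(48, 7, 20, context_tokens=8192))  # sessions on 48 GB at 8K
print(max_sessions(48, 7, 20, context_tokens=4096))  # sessions on 48 GB at 4K
```

Halving the context roughly doubles the session ceiling in this model, which is the practical meaning of "KV-cache bound".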

Topology options

Single host

Everything runs in one Kubernetes cluster (single-host or multi-node). This is the simplest deployment. It suits voice loads of fewer than 10 concurrent sessions on a single GPU box, optionally with a smaller CPU-only sidecar for the data tier.

Split (control plane + GPU)

Control-plane services (frontend, backend, database, LiveKit, cache) run on CPU hosts; the GPU-bound services (STT, TTS, LLM, agent worker) run on one or more GPU hosts. Connect the two over a private network: a VPN, WireGuard, Tailscale, or a private VPC peering link. For larger fleets, the split scales to many GPU hosts served by a single control-plane cluster.