Hardware sizing

LLM VRAM (Q4_K_M quantization)

Model	Weights	KV cache (8K context × 10 sessions)	Total VRAM
Gemma 3 4B	2.5 GB	8 GB	~11 GB
Gemma 3 12B	7 GB	20 GB	~29 GB
Gemma 3 27B	16 GB	30 GB	~48 GB
Llama 3.1 8B	5 GB	15 GB	~22 GB
Llama 3.3 70B	40 GB	40 GB	~84 GB
Llama-4 Scout 109B MoE	55 GB	60 GB	~118 GB

STT VRAM (shared weights + per-session)

Model	Weights	+ 10 sessions	Total
Qwen3-ASR 1.7B	3.4 GB	2 GB	~5.5 GB
Cohere Transcribe 2B	4 GB	2.5 GB	~6.5 GB
faster-whisper large v3 (int8)	1.5 GB	1.5 GB	~3 GB
Vosk	0.2 GB	~1 GB RAM	CPU-only

TTS VRAM (shared weights + per-session)

Model	Weights	+ 10 sessions	Total
VoxCPM2	2 GB	3 GB	~5 GB
Qwen3-TTS 1.7B	3.4 GB	4 GB	~7.5 GB
CosyVoice2	1 GB	2 GB	~3 GB
F5-TTS	1.5 GB	3 GB	~4.5 GB
Orpheus	2 GB	2.5 GB	~4.5 GB
Kokoro / Piper	≤ 0.3 GB	CPU-friendly	—
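To budget a whole voice stack, one option is to add the "Total" figure of one model from each table. The sketch below does that; it assumes the LLM, STT, and TTS models share a single GPU and that their footprints simply add, which the tables do not state. The dictionary and function names are illustrative.

```python
# Planning totals in GB, copied from the tables above (weights + 10 sessions).
LLM_TOTAL_GB = {"Gemma 3 4B": 11, "Gemma 3 12B": 29, "Llama 3.1 8B": 22}
STT_TOTAL_GB = {"Qwen3-ASR 1.7B": 5.5, "faster-whisper large v3 (int8)": 3}
TTS_TOTAL_GB = {"Qwen3-TTS 1.7B": 7.5, "CosyVoice2": 3}


def stack_vram_gb(llm: str, stt: str, tts: str) -> float:
    """Sum the planning totals for one LLM, one STT, and one TTS model.

    Assumption: all three run on the same GPU and their memory use adds up.
    """
    return LLM_TOTAL_GB[llm] + STT_TOTAL_GB[stt] + TTS_TOTAL_GB[tts]


budget = stack_vram_gb("Gemma 3 12B", "faster-whisper large v3 (int8)", "CosyVoice2")
print(f"{budget:.1f} GB")  # 29 + 3 + 3 = 35.0 GB
```

The result is a planning number only; measure real usage before committing to a GPU.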

GPU tier matrix

Tier	LLM	Minimum GPU	Stable GPU
Entry	Gemma 3 4B	RTX 4090 24 GB	RTX 6000 Ada 48 GB
Balanced	Gemma 3 12B	RTX 6000 Ada 48 GB	L40S 48 GB
Recommended	Gemma 3 27B / Llama 3.1 8B	H100 80 GB	MI300X 192 GB
Pro	Llama 3.3 70B	MI300X 192 GB	H200 141 GB
Flagship	Llama-4 Scout 109B	H200 141 GB	MI300X 192 GB

Minimum: fits the stated workload at 4K context with no headroom.

Stable: fits the stated workload at 8K context with ~30% VRAM headroom for fragmentation and burst concurrency.
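The 30% headroom rule can be checked with a few lines of arithmetic. This is a minimal sketch: `workload_gb` is whatever total you have budgeted for the GPU at your planned context length, and the function names and return labels are illustrative, not part of any tool.

```python
def vram_headroom(gpu_vram_gb: float, workload_gb: float) -> float:
    """Fraction of GPU memory left free once the workload is resident."""
    if gpu_vram_gb <= 0:
        raise ValueError("gpu_vram_gb must be positive")
    return (gpu_vram_gb - workload_gb) / gpu_vram_gb


def classify(gpu_vram_gb: float, workload_gb: float,
             target_headroom: float = 0.30) -> str:
    """Label a GPU/workload pair against the headroom target."""
    headroom = vram_headroom(gpu_vram_gb, workload_gb)
    if headroom < 0:
        return "does not fit"
    return "headroom ok" if headroom >= target_headroom else "tight"


print(classify(80, 35))  # (80 - 35) / 80 = 0.5625 -> headroom ok
print(classify(48, 35))  # (48 - 35) / 48 ~= 0.27  -> tight
print(classify(24, 35))  # negative headroom       -> does not fit
```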

Host sizing per GPU node

Resource	Minimum	Stable
System RAM	2× GPU VRAM	3× GPU VRAM
NVMe	100 GB	500 GB
CPU	8 cores	16 cores
PCIe	Gen4 x8	Gen4/5 x16
Network	1 Gbps	10 Gbps
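The host table can be turned into a small helper that derives a node spec from GPU memory. The sketch assumes `gpu_vram_gb` is the total VRAM installed in the node; `HostSpec` and `host_spec` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSpec:
    ram_gb: int
    nvme_gb: int
    cpu_cores: int
    network_gbps: int


def host_spec(gpu_vram_gb: int, stable: bool = True) -> HostSpec:
    """Apply the per-node table: RAM scales with VRAM, the rest is fixed."""
    if stable:
        return HostSpec(ram_gb=3 * gpu_vram_gb, nvme_gb=500,
                        cpu_cores=16, network_gbps=10)
    return HostSpec(ram_gb=2 * gpu_vram_gb, nvme_gb=100,
                    cpu_cores=8, network_gbps=1)


print(host_spec(48))                # stable: 144 GB RAM
print(host_spec(48, stable=False))  # minimum: 96 GB RAM
```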

Sizing rules

Concurrency is bound by the KV cache. Per-session KV memory grows linearly with context length, so halving the context length halves the per-session memory cost.
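The linear relationship follows from the standard KV-cache size formula: keys plus values, for every layer, KV head, and token. The sketch below uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, FP16 cache) chosen for round numbers. It counts raw cache tensors only, so it will not necessarily match the planning figures in the tables.

```python
GIB = 1024 ** 3


def kv_bytes_per_session(layers: int, kv_heads: int, head_dim: int,
                         context_tokens: int, bytes_per_value: int = 2) -> int:
    """Raw KV-cache size for one session at full context."""
    # Factor 2 covers the key tensor and the value tensor.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


def max_sessions(kv_budget_bytes: int, per_session_bytes: int) -> int:
    """How many full-context sessions fit in a given KV-cache budget."""
    return kv_budget_bytes // per_session_bytes


# Hypothetical shape, for illustration only.
full = kv_bytes_per_session(32, 8, 128, context_tokens=8192)
half = kv_bytes_per_session(32, 8, 128, context_tokens=4096)
print(full / GIB, half / GIB)        # 1.0 0.5
print(max_sessions(10 * GIB, full))  # 10
print(max_sessions(10 * GIB, half))  # 20
```

With the same KV budget, halving the context doubles the number of sessions that fit.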

LLM throughput is bound by memory bandwidth, not FLOPs: HBM3/HBM3e bandwidth matters more than core count.
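A rough way to see why bandwidth dominates: during decoding, a dense model streams its weights from memory once per generated token, so single-stream speed is capped near bandwidth divided by weight size. The sketch below encodes that ceiling. The bandwidth values are hypothetical placeholders rather than specifications of any GPU, and the estimate ignores KV-cache reads, batching, and kernel overheads.

```python
def decode_tokens_per_second(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumption: every generated token reads all weights from memory once.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_per_s / weights_gb


# 16 GB of weights on two hypothetical memory systems.
print(decode_tokens_per_second(16, 2000))  # 125.0
print(decode_tokens_per_second(16, 4000))  # 250.0
```

Doubling bandwidth doubles the ceiling, while adding compute does not change it.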

vLLM’s paged KV cache plus continuous batching gives roughly 3× the concurrency of vanilla Ollama on the same hardware. Prefer vLLM for production deployments.

Topology options

Single host

Everything runs in one Kubernetes cluster (single-host or multi-node). This is the simplest deployment. It suits voice loads of fewer than 10 concurrent sessions on a single GPU box, optionally with a smaller CPU-only sidecar for the data tier.

Split (control plane + GPU)

Control-plane services (frontend, backend, database, LiveKit, cache) run on CPU hosts; the GPU-bound services (STT, TTS, LLM, agent worker) run on one or more GPU hosts. Connect the two over a private network: a VPN, WireGuard, Tailscale, or a private VPC peering link.

For larger fleets, the split scales to many GPU hosts and a single control-plane cluster.
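The split can be written down as a simple placement map. This sketch only restates which service goes to which host pool; the pool labels `"cpu"` and `"gpu"` and the service keys are illustrative names, not identifiers from any deployment manifest.

```python
# Service -> host pool, following the split described above.
PLACEMENT = {
    "frontend": "cpu",
    "backend": "cpu",
    "database": "cpu",
    "livekit": "cpu",
    "cache": "cpu",
    "stt": "gpu",
    "tts": "gpu",
    "llm": "gpu",
    "agent-worker": "gpu",
}


def services_on(pool: str) -> list[str]:
    """List the services assigned to a pool, sorted by name."""
    return sorted(name for name, p in PLACEMENT.items() if p == pool)


print(services_on("gpu"))  # ['agent-worker', 'llm', 'stt', 'tts']
print(services_on("cpu"))  # ['backend', 'cache', 'database', 'frontend', 'livekit']
```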
