Elastic GPU Scale-Up — live burst-overflow on Modal

Adds 16 real renders above the in-house limit → Modal provisions ~8 H100 GPUs.

① Render demand

30 concurrent renders requested

▼routing…▼

② Scheduler / gateway — fill in-house first, spill the rest

30 → in-house (≤30)

0 → overflow renders to Modal
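
The routing rule in this panel (fill in-house first, spill the rest) can be sketched in a few lines. This is a minimal illustration, not the gateway's actual code: the function name `route` and the constant name are hypothetical, and only the capacity of 30 comes from the panel.

```python
# Capacity of the fixed in-house cluster, taken from the panel ("in-house (≤30)").
IN_HOUSE_CAPACITY = 30


def route(demand: int, in_house_capacity: int = IN_HOUSE_CAPACITY) -> tuple[int, int]:
    """Split concurrent render demand into (in_house, overflow).

    In-house slots are filled first; anything above capacity spills to Modal.
    """
    if demand < 0:
        raise ValueError("demand must be non-negative")
    in_house = min(demand, in_house_capacity)
    overflow = demand - in_house
    return in_house, overflow


print(route(30))  # (30, 0): everything fits in-house
print(route(46))  # (30, 16): 16 renders overflow to Modal
```

With 30 requested renders nothing spills; the 16-render burst described at the top of the page corresponds to a demand of 46.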

in-house▼ ▼overflow

③a In-house cluster — fixed, always on

0 / 30 concurrent renders · capacity never changes · flat cost

Fleet idle — 0 GPUs up. $0/hr.

Live stats: H100 GPUs up · Rendering now · Renders queued · Utilization · Free slots · Burn (H100 est.): $0/hr

Render slots: each box = one real H100 (2 slots; blue = a render; red bar = cold start)
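
The fleet figures on this page follow from simple slot arithmetic, sketched below under stated assumptions. Only the 2 slots per H100 comes from the legend; the names `gpus_needed`, `fleet_stats` and `FleetStats` are hypothetical, and `price_per_gpu_hr` is a caller-supplied parameter because the page does not state the list price.

```python
import math
from dataclasses import dataclass

SLOTS_PER_GPU = 2  # from the legend: each H100 box holds 2 render slots


@dataclass
class FleetStats:
    gpus_up: int
    rendering: int      # renders currently occupying a slot
    queued: int         # overflow renders still waiting for a slot
    free_slots: int
    utilization: float  # busy slots / total slots, 0.0 when no GPU is up
    burn_per_hr: float  # gpus_up * price_per_gpu_hr


def gpus_needed(overflow: int, slots_per_gpu: int = SLOTS_PER_GPU) -> int:
    """GPUs required to give every overflow render its own slot."""
    return math.ceil(overflow / slots_per_gpu)


def fleet_stats(overflow: int, gpus_up: int, price_per_gpu_hr: float,
                slots_per_gpu: int = SLOTS_PER_GPU) -> FleetStats:
    """Derive the dashboard figures from overflow demand and GPUs currently up."""
    slots = gpus_up * slots_per_gpu
    rendering = min(overflow, slots)
    queued = overflow - rendering
    free_slots = slots - rendering
    utilization = rendering / slots if slots else 0.0
    return FleetStats(gpus_up, rendering, queued, free_slots,
                      utilization, gpus_up * price_per_gpu_hr)


print(gpus_needed(16))  # 8
# price_per_gpu_hr=1.0 is a placeholder, not a real H100 price.
print(fleet_stats(overflow=16, gpus_up=0, price_per_gpu_hr=1.0))
```

Sixteen overflow renders at 2 slots per GPU give the ~8 H100s quoted above, and with 0 GPUs up the burn is $0/hr whatever the price, which matches the idle-fleet line.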

Queue:

GPUs over time

Live events — what the scheduler & Modal are doing:

waiting for a spike…

Open the real Modal dashboard ↗

Two honest modes.

⚡ Live is real Modal autoscaling, but the worker is a tiny CPU image (~100 MB) that Modal starts in seconds. It shows the mechanism, not the renderer's real speed.

🎬 higgs-avatar replays, in real time, the production replica's measured 405 s from cold to traffic-ready: a 91 s fast-boot, then a self-warmup render (JIT/compile plus CUDA-graph capture) that runs privately before the port opens. Watch the queue pile up until the first GPU is ready. That is why the design pre-warms: tick "pre-warmed pool" to bring cold start down to ~0 s.

Numbers in Live mode come from live get_current_stats() calls; the 405 s is measured and the GPU burn is estimated from list prices (see “Proven”).
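
To make the cold-start argument concrete, here is a deliberately simplified model of how long renders wait while the first replica comes up. Assumptions: scale-up is triggered by the first arrival, the queue drains as soon as the replica is ready (slot contention is ignored), and a pre-warmed pool means a cold start of 0 s. Only the 405 s figure comes from the page; the function names are hypothetical.

```python
COLD_START_S = 405.0  # measured cold -> traffic-ready time quoted on this page


def ready_time(first_arrival_s: float, cold_start_s: float = COLD_START_S,
               prewarmed: bool = False) -> float:
    """Time at which the first GPU can accept traffic."""
    return first_arrival_s if prewarmed else first_arrival_s + cold_start_s


def waits(arrivals_s: list[float], cold_start_s: float = COLD_START_S,
          prewarmed: bool = False) -> list[float]:
    """Seconds each render waits before a GPU is ready to take it."""
    if not arrivals_s:
        return []
    ready = ready_time(min(arrivals_s), cold_start_s, prewarmed)
    return [max(0.0, ready - t) for t in arrivals_s]


def queue_depth(arrivals_s: list[float], at_s: float,
                cold_start_s: float = COLD_START_S,
                prewarmed: bool = False) -> int:
    """Renders queued at time at_s; 0 once the replica is ready (simplified)."""
    if not arrivals_s:
        return 0
    if at_s >= ready_time(min(arrivals_s), cold_start_s, prewarmed):
        return 0
    return sum(1 for t in arrivals_s if t <= at_s)


arrivals = [0, 60, 120, 300, 400, 500]  # illustrative arrival times in seconds
print(waits(arrivals))                  # cold: early renders wait the longest
print(waits(arrivals, prewarmed=True))  # pre-warmed: nobody waits
print(queue_depth(arrivals, at_s=200))  # 3 renders queued at t = 200 s
```

In this model the render that triggers the scale-up waits the full 405 s, while with a pre-warmed pool every wait is zero.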

Proven, not promised

Measured against the live production deployment (the repo's e2e CI test).

What	Measured
Cold replica → traffic-ready (fast-boot + self-warmup, deployed)	405 s
Warm provision	~0.4 s (< 1 min)
GPU snapshot / compile-cache shortcuts	ruled out by measurement
Production API surface	74 / 74 endpoints served (curl-verified)
Render (warm replica)	init.fmp4 in ~3–9 s · 3 s audio clip in ~9–12 s
Idle cost	$0 (scales to zero)