Live · real Modal GPUs · one button

Watch render demand overflow onto Modal — in real time

In-house capacity covers 30 concurrent renders. When demand spikes above that, the scheduler spills the excess to Modal, which provisions real H100 GPUs and scales back to zero when idle. Toggle to the real cold-start timing to see the honest cost.
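The fill-then-spill rule described above can be sketched in a few lines. This is only an illustration of the routing arithmetic, not the scheduler's actual code: the function name `split_demand` and its parameters are hypothetical, and only the capacity of 30 comes from the page.

```python
def split_demand(requested: int, in_house_capacity: int = 30) -> tuple[int, int]:
    """Split requested concurrent renders into (in_house, overflow).

    Hypothetical helper: fills the fixed in-house cluster first and
    sends whatever does not fit to the burst provider.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    in_house = min(requested, in_house_capacity)
    overflow = requested - in_house
    return in_house, overflow


print(split_demand(30))  # (30, 0)
print(split_demand(46))  # (30, 16)
```

At exactly 30 requested renders nothing spills; every render beyond the in-house line becomes overflow.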

Adds 16 real renders over the in-house line → Modal provisions ~8 H100 GPUs (2 render slots per GPU).
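The "16 renders → ~8 GPUs" figure follows from the two render slots per H100 shown in the render-slot legend. A minimal sketch of that arithmetic, assuming every GPU is packed to its slot limit (the name `gpus_needed` is hypothetical, and the real autoscaler decides the actual count):

```python
import math


def gpus_needed(overflow_renders: int, slots_per_gpu: int = 2) -> int:
    """Smallest number of GPUs whose slots cover the overflow renders.

    Assumes each GPU runs up to `slots_per_gpu` renders at once; this
    is an estimate, not the provider's autoscaling policy.
    """
    if overflow_renders <= 0:
        return 0
    return math.ceil(overflow_renders / slots_per_gpu)


print(gpus_needed(16))  # 8
print(gpus_needed(5))   # 3
```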
① Render demand
30 concurrent renders requested
② Scheduler / gateway — fill in-house first, spill the rest
30 → in-house (≤30)
0 → overflow renders to Modal
in-house▼      ▼overflow
③a In-house cluster — fixed, always on
0 / 30 concurrent renders · capacity never changes · flat cost
Fleet idle — 0 GPUs up. $0/hr.
H100 GPUs up: 0
Rendering now: 0
Renders queued: 0
Utilization: 0%
Free slots: 0
Burn (H100 est.): $0/hr
Render slots — each box = one real H100 (2 slots; blue = a render; red bar = cold-start):
Queue:
GPUs over time
Live events — what the scheduler & Modal are doing:
waiting for a spike…
Two honest modes. ⚡ Live is real Modal autoscaling, but the worker is a tiny CPU image (~100 MB) that Modal starts in seconds: great for seeing the mechanism, not the renderer's real speed. 🎬 wan-s2v models the production renderer's measured 131 s cold start (93 GB image + ~50 GB weights → H100) in real time, so you can watch the queue pile up until the first GPU is ready. That's why the design pre-warms: tick "pre-warmed pool" to bring the cold start down to ~0 s. Numbers in Live mode come live from get_current_stats(); the 131 s cold start is a measured value and the H100 burn is a list-price estimate (see “Proven”).

Proven, not promised

Measured against the live production deployment (the repo's e2e CI test).

| What | Measured |
| --- | --- |
| Cold provision, wan-s2v renderer (scale-zero → serving) | 131 s (< 3 min) |
| Warm provision | ~0.4 s (< 1 min) |
| Memory-snapshot restore (alpha) | ~57 s |
| Production API surface | 74 / 74 endpoints served (curl-verified) |
| Render throughput | ~365 ms / block · ~44 s / clip (offline) |
| Idle cost | $0 (scales to zero) |
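The two provisioning rows pair a measurement with a bound (< 3 min, < 1 min). One way such bounds could be checked automatically is sketched below. Treating the bounds as pass/fail budgets is an assumption, the dictionary keys and the `over_budget` helper are hypothetical, and this is not the repo's e2e CI test.

```python
# Measured provisioning times from the table, in seconds.
MEASURED_S = {"cold_provision": 131.0, "warm_provision": 0.4}

# Bounds from the table, read here as budgets (assumption): 3 min and 1 min.
BUDGET_S = {"cold_provision": 180.0, "warm_provision": 60.0}


def over_budget(measured: dict[str, float], budget: dict[str, float]) -> list[str]:
    """Return the names of measurements that are not strictly under budget."""
    return sorted(name for name, value in measured.items() if value >= budget[name])


failures = over_budget(MEASURED_S, BUDGET_S)
print("all within budget" if not failures else f"over budget: {failures}")
```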