§ case / Lumen Labs

Multi-tenant LLM inference platform

Built a GPU-backed inference platform on GKE that auto-scales from 0 to 200 pods with cold starts under 4 s.
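The custom-metric scaling could look roughly like the HPA manifest below. This is a hedged sketch, not the actual Lumen Labs config: the deployment name and the `inference_queue_depth` metric are hypothetical, and plain HPA cannot scale to zero (that typically requires KEDA or a similar scale-to-zero controller in front of it).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference          # hypothetical deployment name
  namespace: tenant-a          # one isolated namespace per tenant
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1               # scale-to-zero needs KEDA or similar, not plain HPA
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # hypothetical custom metric per pod
        target:
          type: AverageValue
          averageValue: "10"            # add pods when avg queue depth exceeds 10
```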

Designed a multi-tenant inference stack with per-tenant token-based rate limiting, isolated namespaces, and shared model caches. Horizontal autoscaling is driven by custom metrics.