Autoscaling Large Language Model Services: Policies, Signals, and Costs
Autoscaling LLM services calls for specialized signals such as prefill queue size and slots_used, not CPU or GPU utilization. Learn how to balance cost and latency with real-world policies for chatbots, batch jobs, and real-time applications.
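To make the idea concrete before diving in, here is a minimal sketch of a scaling decision driven by serving signals rather than CPU or GPU utilization. The metric names (`prefill_queue_size`, `slots_used`, `slots_total`) and the thresholds are illustrative assumptions, not a specific serving framework's API:

```python
# Sketch: replica-count decision based on LLM-serving signals.
# All names and thresholds here are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class ServingMetrics:
    prefill_queue_size: int  # requests waiting for prefill on a replica
    slots_used: int          # decode slots currently occupied
    slots_total: int         # decode slots available per replica


def desired_replicas(current: int, m: ServingMetrics,
                     queue_high: int = 8, util_low: float = 0.3) -> int:
    """Scale out when prefill requests back up; scale in when slots sit idle."""
    slot_util = m.slots_used / m.slots_total
    if m.prefill_queue_size > queue_high:
        return current + 1          # latency at risk: add capacity
    if slot_util < util_low and current > 1:
        return current - 1          # paying for idle decode slots
    return current                  # steady state


# Example: a backed-up prefill queue triggers a scale-out.
print(desired_replicas(2, ServingMetrics(prefill_queue_size=12,
                                         slots_used=20, slots_total=24)))  # -> 3
```

The point of the sketch is the choice of inputs: queue depth and slot occupancy reflect request-level pressure on an LLM server in a way that aggregate CPU or GPU utilization does not.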