Most engineers treat LLM serving benchmarks as a simple speed test. They fire up a script, measure the tokens per second, and declare victory. But that approach misses the point entirely. In production, your users don’t care about peak theoretical throughput. They care about whether their request times out when ten other people are asking questions at the same time. Real-world inference is messy. It involves cold starts, cache evictions, network jitter, and unpredictable context lengths. If you only test with synthetic, uniform loads, you are building a system that looks great in a slide deck but collapses under actual traffic.
To build reliable AI infrastructure, you need to move beyond basic metrics. You must simulate realistic loads and production patterns. This means combining hardware stress tests with user-experience simulations. It requires understanding the difference between what your GPU can physically do and what your application actually delivers to the end-user. Let’s break down how to structure these benchmarks so they reflect reality, not just marketing numbers.
The Difference Between Load Testing and Performance Benchmarking
First, we need to clear up a common confusion. Load testing and performance benchmarking are distinct disciplines that serve different purposes in LLM deployment. Think of it this way: performance benchmarking tells you how fast your model runs on specific hardware. Load testing tells you how many users your server can handle before it breaks.
Performance benchmarking isolates model efficiency. It measures metrics like Time-to-First-Token (TTFT) and inter-token latency. It helps you decide if Model A is faster than Model B on an NVIDIA H100. Load testing simulates concurrent requests. It reveals issues with autoscaling, network bandwidth, and memory saturation. According to technical documentation from NVIDIA, you need both. If you only benchmark performance, you might deploy a model that is incredibly fast for one user but crashes when five users hit it simultaneously. If you only load test, you might miss subtle inefficiencies in the decoding stage that waste compute resources.
- Performance Benchmarking: Focuses on throughput, latency, and token-level metrics. Best for comparing models or optimization techniques.
- Load Testing: Focuses on concurrent connections, error rates, and resource utilization under stress. Best for capacity planning and SLA verification.
Client-Side vs. Server-Side: Who Are You Measuring?
Where you run your benchmark script changes the results dramatically. Most teams default to client-side testing because it feels more "real." But there is a trap here.
Server-side benchmarking runs the measurement script on the same machine as the model server. This eliminates network latency and gives you the purest view of hardware capability. If you are trying to figure out whether upgrading from A100s to H100s will double your throughput, use server-side benchmarks. Providers like Baseten recommend this approach for infrastructure optimization because it removes the noise of the public internet.
Client-side benchmarking runs the script from a separate machine, mimicking a real user. This includes network latency, DNS resolution, and TLS handshake overhead. This is the metric that matters for your users’ experience. If your server-side TTFT is 100ms but your client-side TTFT is 800ms due to bad network routing, your users will complain. The rule of thumb is simple: server-side shows maximum potential; client-side shows actual delivery. For production readiness, you must prioritize client-side metrics, especially Time-to-First-Token (TTFT), because that is the moment users decide if your app is "fast" or "slow."
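A minimal client-side TTFT probe might look like the sketch below. It assumes an OpenAI-compatible streaming endpoint at a placeholder URL and model name; each SSE chunk is treated as one token event, which is an approximation because some servers pack several tokens into a single event.

```python
import time

import requests

# Placeholder endpoint and model name; substitute your own deployment.
ENDPOINT = "http://localhost:8000/v1/completions"
MODEL = "my-model"

def measure_ttft(prompt: str, max_tokens: int = 128, endpoint: str = ENDPOINT) -> dict:
    """Send one streaming request and record client-side TTFT and total latency."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    ttft = None
    chunk_times = []

    with requests.post(endpoint, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line == b"data: [DONE]":
                break
            now = time.perf_counter()
            if ttft is None:
                ttft = now - start          # first token as seen by the *client*
            chunk_times.append(now)

    total = time.perf_counter() - start
    # Inter-token latency approximated as the mean gap between streamed chunks.
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    return {
        "ttft_s": ttft,
        "total_s": total,
        "mean_inter_token_s": sum(gaps) / len(gaps) if gaps else None,
    }

if __name__ == "__main__":
    print(measure_ttft("Explain KV caching in two sentences."))
```

Run the same probe from inside the server's network and from a client region to see how much of your TTFT budget is being spent on the wire.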
Simulating Realistic Workloads: Beyond Synthetic Data
Synthetic data is clean, but it lies. In a synthetic test, every prompt is 50 tokens long, and every response is 100 tokens. In production, some prompts are 5 words, others are 10,000 words. Some responses are "Yes," others are code blocks.
To get accurate results, you need to mimic production patterns. TensorMesh uses a tiling pattern methodology that cycles through contexts sequentially, appending random questions to maintain cache pressure and force cache evictions. Why? Because caching is the biggest lever in LLM serving performance. If your benchmark doesn’t stress the KV cache, you aren’t seeing the true cost of your serving stack.
When designing your load profile, consider these variables (a short profile-generator sketch follows the list):
- Context Length Distribution: Use a mix of short and long contexts. Long contexts consume more VRAM and slow down pre-filling.
- Concurrency Levels: Start low and ramp up. Watch for the point where TTFT spikes non-linearly.
- Cache Hit Rates: Simulate repeated queries to test prefix caching effectiveness.
- Cold Starts: Measure the time from idle to first response. This is critical for serverless deployments.
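Here is a minimal sketch of such a profile generator. The prefix list, length distribution, and ramp schedule are illustrative assumptions; replace them with numbers pulled from your own production traces.

```python
import random

# Assumed knobs; tune the distributions to match your own traffic.
SHARED_PREFIXES = [
    "You are a helpful assistant. " * 20,      # long system prompt, cache-friendly
    "Summarize the following document:\n",     # shorter, different prefix
]

def sample_request(rng: random.Random) -> dict:
    """Sample one request: mixed context lengths plus repeated prefixes for cache hits."""
    prefix = rng.choice(SHARED_PREFIXES) if rng.random() < 0.6 else ""
    # Heavy-tailed context length: mostly short prompts, occasionally very long ones.
    body_words = int(rng.lognormvariate(mu=4.0, sigma=1.2))   # median around 55 words
    prompt = prefix + " ".join(f"w{i}" for i in range(body_words))
    max_tokens = rng.choice([16, 64, 256, 1024])
    return {"prompt": prompt, "max_tokens": max_tokens}

def build_profile(duration_s: int = 300, ramp_steps: int = 5, peak_concurrency: int = 50):
    """Yield (window_end, concurrency, requests) tuples that ramp load up in steps."""
    rng = random.Random(42)
    step_s = duration_s // ramp_steps
    for step in range(1, ramp_steps + 1):
        concurrency = peak_concurrency * step // ramp_steps
        requests = [sample_request(rng) for _ in range(concurrency * 10)]
        yield (step * step_s, concurrency, requests)

for window_end, concurrency, reqs in build_profile():
    print(f"up to t={window_end:>3}s: {concurrency} concurrent clients, {len(reqs)} queued requests")
```

Watch the TTFT curve as each step kicks in; the step where it bends upward non-linearly is your practical capacity limit.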
Ignore the first 30-60 seconds of any benchmark. This is the "warm-up" period in which the GPU driver initializes, memory is allocated, and kernels are JIT-compiled. Metrics from this phase are useless for steady-state analysis. Look at the steady state after several complete context rotations.
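A trivial filter like the one below is enough to enforce that cutoff; it assumes each sample carries a wall-clock timestamp recorded by the load generator.

```python
WARMUP_S = 60.0  # assumed warm-up window; adjust to your stack

def steady_state(samples: list[dict], start_time: float) -> list[dict]:
    """Drop every sample recorded during the warm-up window.

    Each sample is assumed to carry a 'ts' field (wall-clock seconds) set when
    the request completed, alongside whatever latency fields you collect.
    """
    cutoff = start_time + WARMUP_S
    return [s for s in samples if s["ts"] >= cutoff]
```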
Key Metrics That Actually Matter
You don’t need to track fifty metrics. You need to track the ones that impact cost and user satisfaction. Here are the critical KPIs for LLM serving stacks:
| Metric | Definition | Why It Matters |
|---|---|---|
| QPS (Queries Per Second) | Total successful requests processed per second. | Determines throughput capacity. Higher QPS means lower infrastructure cost per request. |
| TTFT (Time-to-First-Token) | Latency from request submission to the first token generated. | Critical for user perception. High TTFT feels like a broken app. Target <1s for chat interfaces. |
| Inter-Token Latency | Average time between subsequent tokens. | Affects reading experience. Consistent latency is better than bursty latency. |
| P90/P99 Latency | Latency experienced by the slowest 10% or 1% of requests. | Reveals tail risks. Average latency hides outliers that cause timeouts. |
| GPU Utilization | Percentage of GPU compute units active. | Indicates efficiency. Low utilization means wasted money; high utilization risks thermal throttling. |
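Once the load generator has collected per-request latencies, the table above reduces to a few lines of aggregation. The sketch below uses a simple nearest-rank percentile; the field names are illustrative.

```python
def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for benchmark reporting."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

def summarize(latencies_s: list[float], duration_s: float, errors: int) -> dict:
    """Aggregate one steady-state window into the KPIs from the table above."""
    ok = len(latencies_s)
    return {
        "qps": ok / duration_s,
        "error_rate": errors / (ok + errors) if (ok + errors) else 0.0,
        "p50_s": percentile(latencies_s, 50),
        "p90_s": percentile(latencies_s, 90),
        "p99_s": percentile(latencies_s, 99),
    }
```

Report P90/P99 alongside the mean; the mean alone will hide exactly the tail behavior that causes user-visible timeouts.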
Note that NVIDIA GenAI-Perf introduces concepts like TPS (Transactions Per Second) computed in batch after the run rather than from live metrics. It also excludes "warming up" and "cooling down" requests to improve accuracy. This matters because overhead from input generation and response storage can account for 33% of total benchmark duration in single-concurrency scenarios. Always isolate true model performance from infrastructure overhead.
Comparing Serving Stacks: vLLM, SGL, and Others
The landscape of LLM serving engines is evolving rapidly. vLLM has been the industry standard for continuous batching and PagedAttention. However, newer contenders are emerging. SGL (Structured Generation Language), for instance, has shown early promise in benchmarks conducted by Sider AI, outperforming vLLM on certain metrics, particularly for structured outputs and constrained decoding.
But "best" depends on your workload. If you are running simple text completion, vLLM’s maturity and ecosystem support might outweigh marginal gains from newer tools. If you are building an agent-heavy application with strict JSON output requirements, SGL’s structured generation capabilities could reduce post-processing errors and latency. Always benchmark your specific model and workload. Do not rely on generic leaderboards.
Iterative Optimization: The Feedback Loop
Benchmarking is not a one-time event. It is an iterative process. You will tweak configurations, adjust concurrency settings, and rerun tests dozens of times. This is why developer experience (DevEx) matters. Slow feedback loops kill productivity.
To accelerate iteration:
- Use Caching: Avoid redownloading weights for billion-parameter models. Cache checkpoints locally.
- Automate Scripts: Create scripts like `startup_and_benchmark.sh` that automate the start-measure-analyze cycle.
- Monitor Continuously: Integrate Prometheus and Grafana to track metrics in real time. Watch for performance degradation over time through interval comparison.
Shortening the time between configuration change and result acquisition directly improves benchmark quality. If it takes 20 minutes to restart a container and run a test, you will stop optimizing. If it takes 2 minutes, you will find the optimal sweet spot.
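In Python, one iteration of that start-measure-analyze cycle might look like the following sketch. The serve command and health URL are placeholders for whatever engine and flags you are testing.

```python
import subprocess
import time

import requests

# Placeholder serve command and health URL; substitute your own engine and flags.
SERVE_CMD = ["python", "-m", "my_engine.serve", "--port", "8000"]
HEALTH_URL = "http://localhost:8000/health"

def wait_until_ready(timeout_s: float = 300.0) -> None:
    """Poll the health endpoint until the server answers or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(HEALTH_URL, timeout=2).ok:
                return
        except requests.RequestException:
            pass
        time.sleep(2)
    raise TimeoutError("server never became healthy")

def run_cycle(benchmark) -> dict:
    """One start-measure-analyze iteration: launch, wait, benchmark, tear down."""
    server = subprocess.Popen(SERVE_CMD)
    try:
        wait_until_ready()
        return benchmark()      # e.g. drive the load profile built earlier
    finally:
        server.terminate()
        server.wait(timeout=30)
```

Keeping model weights cached locally is what makes this loop fast; re-downloading a multi-gigabyte checkpoint on every restart destroys the two-minute feedback cycle.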
Production Validation: From Lab to Live
Lab benchmarks are necessary but insufficient. You must validate against production patterns. Segment’s testing of LLM-powered systems using real queries provides essential capacity planning data. Synthetic loads cannot replicate the chaos of real user behavior: typos, incomplete sentences, sudden traffic spikes.
Implement shadow testing. Route a small percentage of live traffic to your new serving stack without affecting users. Compare its performance against the current production system. This reveals hidden bottlenecks in networking, database connections, or upstream API dependencies that synthetic tests miss.
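A minimal shadow-testing wrapper, with placeholder endpoints and a hypothetical handle_request() entry point, could mirror a sampled fraction of payloads like this:

```python
import random
import threading
import time

import requests

PROD_URL = "http://prod-stack:8000/v1/completions"         # current production stack
SHADOW_URL = "http://candidate-stack:8000/v1/completions"   # stack under evaluation
SHADOW_FRACTION = 0.05                                      # mirror ~5% of live traffic

def _shadow_call(payload: dict) -> None:
    """Replay the request against the candidate; log latency, discard the response."""
    start = time.perf_counter()
    try:
        requests.post(SHADOW_URL, json=payload, timeout=120)
        print(f"shadow latency: {time.perf_counter() - start:.3f}s")
    except requests.RequestException as exc:
        print(f"shadow request failed: {exc}")

def handle_request(payload: dict) -> dict:
    """Serve the user from production; mirror a sampled fraction to the shadow stack."""
    if random.random() < SHADOW_FRACTION:
        threading.Thread(target=_shadow_call, args=(payload,), daemon=True).start()
    resp = requests.post(PROD_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()
```

The user-facing path never waits on the shadow call, so the candidate stack can fail or lag without affecting anyone.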
Finally, establish Service Level Objectives (SLOs). Define acceptable TTFT and latency thresholds (e.g., P50=0.75s, P90=2s). Use these targets to guide your infrastructure decisions. If a new model improves accuracy but doubles TTFT, does it meet your SLO? If not, it’s not ready for production, regardless of its benchmark scores.
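Encoding the SLO as a hard gate keeps that decision mechanical. The snippet below assumes the summary dict produced by the aggregation sketch earlier; the thresholds mirror the example targets above.

```python
# Example SLO thresholds from the text; adjust per product surface.
SLOS = {"p50_s": 0.75, "p90_s": 2.0}

def meets_slo(summary: dict) -> bool:
    """Return True only if every measured percentile is within its SLO threshold."""
    return all(summary.get(metric, float("inf")) <= limit for metric, limit in SLOS.items())
```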
What is the most important metric for LLM serving performance?
For user experience, Time-to-First-Token (TTFT) is the most critical metric. It determines how quickly a user sees the start of a response. For cost efficiency, Queries Per Second (QPS) or Tokens Per Dollar is key. Both must be balanced based on your application type.
Should I use client-side or server-side benchmarking?
Use both. Server-side benchmarking isolates hardware performance and eliminates network noise, making it ideal for comparing GPUs or models. Client-side benchmarking includes network latency and reflects the actual user experience, making it essential for production readiness.
How do I simulate realistic loads for LLMs?
Combine synthetic data with production traffic patterns. Vary context lengths, include cache hits and misses, and simulate concurrent users. Use tools like TensorMesh or custom scripts to create tiling patterns that stress the KV cache and mimic real-world usage.
Why should I ignore the first 30-60 seconds of a benchmark?
The initial period includes "cold start" effects such as GPU driver initialization, kernel compilation, and memory allocation. These factors skew latency and throughput metrics. Steady-state performance, measured after warm-up, provides a more accurate representation of ongoing system behavior.
Is vLLM always the best choice for LLM serving?
No. While vLLM is highly optimized for general-purpose text generation, newer frameworks like SGL may offer advantages for specific workloads, such as structured output generation or complex constraint decoding. Always benchmark your specific model and use case to determine the best fit.