When developers scale LLM workloads to production, the same questions always come up: which GPUs should I use, how many will I need, and how much is this going to cost? Not back-of-the-envelope guesses, but real numbers that reflect the latency you need, the model you're running, and the traffic you expect.
These are deceptively simple questions, but the answers depend on a web of factors: token lengths, request volume, GPU type, model architecture (dense vs. MoE), load profile, and more. Most infrastructure calculators assume uniform conditions that don't match real-world inference and don't let you compare alternatives side by side.
Sizing for LLM inference is more nuanced than checking whether your model fits in memory. Token throughput, concurrency, and the quirks of GPU memory bandwidth all shape real-world capacity. Even detailed spec sheets and calculators (like those from VMware or NVIDIA) can't fully capture the dynamic nature of production traffic — burst loads, variable prompt sizes, and shifting latency requirements.
FlexAI's Inference Sizer is a developer-first tool that translates your workload parameters into actionable GPU requirements, factoring in the details that matter in production. You specify your model (from Hugging Face), input/output token sizes, and request volume — and it outputs a deployment-ready GPU configuration. The tool is powered by FlexBench, our open-source benchmarking framework built to produce MLPerf-grade results on commodity hardware. With FlexAI's auto-scaling capability (in beta), you provide your maximum requests per second and the tool handles the rest — optimizing for cost, latency, or TTFT depending on your priority.
GPU Sizing Example: LLaMA 3.1 8B on L40
Let's say you're building a production-ready internal chatbot using Meta's LLaMA 3.1 8B Instruct model. Your expected workload:
- Input: 256 tokens (multi-turn chat context)
- Output: 128 tokens (typical assistant response)
- Target RPS: 10 requests per second
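Before reaching for a sizing tool, it's worth sanity-checking the raw token demand this workload implies. A minimal back-of-the-envelope sketch in plain Python, using only the numbers from the list above (this is generic accounting, not the Sizer's internal model):

```python
# Back-of-the-envelope token demand for the chatbot workload above.
input_tokens = 256    # multi-turn chat context
output_tokens = 128   # typical assistant response
target_rps = 10       # requests per second

tokens_per_request = input_tokens + output_tokens      # 384
required_throughput = tokens_per_request * target_rps  # 3,840 tokens/sec

print(f"Required throughput: {required_throughput:,} tokens/sec")
```

Any recommended configuration has to sustain at least this aggregate throughput while also meeting your latency target; the Sizer does the same accounting with real benchmark data behind it.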
The Inference Sizer recommends 1× L40 with the following projected performance:
| Metric | Value |
|---|---|
| Projected Throughput | ~181K tokens/sec |
| Max QPS Capacity | ~19.3 |
| End-to-End Latency | ~4 sec |
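A useful way to read this table is as headroom. A quick sketch, using only the capacity and target figures above:

```python
# Headroom check: target load vs. the Sizer's projected max capacity.
target_rps = 10
max_qps = 19.3   # Max QPS Capacity from the table above

utilization = target_rps / max_qps
print(f"Utilization at target load: {utilization:.0%}")  # ~52%
```

Roughly half the GPU's capacity is spare at the target load, which leaves room for traffic spikes before a second card is needed.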
This setup is cost-efficient and sufficient for internal tools or moderately interactive applications. But what if you need faster response times?
LLM Inference Latency Tradeoff: H100 vs. H200
For the same LLaMA 3.1 8B workload, here's how H100 and H200 compare on key latency metrics:
| GPU | Time to First Token | Time per Output Token | End-to-End Latency | Relative Cost |
|---|---|---|---|---|
| H100 | ~12.5 ms | ~7.3 ms | ~934 ms | Lower |
| H200 | ~11.4 ms | ~5.3 ms | ~679 ms | Higher |
For a 128-token response, that's roughly a 250 ms difference end-to-end. In a chat context where tokens are streamed, this gap is rarely noticeable to users — perceived responsiveness comes from how quickly the first tokens arrive, not how long the last ones take.
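The end-to-end figures follow almost directly from the per-token numbers. A sketch of that decomposition, using the table values for a streamed 128-token response (queuing and network overhead ignored; the small deviations from the table come from rounding in the per-token figures):

```python
# E2E latency for streamed generation, approximated as
# TTFT + output_tokens * TPOT (queuing and network overhead ignored).
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int = 128) -> float:
    return ttft_ms + output_tokens * tpot_ms

h100 = e2e_latency_ms(ttft_ms=12.5, tpot_ms=7.3)  # ~947 ms
h200 = e2e_latency_ms(ttft_ms=11.4, tpot_ms=5.3)  # ~690 ms
print(f"H100 ~{h100:.0f} ms, H200 ~{h200:.0f} ms, gap ~{h100 - h200:.0f} ms")
```

Because the gap accumulates token by token, streaming masks it: the user is already reading while the remaining tokens arrive.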
When to Pay More for GPU Inference Performance
The H200 can be 30–40% more expensive depending on your cloud provider and runtime. For an internal chatbot, there are three reasons the H100 is usually the better call:
- Streaming hides the latency gap. Users see responses start almost immediately on both GPUs.
- 250 ms won't change the UX. The perceptual difference at this scale is negligible.
- H100 already meets sub-1s E2E latency. That's within SLA range for most internal tools.
The H200 starts to make sense when sub-800 ms E2E latency at p95 is mission-critical — real-time customer support, multi-hop agent chains, or latency-sensitive API endpoints. For cost-sensitive internal tools or async agents, H100 delivers nearly identical UX at a significantly better price point.
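One way to frame the decision is latency bought per unit of extra spend. A rough sketch using the table's E2E figures and the 30–40% premium mentioned above (the premium is an illustrative range; actual pricing varies by provider and runtime):

```python
# Latency bought per unit of extra spend: H100 vs. H200.
h100_e2e_ms, h200_e2e_ms = 934, 679     # E2E latencies from the table
premium_low, premium_high = 0.30, 0.40  # H200 price premium (illustrative range)

latency_gain = (h100_e2e_ms - h200_e2e_ms) / h100_e2e_ms
print(f"Latency improvement: {latency_gain:.0%}, "
      f"cost premium: {premium_low:.0%}-{premium_high:.0%}")
```

A ~27% latency improvement for a 30–40% cost premium only pays off when that latency sits on the critical path, which is exactly the sub-800 ms p95 scenario described above.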
Handling GPU Availability, Variable Traffic, and Compliance
In practice, GPU sizing isn't just a latency-and-cost optimization. Teams also deal with:
- GPU availability: Your preferred SKU may not be available in your region or cloud. The Inference Sizer suggests alternative GPU configurations so you always have a fallback plan.
- Variable workloads: Traffic to inference endpoints is rarely uniform. FlexAI's auto-scaling matches your workload elasticity without over-provisioning.
- BYOC flexibility: If you have hyperscaler credits (AWS, GCP, Azure), you can deploy on your own cloud or on FlexAI's, with no quota calls required.
- Compliance constraints: For EU-based companies or other compliance-constrained environments, FlexAI supports region-locked deployments.
Whether you're building for chat, RAG, summarization, or interactive UI flows, the goal is the same: right-size your infrastructure to balance cost, latency, and throughput.
What's Next
We're expanding the Inference Sizer to support additional model types, including diffusion models, and evolving it into a co-pilot that assists as you launch inference workloads. Fine-tuning and training sizing are on the roadmap as well.
If you want to try the sizer or compare GPU configurations for your workload, sign up for FlexAI — every account comes with $100 in credits to benchmark before committing.