Don’t Wing Your Infra - A Developer’s Guide to Sizing GPUs for AI Inference

September 5, 2025 | FlexAI

When developers scale LLM workloads to production, there's one question that always comes up: which GPUs should I use, how many will I need, and how much is this going to cost me? Not an average or a back-of-the-envelope guess: real numbers that reflect the latency you need, the model you’re running, and the traffic you expect.

It’s a deceptively simple question, but the answer depends on a multitude of factors:

  • token lengths
  • request volume
  • GPU type
  • model architecture
  • load profile
  • ...and more.

This is only made more complicated by the fact that most infra calculators assume uniform conditions that don’t match real-world inference, and they offer no way to compare the options side by side.

Sizing for LLM inference is more nuanced than just checking if your model fits in memory. Model architecture (dense vs MoE), token throughput, concurrency, and the quirks of GPU memory bandwidth all shape your real-world capacity. Even the most detailed spec sheets and calculators (like those from VMware or NVIDIA) can’t fully capture the dynamic nature of production traffic: burst loads, variable prompt sizes, and shifting latency requirements.
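
To make the “does it fit” part concrete, here is a minimal back-of-the-envelope sketch (not the Sizer’s internal model): weight memory plus a KV cache that grows with every token held in flight across your concurrent requests. The layer and head counts below are illustrative assumptions for an 8B-class model with grouped-query attention, not vendor specs.

```python
# Minimal "does it fit" sketch -- assumes FP16/BF16 weights and a standard
# transformer KV-cache layout. Illustrative only; not the Sizer's internal model.

def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the model weights."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                tokens_in_flight: int, bytes_per_value: int = 2) -> float:
    """KV cache: one K and one V vector per token, per layer, per KV head."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens_in_flight / 1024**3

# Hypothetical 8B-class model with grouped-query attention, serving 20
# concurrent requests of 384 tokens each (256 in + 128 out):
weights = weight_memory_gb(8)
cache = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                    tokens_in_flight=20 * 384)
print(f"weights ~ {weights:.1f} GB, KV cache ~ {cache:.2f} GB")
# -> weights ~ 14.9 GB, KV cache ~ 0.94 GB
```

Even when those numbers fit comfortably on a card, whether the GPU can actually sustain your throughput and latency targets under real traffic is a separate question, and that is the part that has to be benchmarked.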

FlexAI’s Inference Sizer is a developer-first tool that translates your workload parameters into actionable GPU requirements, factoring in the details that matter in production. No sign-up walls, no black-box estimates. Just clear, model-aware answers.

With FlexAI, Serving LLMs Isn’t a Science Project Anymore

Inference has gone from a research afterthought to a core infra problem. Developers need answers to very practical questions:

  1. Can I serve Qwen-14B at 500 RPS without exceeding 200ms latency?
  2. Will a single A100 suffice for my summarization endpoint, or do I need to scale out?
  3. Is L4 a viable alternative to H100 for my Gemma UI?
  4. What’s my fallback if the GPU I want isn’t available?

The FlexAI Inference Sizer answers these by letting you specify your LLM (from Hugging Face), your input/output token sizes, and your request volume in requests per second. It is built on benchmarking done with FlexBench, our internal tool used to publish SOTA MLPerf results, and it simulates real-world serving behavior. With FlexAI’s auto-scaling capability (in beta), you only need to know your maximum number of requests per second and the calculator will output a deployment-ready GPU configuration. You can also tune the recommendation to optimize for cost, end-to-end latency, TTFT, and more.

Real-Life GPU Sizing Example: How it Works

Let’s say you’re building a production-ready internal chatbot using Meta’s LLaMA 3.1 8B Instruct model. Your expected workload:

  • Input: 256 tokens (multi-turn chat context)
  • Output: 128 tokens (typical assistant response)
  • Target RPS: 10 requests per second

When you hit Calculate Resource Requirements, here’s what you get:

Recommended GPU: 1× L40

  • Projected Throughput: ~181K tokens/sec
  • Max QPS Capacity: ~19.3
  • End-to-End Latency: ~4 sec

This setup is cost-efficient and sufficient for internal tools or moderately interactive applications.
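
As a sanity check on what the Sizer reports, here is the simplest steady-state arithmetic (uniform requests, no batching or queueing effects), using the numbers from the example above:

```python
# Back-of-the-envelope check on the 1x L40 recommendation -- a rough steady-state
# approximation, not how the Sizer's benchmark-driven estimate works.

input_tokens, output_tokens = 256, 128   # workload from the example above
target_rps = 10

required_tokens_per_sec = target_rps * (input_tokens + output_tokens)
print(f"Required throughput ~ {required_tokens_per_sec:,} tokens/sec")  # 3,840

max_qps_capacity = 19.3                  # reported by the Sizer for 1x L40
print(f"Headroom ~ {max_qps_capacity / target_rps:.1f}x over the target")  # ~1.9x
```

That roughly 2x headroom over the 10 RPS target is what makes a single L40 a comfortable fit here.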

Hardware Recommendations

Latency Tradeoff: H100 vs. H200

However, if you're optimizing for faster response times, the Sizer lets you compare:

GPU  | Time to First Token | Time per Output Token | End-to-End Latency | Relative Cost
H100 | ~12.5 ms            | ~7.3 ms               | ~934 ms            | Lower
H200 | ~11.4 ms            | ~5.3 ms               | ~679 ms            | Higher

Let’s say your chatbot generates a 128-token response:

  • On H100, it would take ~934 ms
  • On H200, it would take ~679 ms

That’s about a 250 ms difference end-to-end. In a chat context, where tokens are streamed, this difference is rarely noticeable. Users perceive responsiveness from how quickly the first tokens arrive, not how long the last ones take.
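
If you want to see where those end-to-end figures come from, a common approximation is time to first token plus one decode step per remaining output token; it lands within a few percent of the benchmarked numbers in the table above.

```python
# Approximate E2E latency from the table above -- a simple decomposition, not
# FlexBench's measurement method (real serving adds queueing and batching effects).

def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Time to first token, then one decode step per remaining output token."""
    return ttft_ms + (output_tokens - 1) * tpot_ms

print(f"H100: ~{e2e_latency_ms(12.5, 7.3, 128):.0f} ms")  # ~940 ms (Sizer reports ~934 ms)
print(f"H200: ~{e2e_latency_ms(11.4, 5.3, 128):.0f} ms")  # ~684 ms (Sizer reports ~679 ms)
# With streaming, users mostly feel the TTFT (~11-13 ms); the ~250 ms gap only
# shows up at the tail of the response.
```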

Cost vs. Value

The H200 can be 30–40% more expensive depending on your cloud provider and runtime. For an internal chatbot:

  • Streaming hides most of the latency gap (users see responses start almost immediately).
  • The 250 ms improvement won’t materially change the user experience.
  • H100 already meets a sub-1 second E2E latency target.

What FlexAI helps you decide:

  • If sub-800 ms E2E latency at p95 is mission critical (e.g. real-time customer support or multi-hop agent chains), H200 could be justified.
  • But for a cost-sensitive internal tool or async agent, H100 gives you nearly identical UX at a significantly better price point.

Whether you’re building for chat, RAG, summarization, or interactive UI flows, the FlexAI Inference Sizer helps you right-size your infrastructure: balancing cost, latency, and throughput in one clear view.

Deploy your model with one click using free credits from FlexAI to benchmark your workload before committing, or just try the Sizer by signing up.

What Makes the FlexAI Inference Sizer Different?

FlexAI bridges the planning-to-deployment gap that often slows down AI teams.

Most AI infrastructure platforms promise scale, but few deliver true flexibility or developer autonomy.

FlexAI’s Workload as a Service solution was designed from the ground up by engineers from Apple, Intel, NVIDIA, and Tesla. FlexAI abstracts away infrastructure management with autoscaling, GPU pooling, and intelligent placement—so your models get the resources they need, when they need them.

  • Most calculators stop at “does it fit?” → FlexAI goes further, suggesting alternative GPU types and integrating directly with deployment.
  • Preferred GPU out of stock? → FlexAI gives you a fallback plan.
  • Variable workload? → FlexAI offers auto-scaling options that match unpredictable request volumes.
  • Not sure how much it will cost? → FlexAI was built for developers, so you can go from sizing to live endpoint in one flow: no sales calls, no guesswork.
  • Got hyperscaler credits? → Deploy on your own cloud (BYOC) or on the FlexAI cloud, no quota calls required. If you’re a startup, check out our startup acceleration program.
  • Compliance constraints? → FlexAI fully supports compliance-constrained environments, such as those of EU-based companies.

FlexAI solves the real bottlenecks of modern AI deployment: infra complexity and fragility, underutilized compute, and a lack of transparency into costs.

Deploy Inference with Confidence

Serving LLMs shouldn't require guesswork. Whether you're a startup experimenting with open-weight models or a platform team pushing to meet SLA requirements, FlexAI’s Inference Sizer gives you the numbers you need—and the ability to act on them.

Next steps

We’re not stopping here. In the coming months, we will add more models to the Sizer (including diffusion models) and turn it into a full-fledged co-pilot to use as you launch your inference workloads. Finally, expect to see the same for fine-tuning and training. Stay tuned!

Need help or want to compare setups? Join our Slack or drop us a line at support@flex.ai.

Use FlexAI’s Inference Sizer Now

You’ll get GPU estimates, alternative configs, and the ability to spin up endpoints instantly.

Better yet, every sign-up comes with €100 in FlexAI credits so you can test out your deployments.


Get Started Today

To celebrate this launch, we’re offering €100 in starter credits for first-time users!

Get Started Now