Question 1

What is the serverless/dedicated crossover point?

Accepted Answer

The crossover point is the monthly token volume at which a dedicated GPU endpoint becomes cheaper than paying per token on a serverless API. Below the crossover, serverless is the cheaper option. Above it, a dedicated endpoint billed at a flat hourly rate works out cheaper per token than the serverless per-token rate.
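The definition above can be sketched as a one-line break-even calculation. This is a minimal illustration, not the calculator's actual implementation: the function names, the 730-hour month, and the example prices are all assumptions, and it presumes the endpoint runs around the clock and has enough throughput to serve the volume.

```python
def crossover_tokens_per_month(dedicated_hourly_usd: float,
                               serverless_usd_per_million: float,
                               hours_per_month: float = 730.0) -> float:
    """Monthly token volume at which both options cost the same.

    Assumes an always-on dedicated endpoint (hypothetical 730-hour month)
    and a single blended serverless price per million tokens.
    """
    if serverless_usd_per_million <= 0:
        raise ValueError("serverless rate must be positive")
    monthly_fixed_usd = dedicated_hourly_usd * hours_per_month
    # Fixed cost divided by price per million gives millions of tokens.
    return monthly_fixed_usd / serverless_usd_per_million * 1_000_000


def cheaper_option(tokens_per_month: float,
                   dedicated_hourly_usd: float,
                   serverless_usd_per_million: float,
                   hours_per_month: float = 730.0) -> str:
    """Return which option costs less at the given monthly volume."""
    crossover = crossover_tokens_per_month(
        dedicated_hourly_usd, serverless_usd_per_million, hours_per_month)
    return "dedicated" if tokens_per_month > crossover else "serverless"


# Illustrative prices only, not real list prices.
crossover = crossover_tokens_per_month(4.00, 0.50)
print(f"Crossover: {crossover:,.0f} tokens/month")
print(cheaper_option(1_000_000_000, 4.00, 0.50))
print(cheaper_option(10_000_000_000, 4.00, 0.50))
```

With real inputs, substitute your own hourly GPU rate and a blended input/output token price.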

Question 2

Why is dedicated cheaper at high volume?

Accepted Answer

Serverless inference adds a margin on every token to cover shared infrastructure and provider overhead. A dedicated GPU is billed only for compute time, so for an endpoint that runs continuously the monthly cost is fixed regardless of how many tokens you generate. At high enough volume, that fixed cost, spread across all the tokens generated, falls below the per-token serverless rate.
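To see the amortization effect in numbers, the sketch below spreads a fixed monthly cost over increasing token volumes. The monthly cost, the serverless rate, and the function name are hypothetical values chosen for illustration only.

```python
def dedicated_usd_per_million(monthly_fixed_usd: float,
                              tokens_per_month: float) -> float:
    """Effective price per million tokens when the monthly cost is fixed."""
    if tokens_per_month <= 0:
        raise ValueError("token volume must be positive")
    return monthly_fixed_usd / (tokens_per_month / 1_000_000)


MONTHLY_FIXED_USD = 3000.0          # hypothetical always-on GPU cost
SERVERLESS_USD_PER_MILLION = 0.50   # hypothetical serverless list price

for volume in (1e9, 5e9, 10e9, 50e9):
    effective = dedicated_usd_per_million(MONTHLY_FIXED_USD, volume)
    verdict = "dedicated" if effective < SERVERLESS_USD_PER_MILLION else "serverless"
    print(f"{volume:>16,.0f} tokens -> ${effective:.3f}/M tokens ({verdict} cheaper)")
```

The effective dedicated price falls in inverse proportion to volume, while the serverless price per token stays flat.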

Question 3

Which GPU does FlexAI use for each model?

Accepted Answer

Configurations are derived from each model's memory footprint at the most efficient supported precision. Small models (≤ 30B) fit on a single NVIDIA H100 at FP16. Dense 70B and 100B-class MoE models fit on a single AMD Instinct MI300X (192 GB HBM3) at FP16 or FP8, which avoids multi-GPU interconnect overhead. Very large MoE models (480B–671B) scale out to multi-GPU MI300X clusters. As B200 capacity comes online, the largest models will need fewer GPUs per deployment.
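A rough way to reproduce this sizing logic is to multiply parameter count by bytes per parameter and compare the result with GPU memory. The sketch below is an approximation under stated assumptions, not FlexAI's actual sizing method: it counts weights only (no KV cache, activations, or runtime overhead), assumes the 80 GB H100 variant, and uses hypothetical helper names.

```python
import math

BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}
# MI300X capacity is from the text above; 80 GB for H100 is an assumption
# about which H100 variant is deployed.
GPU_MEMORY_GB = {"H100": 80, "MI300X": 192}


def weight_footprint_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1e9 params x bytes / 1e9 bytes per GB)."""
    return params_billions * BYTES_PER_PARAM[precision]


def gpus_needed(params_billions: float, precision: str, gpu: str,
                headroom: float = 0.0) -> int:
    """Minimum GPU count to hold the weights, with optional fractional headroom."""
    required_gb = weight_footprint_gb(params_billions, precision) * (1 + headroom)
    return math.ceil(required_gb / GPU_MEMORY_GB[gpu])


print(gpus_needed(30, "fp16", "H100"))     # 60 GB of weights
print(gpus_needed(70, "fp16", "MI300X"))   # 140 GB of weights
print(gpus_needed(671, "fp8", "MI300X"))   # 671 GB of weights
```

In practice the `headroom` argument should be set above zero to leave room for the KV cache, which grows with context length and batch size.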

Question 4

Can I scale a dedicated endpoint down to zero?

Accepted Answer

Yes. FlexAI dedicated endpoints use per-second billing and scale to zero between requests. You pay only while the endpoint is actively serving traffic. There are no reserved capacity fees or hourly minimums.
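The effect of per-second billing with scale-to-zero can be estimated by charging only for active seconds. The following is a simplified sketch: the hourly rate and usage pattern are hypothetical, the function names are invented for illustration, and it ignores any cold-start time that might itself be billed.

```python
SECONDS_PER_HOUR = 3600


def per_second_cost(hourly_usd: float, active_seconds: float) -> float:
    """Cost when billed only for the seconds the endpoint is serving."""
    return hourly_usd * active_seconds / SECONDS_PER_HOUR


def always_on_cost(hourly_usd: float, hours_per_month: float = 730.0) -> float:
    """Cost of keeping the same endpoint up for a whole (assumed 730-hour) month."""
    return hourly_usd * hours_per_month


HOURLY_USD = 4.00                                # hypothetical GPU rate
active_seconds = 6 * SECONDS_PER_HOUR * 30       # 6 busy hours/day for 30 days

print(f"Scale-to-zero: ${per_second_cost(HOURLY_USD, active_seconds):,.2f}")
print(f"Always on:     ${always_on_cost(HOURLY_USD):,.2f}")
```

For bursty workloads, the active-seconds figure rather than the calendar month determines the dedicated cost, which in turn lowers the crossover volume.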

Question 5

How accurate are the serverless rates shown?

Accepted Answer

Rates are approximate public list prices reviewed in April 2026. Provider pricing changes frequently, so check current rates on each provider's pricing page before making a purchasing decision. This calculator is most useful for directional guidance; the exact break-even point depends on your actual negotiated rates.

Should I use a serverless or a dedicated endpoint?

Dedicated on FlexAI isn't just about cost

Per-second billing

No shared rate limits

Your fine-tuned models

Your data, your VPC

Two cost models, one crossover

Serverless: variable cost

Dedicated: fixed cost

When does serverless inference get expensive?

Serverless vs. dedicated — common questions