    Serverless vs. dedicated endpoint calculator

    Should I use a serverless or a dedicated endpoint?

    At low volume, serverless is the obvious choice. At scale, dedicated beats per-token pricing. Find the exact crossover for your model and monthly token volume.

    Compare FlexAI dedicated endpoints against FlexAI's serverless options and other providers including OpenRouter, Together AI, and Fireworks.

    Your setup

    Token volume

    5B tokens
    Range: 1M to 10B tokens

    Serverless

    $1,215

    per month

    Dedicated

    $1,533

    per month

    Dedicated config: 1× NVIDIA H100 SXM
    Break-even: 6.31B tokens/mo
    at 5B tokens / month

    Serverless wins at this volume

    Cross over to FlexAI dedicated at 6.31B tokens/month. At double the crossover volume, you'd save $1,533/mo.

    Serverless rates come from the FlexAI catalog's cited public market sources, blended at a 70/30 input/output mix. Dedicated costs assume 1× NVIDIA H100 SXM for 730 hrs/month. Actual savings vary by workload and configuration.
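
    The 70/30 blend mentioned above is a weighted average of input and output prices. A minimal sketch, assuming hypothetical per-million-token prices (the $0.20 and $0.60 figures below are illustrative and not taken from the FlexAI catalog):

```python
def blended_rate(input_rate_per_m, output_rate_per_m, input_share=0.70):
    """Weighted-average price per million tokens for a given input/output mix."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    output_share = 1.0 - input_share
    return input_share * input_rate_per_m + output_share * output_rate_per_m


# Hypothetical prices in USD per million tokens (assumptions for illustration).
example_blended = blended_rate(input_rate_per_m=0.20, output_rate_per_m=0.60)
print(f"Blended rate: ${example_blended:.2f} per million tokens")
```

    A workload that generates more output than 30% of its tokens would see a higher blended rate, which moves the crossover to a lower volume.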

    Dedicated on FlexAI isn't just about cost

    Even below the crossover, dedicated endpoints give you things serverless APIs can't offer.

    Per-second billing

    Scale to zero between requests. Pay only for active compute, not provisioned time.

    No shared rate limits

    Your endpoint, your throughput. No queueing behind other tenants at peak.

    Your fine-tuned models

    Serve LoRAs and full fine-tunes without serverless catalog restrictions.

    Your data, your VPC

    Deploy on FlexAI or bring your own. Inference never leaves your boundary.

    Two cost models, one crossover

    Serverless: variable cost

    You pay per token. Cost scales linearly with volume: predictable, but the per-token price bundles infrastructure, operations, and provider margin.

    monthly = tokens_M × rate_per_M

    Dedicated: fixed cost

    You lease a GPU configuration for the month. Cost is flat regardless of how many tokens you generate. Above the crossover, the effective per-token rate falls below the serverless rate you are comparing against.

    monthly = gpu_count × hourly_rate × 730 hrs
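
    Setting the two formulas equal gives the crossover volume: the dedicated monthly cost divided by the serverless rate. The sketch below reproduces the example figures shown on this page. The two rates are assumptions back-calculated from those figures ($1,215 / 5,000M tokens and $1,533 / 730 hrs), not published prices, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # 365 days x 24 hours / 12 months


def serverless_monthly(tokens_m, rate_per_m):
    """Variable cost: millions of tokens times price per million tokens."""
    return tokens_m * rate_per_m


def dedicated_monthly(gpu_count, hourly_rate, hours=HOURS_PER_MONTH):
    """Fixed cost: GPUs leased for the whole month."""
    return gpu_count * hourly_rate * hours


def crossover_tokens_m(gpu_count, hourly_rate, rate_per_m):
    """Volume (in millions of tokens) at which both cost models are equal."""
    return dedicated_monthly(gpu_count, hourly_rate) / rate_per_m


# Assumed rates, derived from the example figures on this page.
RATE_PER_M = 0.243   # USD per million tokens: 1215 / 5000
GPU_HOURLY = 2.10    # USD per GPU-hour: 1533 / 730

serverless_cost = serverless_monthly(5_000, RATE_PER_M)
dedicated_cost = dedicated_monthly(1, GPU_HOURLY)
breakeven_m = crossover_tokens_m(1, GPU_HOURLY, RATE_PER_M)

print(f"Serverless at 5B tokens: ${serverless_cost:,.0f}")
print(f"Dedicated (1 GPU):       ${dedicated_cost:,.0f}")
print(f"Break-even:              {breakeven_m / 1000:.2f}B tokens/month")
```

    At twice the break-even volume, the serverless bill is twice the dedicated cost, so the saving equals one month of dedicated cost.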

    When does serverless inference get expensive?

    At low token volumes, serverless GPU inference is the obvious choice: no upfront commitment, no infrastructure to manage, and you pay only for what you use. But serverless pricing bundles infrastructure, operations, and provider margin into every token. At scale, that overhead adds up.

    Dedicated inference flips the model. You lease a fixed GPU configuration for the month and your endpoint processes as many tokens as the hardware allows. The effective per-token cost falls as volume grows, and above the crossover point, dedicated consistently undercuts even competitive serverless pricing.
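
    The falling effective rate can be made concrete by dividing the fixed monthly cost by the volume served. A minimal sketch, assuming the $1,533/month dedicated figure from the example above and assuming the hardware can actually sustain each volume (throughput limits are not modeled here):

```python
def effective_rate_per_m(fixed_monthly_cost, tokens_m):
    """Effective price per million tokens on a fixed-cost endpoint."""
    if tokens_m <= 0:
        raise ValueError("tokens_m must be positive")
    return fixed_monthly_cost / tokens_m


DEDICATED_MONTHLY = 1533.0  # USD, taken from the example on this page

# Volumes in millions of tokens per month: 1B, 5B, 10B.
effective = {v: effective_rate_per_m(DEDICATED_MONTHLY, v) for v in (1_000, 5_000, 10_000)}

for volume_m, rate in effective.items():
    print(f"{volume_m / 1000:>4.0f}B tokens/month -> ${rate:.4f} per million tokens")
```

    Doubling the volume halves the effective rate, which is why the comparison against a flat serverless rate eventually tips toward dedicated.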

    The break-even threshold varies widely by model. A smaller model with a low market rate has a higher crossover, so you need more volume before dedicated makes sense. A large model with aggressive serverless pricing may cross over at a surprisingly modest volume. This calculator shows you the exact threshold for your workload from two inputs: your model and your monthly token volume.
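
    For a fixed dedicated cost, the crossover is inversely proportional to the serverless rate. The sketch below illustrates this with hypothetical rates. The tier labels and prices are assumptions for illustration only, and it keeps the dedicated cost constant at the single-GPU figure from the example above, although a larger model may need a larger configuration.

```python
DEDICATED_MONTHLY = 1533.0  # USD, single-GPU example figure from this page

# Hypothetical serverless rates in USD per million tokens (assumptions).
hypothetical_rates = {
    "low-rate model": 0.10,
    "mid-rate model": 0.50,
    "high-rate model": 1.00,
}

# Crossover volume in millions of tokens: fixed cost divided by per-token rate.
crossovers = {
    name: DEDICATED_MONTHLY / rate for name, rate in hypothetical_rates.items()
}

for name, volume_m in crossovers.items():
    print(f"{name}: break-even at {volume_m / 1000:.2f}B tokens/month")
```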
