    DevOps · LLM · MLPerf · Benchmarking · GPU · Inference · Open Source · DeepSeek · Hugging Face

    FlexBench: An Open-Source, Modular MLPerf Benchmark for LLM Inference

    March 31, 2026 · 6 min read

    TL;DR

    AI system benchmarks like MLPerf struggle to keep pace with the rapidly evolving model landscape, making it difficult for organizations to make informed deployment decisions. We believe benchmarking should itself be treated as a machine learning problem — one where models are continuously tested and optimized across datasets, software, and hardware based on metrics like accuracy, latency, throughput, power consumption, and cost.

    That's why we built FlexBench: a modular, open-source version of the MLPerf LLM inference benchmark connected to Hugging Face. FlexBench aggregates existing and new benchmarking results into an Open MLPerf dataset, which can be collaboratively cleaned, extended, and used for predictive modeling. We validated FlexBench through our MLPerf Inference 5.0 submission, benchmarking DeepSeek R1 and LLaMA 3.3 on commodity H100 servers.

    Our long-term goal: empower teams to make cost-effective AI deployment decisions based on their available resources, requirements, and constraints.

    Why MLPerf Falls Short for LLM Inference Benchmarking

    AI service providers, server developers, and data center operators face a critical challenge: selecting the right hardware and software stack to ensure ROI within 3–5 years in a rapidly shifting landscape [1]. MLPerf was introduced as a full-stack inference benchmark to evaluate accuracy, latency, and throughput in a standardized, reproducible manner across diverse hardware and software stacks [2].

    But traditional benchmarks face a fundamental limitation: the combinatorial explosion of models, datasets, methods, and hardware configurations. Hugging Face alone hosts over a million ML models, more than 10,000 datasets, and thousands of methods — while new (and often incompatible) hardware and software ship continuously. Exploring all possible configurations is not only impractical, it's prohibitively expensive.

    MLPerf currently covers only a limited set of combinations — typically around a dozen — and updates just once a year. Its LLM benchmarks still focus on models like BERT, GPT-J, LLaMA 2 70B, LLaMA 3 405B, and Mixtral 8x7B, even as newer models like DeepSeek dominate production workloads. Worse, our hands-on experience with MLPerf shows that heavily over-optimized results from a few chip manufacturers are rarely achievable out of the box on other models, software versions, or hardware — significantly limiting their practical usefulness.

    Reframing GPU Benchmarking as a Machine Learning Problem

    We believe a fundamentally different approach is needed. Drawing on our past experience using AI to improve computer systems, we propose redefining MLPerf benchmarking as a learning task — with an open dataset of results and trainable objective functions to optimize key metrics such as accuracy, latency, throughput, power consumption, and cost [3][4][5].
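    To make the learning-task framing concrete, here is a minimal sketch that fits a regressor on aggregated benchmark results to predict throughput for unseen configurations. The column names and values are illustrative assumptions, not the actual Open MLPerf schema.

```python
# Minimal sketch: fit a regressor on benchmark results to predict throughput
# for unseen model/hardware combinations. Columns and values are illustrative
# assumptions, not the real Open MLPerf dataset schema.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy stand-in for aggregated benchmark results.
results = pd.DataFrame({
    "model_size_b":   [7, 13, 70, 405, 671],        # parameters, billions
    "dtype_bits":     [16, 16, 8, 8, 8],             # numeric precision
    "num_gpus":       [1, 1, 4, 8, 8],
    "tokens_per_sec": [2500, 1400, 900, 210, 180],   # measured throughput
})

X = results.drop(columns=["tokens_per_sec"])
y = results["tokens_per_sec"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print(model.predict(X_test))  # predicted throughput for held-out configurations
```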

    To support this vision, we developed FlexBench — an open-source, modular, and flexible version of the MLPerf language inference benchmark connected to the Hugging Face Hub. With a unified codebase and CLI, users can benchmark a wide range of models and datasets by adjusting just a few input parameters. FlexBench is designed for continuous evolution.

    We use the MLCommons CMX workflow automation framework to aggregate both existing and new benchmarking results — along with their associated metadata — into an open MLPerf dataset published on GitHub and Hugging Face. This dataset can be collaboratively cleaned, extended, and analyzed using standard data analytics techniques, including predictive modeling and feature engineering. We then use FlexBoard to visualize, compare, and predict the most suitable software/hardware configurations for different models based on user requirements and constraints.
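    As an illustration, the published dataset could be pulled from the Hugging Face Hub and inspected with pandas as sketched below; the repository ID is a placeholder and should be replaced with the dataset name actually published by the project.

```python
# Minimal sketch: pull the Open MLPerf dataset from the Hugging Face Hub and
# inspect it with pandas before cleaning. The repository ID is a placeholder.
from datasets import load_dataset

ds = load_dataset("your-org/open-mlperf-dataset", split="train")  # placeholder repo ID
df = ds.to_pandas()

print(df.columns.tolist())   # inspect available fields before cleaning
print(df.head())
```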

    FlexBench Architecture: Client-Server LLM Inference Benchmarking

    FlexBench uses a client-server architecture where the FlexBench client connects to a running vLLM server. It's built on MLPerf LoadGen, the official and reusable MLPerf harness that efficiently and fairly measures inference system performance [2][9]. Our goal is to retain MLPerf's rigorous measurement standards while making the framework more flexible by abstracting models and datasets as interchangeable modules. Hugging Face or local LLMs and datasets can be used with minimal setup.
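    The snippet below is a minimal sketch of this client-server pattern (not FlexBench's own client). It assumes a vLLM OpenAI-compatible server is already running locally, e.g. started with `vllm serve <model>`, and measures time-to-first-token by streaming a single request; the model name and prompt are placeholders.

```python
# Minimal sketch of the client-server pattern: send one streaming request to a
# vLLM OpenAI-compatible server and measure time-to-first-token (TTFT).
# Assumes a server is already listening on http://localhost:8000.
import time
import requests

URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",   # placeholder model name
    "prompt": "Explain MLPerf in one sentence.",
    "max_tokens": 64,
    "stream": True,
}

start = time.perf_counter()
ttft = None
with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        if ttft is None:
            ttft = time.perf_counter() - start   # first streamed chunk arrives here
total = time.perf_counter() - start

print(f"TTFT: {ttft:.3f}s, total latency: {total:.3f}s")
```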

    FlexBench supports two standard MLPerf inference modes:

    • Server (streaming) mode: Queries arrive according to a Poisson process, mimicking real-world request patterns (see the arrival-time sketch after this list)

    • Offline mode: All queries are sent simultaneously to maximize throughput
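    To illustrate the server-mode arrival pattern, the toy sketch below draws exponentially distributed inter-arrival times, which is equivalent to a Poisson arrival process at the target rate; FlexBench itself delegates this to MLPerf LoadGen.

```python
# Toy sketch of server-mode arrivals: a Poisson process with rate `qps` has
# exponentially distributed inter-arrival times.
import numpy as np

rng = np.random.default_rng(0)
qps = 4.0                                    # target queries per second
inter_arrivals = rng.exponential(1.0 / qps, size=10)
arrival_times = np.cumsum(inter_arrivals)    # timestamps at which queries are issued

print(np.round(arrival_times, 3))
```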

    FlexBench returns detailed metrics from LoadGen — including TTFT, throughput, and latency percentiles — all compliant with MLPerf standards and suitable for inclusion in the Open MLPerf dataset for further analysis and predictive analytics. We cross-validated these results against the vLLM benchmarking infrastructure and found strong alignment in performance numbers.
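    As a small illustration of the latency-percentile metrics, given a set of measured per-request latencies the p50/p90/p99 values can be computed as follows (the numbers are made up):

```python
# Minimal sketch: compute p50/p90/p99 latency percentiles from measured
# per-request latencies (seconds). Values are illustrative only.
import numpy as np

latencies = np.array([0.21, 0.25, 0.19, 0.31, 0.42, 0.27, 0.95, 0.24, 0.29, 0.33])
p50, p90, p99 = np.percentile(latencies, [50, 90, 99])
print(f"p50={p50:.3f}s  p90={p90:.3f}s  p99={p99:.3f}s")
```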

    FlexBench also provides accuracy metrics to guide further model optimizations such as quantization, pruning, and distillation. We've introduced a Queries Per Second (QPS) sweep mode to help users automatically identify the optimal QPS for their specific model, software, and hardware combination.
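    The sketch below illustrates the idea behind a QPS sweep under simplified assumptions: step through candidate request rates and keep the highest one whose p99 latency still meets a target. It is not FlexBench's actual sweep implementation; `measure_p99_latency` is a toy stand-in for a real benchmark run.

```python
# Toy sketch of a QPS sweep: keep the highest candidate rate whose p99 latency
# still meets the target. `measure_p99_latency` simulates a benchmark run.
def measure_p99_latency(qps: float, capacity: float = 12.0, base: float = 0.3) -> float:
    """Simulated p99 latency that grows sharply as qps approaches capacity."""
    if qps >= capacity:
        return float("inf")
    return base / (1.0 - qps / capacity)

def sweep(candidates, p99_target: float):
    best = None
    for qps in candidates:                  # candidates assumed to be increasing
        p99 = measure_p99_latency(qps)
        print(f"qps={qps:5.1f}  p99={p99:6.2f}s")
        if p99 <= p99_target:
            best = qps                      # highest rate so far that meets the target
    return best

print("optimal qps:", sweep([1, 2, 4, 6, 8, 10, 11], p99_target=2.0))
```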

    FlexBoard is implemented as a Gradio module that loads the Open MLPerf dataset via MLCommons CK/CM/CMX automations. It includes various predictive modeling and visualization plugins to help users analyze this data and predict the most efficient and cost-effective software/hardware configurations based on their requirements and constraints.
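    For a sense of what such a dashboard involves, here is a minimal Gradio sketch that loads a small results table and filters it by accelerator. It does not reflect FlexBoard's actual plugin architecture; the data and column names are made up.

```python
# Minimal sketch of a FlexBoard-style dashboard in Gradio: show a results table
# and filter it by accelerator. Data and column names are illustrative only.
import gradio as gr
import pandas as pd

results = pd.DataFrame({
    "model":          ["DeepSeek-R1", "Llama-3.3-70B", "Mixtral-8x7B"],
    "gpu":            ["H100", "H100", "A100"],
    "tokens_per_sec": [180, 950, 1400],
})

def filter_by_gpu(gpu: str) -> pd.DataFrame:
    return results[results["gpu"] == gpu]

with gr.Blocks() as demo:
    gr.Markdown("## Benchmark results explorer")
    gpu = gr.Dropdown(choices=["H100", "A100"], value="H100", label="Accelerator")
    table = gr.Dataframe(value=filter_by_gpu("H100"))
    gpu.change(filter_by_gpu, inputs=gpu, outputs=table)

demo.launch()
```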

    Benchmarking DeepSeek R1 and LLaMA 3.3 on H100 GPUs

    We validated FlexBench through our MLPerf Inference 5.0 submission by benchmarking several non-MLPerf LLM models — including DeepSeek R1 and LLaMA 3.3 — on the OpenOrca dataset, using commodity servers equipped with NVIDIA H100 GPUs. Our automation framework enabled rapid switching between models, datasets, and hardware configurations by simply modifying command-line parameters, without requiring any code changes.

    We also invested significant effort assembling the Open MLPerf dataset by unifying past MLPerf Inference results (v4.0) and combining them with the latest official submissions and FlexBench data. To enable predictive modeling, we cleaned the dataset, standardized disparate fields, and engineered new features such as model size and data type.
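    The feature-engineering step can be illustrated with a short pandas sketch that derives a numeric model-size feature and a precision (bits) feature from raw submission fields; the field names and values are assumptions, not the dataset's real schema.

```python
# Minimal sketch of feature engineering on benchmark submissions: derive a
# numeric model-size feature and a precision feature. Fields are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "model_name": ["llama-3.3-70b", "deepseek-r1-671b", "gpt-j-6b"],
    "precision":  ["fp8", "fp8", "fp16"],
})

# Extract the trailing parameter count (in billions) from the model name.
raw["model_size_b"] = (
    raw["model_name"].str.extract(r"(\d+(?:\.\d+)?)b$", expand=False).astype(float)
)
# Map the precision string to bits so it can be used as a numeric feature.
raw["dtype_bits"] = raw["precision"].map({"fp8": 8, "fp16": 16, "bf16": 16, "int4": 4})

print(raw)
```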

    We've released this cleaned and curated dataset, along with FlexBoard and predictive analytics tools, to help the broader ML, AI, and systems community accelerate benchmarking, evaluation, and optimization efforts. For example, our proof-of-concept prototype allows users to input system costs and predict optimal software/hardware configurations based on model size and data type features.
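    As a toy illustration of that idea (not the actual prototype), the sketch below ranks candidate configurations by throughput per dollar given a user-supplied hourly system cost; all numbers are made up.

```python
# Toy sketch: rank configurations by cost-effectiveness (tokens per dollar)
# given a user-supplied hourly system cost. All values are illustrative.
import pandas as pd

configs = pd.DataFrame({
    "config":         ["8xH100 fp8", "4xH100 fp8", "8xA100 fp16"],
    "tokens_per_sec": [1800, 950, 700],
    "cost_per_hour":  [24.0, 12.0, 14.0],    # user-supplied system cost, USD/h
})

configs["tokens_per_dollar"] = (
    configs["tokens_per_sec"] * 3600 / configs["cost_per_hour"]
)
print(configs.sort_values("tokens_per_dollar", ascending=False))
```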

    What's Next for FlexBench and Open MLPerf

    FlexBench and FlexBoard are still in early-stage prototyping. We invite researchers and practitioners to explore the tools, provide feedback, and collaborate on the following:

    • Extending FlexBench to support all types of models, datasets, and systems

    • Expanding the Open MLPerf dataset with FlexBench results from various models across diverse software/hardware configurations from different vendors

    • Engineering improved features — such as model graphs, tensor shapes, compiler optimizations, accelerator capabilities, and hardware topology — to enhance predictions for previously unseen AI workloads

    • Extending and improving FlexBoard based on user requirements and feedback

    • Sharing your specific benchmarking challenges to help guide our priorities

    Our long-term goal is to enable anyone to run AI models efficiently and cost-effectively, tailored to their available resources, requirements, and constraints. If you're interested in our approach or would like to collaborate, reach out to the authors at FCS Labs.

    References

    1. The MAD (ML, AI & Data) Landscape

    2. MLPerf Inference Benchmark (arXiv)

    3. Milepost GCC: Machine Learning Enabled Self-Tuning Compiler

    4. ACM TechTalk: Reproducibility

    5. Enabling More Efficient AI/ML Systems with Collective Mind (arXiv)

    6. Open MLPerf Dataset on Hugging Face

    7. FlexBench and FlexBoard (GitHub)

    8. MLCommons CK/CM/CMX Workflow Automation

    9. MLPerf LoadGen

    10. vLLM Benchmarking Infrastructure
