Turning AI System Benchmarking into an AI Task: FlexBench and the Open MLPerf Dataset

Post date: June 27, 2025

Post authors: Daniel Altunay, Machine Learning Engineer at FCS Labs, FlexAI; Grigori Fursin, Head of FCS Labs, FlexAI

TL;DR

AI system benchmarks, such as MLPerf, often struggle to keep pace with the rapidly evolving and diverse AI landscape, making it difficult for organizations to make informed deployment decisions. We believe benchmarking should itself be treated as an AI problem—one where models are continuously tested and optimized across datasets, software, and hardware, right out of the box, based on key metrics like accuracy, latency, throughput, power consumption, and cost. That’s why we are building FlexBench: a modular version of the MLPerf LLM inference benchmark connected to Hugging Face and designed to provide users with relevant, actionable insights. We aggregate all existing and new benchmarking results and metadata into an Open MLPerf dataset, which can be collaboratively cleaned, extended, and leveraged for predictive modeling and feature engineering. FlexBench has been successfully validated through our MLPerf Inference 5.0 submission, benchmarking DeepSeek R1 and LLaMA 3.3 on commodity servers. Our long-term goal is to empower teams to make cost-effective AI deployment decisions based on available resources, requirements, and constraints.

Motivation

AI service providers, server developers, and data center operators face a critical challenge: selecting the right hardware and software to ensure a return on investment (ROI) within 3 to 5 years, all within a rapidly evolving AI landscape [1]. MLPerf was introduced as a full-stack inference benchmark to help them evaluate the accuracy, latency, and throughput of various models in a standardized, reproducible, and apples-to-apples manner across diverse hardware and software stacks from different vendors [2].

However, these traditional benchmarks face the same fundamental limitation—an overwhelming number of possible combinations of models, datasets, methods, and hardware configurations. Exploring all these possibilities is not only impractical but also prohibitively expensive. For instance, Hugging Face already hosts over a million ML models, more than 10,000 datasets, and thousands of methods, while numerous companies continue to introduce new—and often incompatible—hardware and software. This combinatorial complexity makes comprehensive benchmarking one of the most significant challenges in AI system design and deployment.

The MLPerf benchmark covers only a very limited set of model, dataset, software, and method combinations—typically around a dozen. With its benchmark suite updated only about once a year and submission rounds run twice annually, MLPerf struggles to keep pace with the rapid advancements in AI. For instance, its current LLM benchmarks still focus on models like BERT, GPT-J, LLaMA 2 70B, LLaMA 3 405B, and Mixtral 8x7B, even as newer models like DeepSeek are making headlines. Furthermore, our extensive hands-on experience with MLPerf shows that the often heavily over-optimized results from just a few chip manufacturers are rarely achievable out of the box on other models, software versions, or hardware configurations—significantly limiting their practical usefulness.

Our FlexBench approach

We believe a fundamentally different approach is needed to address these challenges. Drawing on our past experience using AI to improve computer systems, we suggest that benchmarking can itself be framed as an AI problem [3,4,5]. We therefore propose redefining MLPerf benchmarking as a learning task with an open dataset of results and trainable objective functions to optimize key metrics such as accuracy, latency, throughput, power consumption, cost, and more.

To support our vision, we are developing FlexBench—an open-source, modular, and flexible version of the MLPerf LLM inference benchmark connected to the Hugging Face Hub [7]. With a unified codebase and CLI, users can automatically benchmark a wide range of models and datasets by adjusting just a few input parameters. FlexBench is designed for continuous evolution.

We use the MLCommons CMX workflow automation framework [8,5] to aggregate both existing and new benchmarking results—along with their associated metadata (features)—into an open MLPerf dataset, published on GitHub and Hugging Face [6,7]. This dataset can be collaboratively cleaned, extended, and analyzed using standard data analytics techniques, including predictive modeling and feature engineering. We then use FlexBoard to visualize, compare, and predict the most suitable software/hardware configurations for different models based on user requirements and constraints [7].
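For example, the aggregated results can be pulled directly from the Hugging Face Hub for local analysis. The short sketch below assumes the dataset ID from [6], a single default split, and the flat field names shown in the sample record later in this post:

# Sketch: load the Open MLPerf dataset [6] from the Hugging Face Hub for
# local analysis. The split name and column names are assumptions based on
# the sample record shown later in this post.
from datasets import load_dataset

ds = load_dataset("daltunay/OpenMLPerf", split="train")
df = ds.to_pandas()

# Example: summarize reported throughput (Tokens/s) per accelerator.
summary = df.groupby("system.accelerator.name")["metrics.result"].describe()
print(summary)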


Key Technical Details

FlexBench uses a client-server architecture in which the FlexBench client connects to a running vLLM server. It is built on MLPerf LoadGen, an official and reusable MLPerf harness that efficiently and fairly measures the performance of inference systems [9,2]. Our goal is to retain MLPerf’s rigorous standards while making the benchmark more flexible by abstracting models and datasets as interchangeable modules: Hugging Face or local LLMs and datasets can be used with minimal setup.
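To make the client-server setup concrete, below is a heavily simplified sketch of how a LoadGen-driven client can forward queries to a vLLM server through its OpenAI-compatible endpoint. This is not FlexBench's actual code: the endpoint URL, model name, and prompt list are placeholders, and it assumes a vLLM server is already running.

# Simplified sketch of a LoadGen client talking to an already-running vLLM
# server via its OpenAI-compatible API. Not FlexBench's actual code; the
# endpoint, model name, and prompts are placeholders.
import array
import mlperf_loadgen as lg
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
PROMPTS = ["Summarize the MLPerf Inference benchmark.", "What is LoadGen?"]

def issue_queries(samples):
    # LoadGen hands us query samples; we forward each prompt to vLLM and
    # report the completion back so LoadGen can record latency/throughput.
    for s in samples:
        completion = client.completions.create(
            model=MODEL, prompt=PROMPTS[s.index], max_tokens=128
        )
        data = array.array("B", completion.choices[0].text.encode())
        addr, _ = data.buffer_info()
        lg.QuerySamplesComplete([lg.QuerySampleResponse(s.id, addr, len(data))])

def flush_queries():
    pass

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(len(PROMPTS), len(PROMPTS),
                      lambda idx: None, lambda idx: None)

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly

lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)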


FlexBench supports two standard inference modes as described in the MLPerf Inference paper [2]: Server and Offline. In Server (streaming) mode, queries arrive according to a Poisson distribution, mimicking real-world request patterns. In Offline mode, all queries are sent to the system simultaneously to maximize throughput:

Figure: Server vs. Offline modes
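In LoadGen terms, the two modes differ mainly in the scenario flag and how the target load is specified. A minimal sketch of the corresponding settings is shown below (the QPS values are arbitrary examples):

import mlperf_loadgen as lg

# Server scenario: LoadGen generates queries with Poisson-distributed
# arrival times at the requested target QPS (the value here is an example).
server = lg.TestSettings()
server.scenario = lg.TestScenario.Server
server.mode = lg.TestMode.PerformanceOnly
server.server_target_qps = 4.0

# Offline scenario: LoadGen issues the whole query pool at once to
# maximize throughput; the expected QPS only sizes the generated query set.
offline = lg.TestSettings()
offline.scenario = lg.TestScenario.Offline
offline.mode = lg.TestMode.PerformanceOnly
offline.offline_expected_qps = 100.0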

FlexBench returns detailed metrics from LoadGen — including time to first token (TTFT), throughput, and latency percentiles — all compliant with MLPerf standards and suitable for inclusion in our Open MLPerf dataset for further analysis and predictive analytics. We have cross-validated these results with those obtained from the internal vLLM benchmarking infrastructure [10] and found strong alignment in performance numbers. FlexBench/MLPerf also provide accuracy metrics, which help guide further model optimizations such as quantization, pruning, and distillation. In addition, we have introduced a queries-per-second (QPS) sweep mode in FlexBench to help users automatically identify the optimal QPS for their models, software, and hardware.
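Conceptually, the QPS sweep is a simple search over the Server-mode target QPS. The sketch below uses a hypothetical run_server_benchmark() helper as a stand-in for a full FlexBench run; it is an illustration, not the actual implementation:

# Hypothetical sketch of a QPS sweep: run_server_benchmark() stands in for a
# full FlexBench/LoadGen Server run and is assumed to return the measured
# p99 time-to-first-token latency (in seconds) at a given target QPS.
def sweep_qps(run_server_benchmark, candidates=(1, 2, 4, 8, 16), ttft_p99_budget=2.0):
    best = None
    for qps in candidates:
        ttft_p99 = run_server_benchmark(target_qps=qps)
        print(f"target_qps={qps}: p99 TTFT={ttft_p99:.2f}s")
        if ttft_p99 <= ttft_p99_budget:
            best = qps  # highest QPS so far that still meets the latency budget
    return best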

FlexBoard is implemented as a Gradio module that loads the Open MLPerf dataset prepared by the MLCommons CK/CM/CMX automations. It includes various predictive modeling and visualization plugins that help users analyze and model this data to predict the most efficient and/or cost-effective software/hardware configurations for different models based on their requirements and constraints.
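As a rough illustration of the FlexBoard idea (not its actual implementation), a small Gradio app can load the dataset and let users filter results interactively. The dataset ID comes from [6], and the column names follow the sample record shown in the next section:

# Rough illustration of the FlexBoard idea (not its actual implementation):
# a small Gradio app that loads the Open MLPerf dataset [6] and filters
# results by accelerator. Split and column names are assumptions.
import gradio as gr
from datasets import load_dataset

df = load_dataset("daltunay/OpenMLPerf", split="train").to_pandas()

def filter_results(accelerator):
    cols = ["model.name", "software.framework", "metrics.result", "metrics.units"]
    return df[df["system.accelerator.name"] == accelerator][cols]

with gr.Blocks() as demo:
    choice = gr.Dropdown(sorted(df["system.accelerator.name"].dropna().unique()),
                         label="Accelerator")
    table = gr.Dataframe()
    choice.change(filter_results, inputs=choice, outputs=table)

demo.launch()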

Preliminary Results

We validated our approach in the MLPerf Inference 5.0 submission by benchmarking several non-MLPerf LLM models—including DeepSeek R1 and LLaMA 3.3—on the OpenOrca dataset, using commodity servers equipped with widely used NVIDIA H100 GPUs. Our automation framework enabled rapid switching between models, datasets, and hardware configurations by simply modifying command-line parameters—without requiring any code changes.

We also spent a considerable amount of time assembling the Open MLPerf dataset by unifying past MLPerf Inference results (v4.0) and combining them with the latest data from official submissions and FlexBench. To enable predictive modeling, we cleaned the dataset, standardized disparate fields, and engineered new features such as model size and data type. A sample record from the resulting dataset is shown below:

{
  "metrics.accuracy": "ROUGE1: 30.6202   ROUGE2: 13.9221   ROUGEL: 18.9101 TOKENS_PER_SAMPLE: 581.8",
  "metrics.result": 2631.93,
  "metrics.result_per_accelerator": 2631.93,
  "metrics.units": "Tokens/s",
  "model.architecture": "LLM",
  "model.mlperf_name": "llama2-70b-99",
  "model.name": "DeepSeek-R1-Distill-Llama-8B",
  "model.number_of_parameters": 8.0,
  "model.weight_data_types": "bfloat16",
  "software.framework": "vLLM v0.7.3",
  "software.operating_system": "Ubuntu 22.04.5 LTS (5.15.0-131-generic)",
  "submission.availability": "available",
  "submission.division": "open",
  "submission.organization": "FlexAI",
  "submission.scenario": "Server",
  "system.accelerator.count_per_node": 1,
  "system.accelerator.memory_capacity": null,
  "system.accelerator.memory_config": "HBM3",
  "system.accelerator.name": "NVIDIA H100 80GB HBM3",
  "system.accelerator.total_count": 1,
  "system.accelerator.vendor": "NVIDIA",
  "system.cpu.caches": "L1d cache: 6.3 MiB (200 instances), L1i cache: 6.3 MiB
(200 instances), L2 cache: 800 MiB (200 instances), L3 cache: 3.1 GiB (200 instances)",
  "system.cpu.core_count": 52,
  "system.cpu.count_per_node": 2,
  "system.cpu.frequency": null,
  "system.cpu.model": "Intel Xeon Processor (SapphireRapids)",
  "system.cpu.vcpu_count": null,
  "system.cpu.vendor": "Intel",
  "system.interconnect.accelerator": "NVLink",
  "system.interconnect.accelerator_host": "PCIe",
  "system.memory.capacity": null,
  "system.memory.configuration": "undefined",
  "system.name": "flexbench test node 0ef307db09d34a91 with 8xH100",
  "system.number_of_nodes": 1,
  "system.type": "datacenter"
}
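Records like the one above can be turned into numerical features for modeling. Here is a minimal sketch, assuming the dataset is loaded into a pandas DataFrame with these column names (the dtype-to-bits mapping is illustrative only):

# Minimal feature-engineering sketch over records like the one above,
# assuming the dataset is loaded into a pandas DataFrame with the same
# column names. The dtype-to-bits mapping is illustrative only.
import pandas as pd

DTYPE_BITS = {"float32": 32, "bfloat16": 16, "float16": 16, "fp8": 8, "int8": 8}

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Numerical model size in billions of parameters.
    df["feat.params_b"] = pd.to_numeric(df["model.number_of_parameters"], errors="coerce")
    # Bits per weight, derived from the reported weight data type.
    df["feat.weight_bits"] = df["model.weight_data_types"].map(DTYPE_BITS)
    # Rough estimate of the weight memory footprint in GB.
    df["feat.weight_mem_gb"] = df["feat.params_b"] * df["feat.weight_bits"] / 8
    # Per-accelerator throughput is already reported; keep it as the target.
    df["target.tokens_per_s_per_accel"] = pd.to_numeric(
        df["metrics.result_per_accelerator"], errors="coerce"
    )
    return df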

We have released this cleaned and curated dataset, along with FlexBoard and predictive analytics tools, to help the broader ML, AI, and systems community accelerate benchmarking, evaluation, and optimization efforts. For example, our proof-of-concept prototype allows users to input system costs and predict optimal software and hardware configurations based on model size and data type features.
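Below is a minimal sketch of that kind of predictor, assuming scikit-learn and the engineered features from the previous sketch; it is an illustration, not the model behind our prototype, and the cost-based ranking metric is hypothetical:

# Illustrative sketch (not the prototype's actual model): predict
# per-accelerator throughput from model size and weight data type, then
# rank configurations by a hypothetical tokens-per-dollar metric.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def train_throughput_model(df: pd.DataFrame) -> RandomForestRegressor:
    features = ["feat.params_b", "feat.weight_bits"]
    data = df.dropna(subset=features + ["target.tokens_per_s_per_accel"])
    X_train, X_test, y_train, y_test = train_test_split(
        data[features], data["target.tokens_per_s_per_accel"], random_state=0
    )
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print("R^2 on held-out results:", model.score(X_test, y_test))
    return model

def tokens_per_dollar(model, params_b, weight_bits, cost_per_hour_usd):
    # Hypothetical ranking metric: estimated hourly tokens / hourly system cost.
    est_tps = model.predict([[params_b, weight_bits]])[0]
    return est_tps * 3600 / cost_per_hour_usd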


What’s Next

Our open-source tools, FlexBench and FlexBoard, are still in the early stages of prototyping [7]. We invite researchers and practitioners to explore our technology, provide feedback, and collaborate on the following topics:

  • Extending FlexBench to support all types of models, datasets, and systems
  • Expanding the Open MLPerf dataset with more FlexBench results from various models across diverse software and hardware configurations from different vendors
  • Engineering improved features—such as model graphs, tensor shapes, compiler optimizations, accelerator capabilities, and hardware topology—to enhance predictions of optimal software/hardware configurations for previously unseen AI workloads
  • Extending and improving FlexBoard based on user requirements and feedback
  • Telling us about your specific benchmarking challenges to guide our priorities

Our long-term goal is to enable anyone to run AI models efficiently and cost-effectively, tailored to their available resources, requirements, and constraints. If you're interested in our AI-powered, flexible benchmarking approach or would like to collaborate, please reach out to the authors at FCS Labs!

References

[1] The MAD (ML, AI & Data) Landscape, https://mad.firstmark.com
[2] MLPerf Inference Benchmark, https://arxiv.org/abs/1911.02549
[3] Milepost GCC: Machine learning enabled self-tuning compiler, https://doi.org/10.1007/s10766-010-0161-2
[4] ACM TechTalk on reproducibility, https://learning.acm.org/techtalks/reproducibility
[5] Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments, https://arxiv.org/abs/2406.16791
[6] Open MLPerf dataset on Hugging Face, https://huggingface.co/datasets/daltunay/OpenMLPerf
[7] FlexBench and FlexBoard on GitHub, https://github.com/flexaihq/flexbench
[8] MLCommons CK/CM/CMX workflow automation technology, https://github.com/mlcommons/ck
[9] MLPerf LoadGen, a reusable module that efficiently and fairly measures the performance of inference systems, https://github.com/mlcommons/inference/tree/master/loadgen
[10] vLLM benchmarking infrastructure, https://github.com/vllm-project/vllm/tree/main/benchmarks