The Hidden Infrastructure Crisis Killing AI Startups: A Conversation with Flex AI's CEO

November 12, 2025

FlexAI

How Brijesh Tripathi is compressing 15 years of cloud evolution into 2 years for AI infrastructure

Small AI startups are dying—not from lack of innovation, but from infrastructure exhaustion.

While everyone focuses on model architecture and training data, a quieter crisis unfolds in the trenches: talented teams spending weeks configuring Kubernetes clusters, burning through runway on idle GPU capacity, and becoming DevOps experts when they set out to build AI applications.

Brijesh Tripathi has seen this problem from every angle. After 25 years at Nvidia, Apple, Tesla, and Intel—where he led the development of Aurora, one of the world's most powerful supercomputers—he founded Flex AI with a mission to solve what he calls "the missing layer" in AI infrastructure.

In a recent conversation on the AI Engineering Podcast with host Tobias Macy, Tripathi shared insights on the infrastructure challenges plaguing AI development and how his team is addressing them with their "workload as a service" platform.

The Aurora Insight: Software is Harder Than Hardware

Tripathi's journey to founding Flex AI began with an unexpected realization while delivering Aurora to Argonne National Lab. The supercomputer, currently among the most powerful ever built, was a hardware marvel destined for groundbreaking research in weather modeling, drug discovery, and nuclear science.

But getting it running wasn't the hard part.

"As I was finishing it, I realized that hardware is okay, it's hard, but actually the word software is a misnomer because software was the hardest thing," Tripathi explains. "The challenge was not just getting it up and running, but really getting customers and users to take advantage of this massive capacity. And it came from not having the right infrastructure management tools."

This insight contradicts the conventional wisdom in AI infrastructure. While the industry obsesses over GPU availability and chip specifications, the real bottleneck is often the software layer that sits between hardware and applications.

The problem manifests in several ways:

  • Setup time: Teams routinely spend 2-3 months getting GPU clusters operational, even for straightforward use cases
  • Specialized knowledge: Each cloud provider requires different configurations, turning AI engineers into infrastructure experts
  • Utilization waste: Most organizations see only 20-30% GPU utilization, wasting enormously expensive resources
  • Cost unpredictability: Renting GPUs by the month when you need them by the hour burns through capital

The Death of "GPU as a Service"

When Flex AI launched two years ago, "GPU as a service" was the industry buzzword. But Tripathi saw through the marketing: it was just bare metal with extra steps.

"What it was was plain and simple renting GPUs that only you have access to," he notes. "You start renting a GPU that you have to now go build a stack on top of it to make sure that it can run whether you're trying to run training on it, pre-train, fine-tuning, or you are trying to deploy an inference server on it—and everything will require a very different setup."

The result? Teams spending days to weeks configuring infrastructure for each new workload, then tearing it down and starting over for the next one.

"While you need to do that to be able to actually build what you're building, you were wasting a ton of time just setting it all up," Tripathi says. "So what we are trying to do with workload as a service is you worry about coming up with interesting ideas and deploying them, and we take care of the rest of it."

The 90/10 Prediction: Inference is the Future

One of Tripathi's most compelling insights concerns the shift from training to inference workloads.

"As this industry settles down, I think the ratio is going to be 90 to 10," he predicts. "90% compute is going to be spent on inference, 10% on training. It's going to be a handful of consolidated players who do big models, but the rest of them are going to be either an optimized version or a fine-tuned version or a reduced-size version of those similar models for their specific use cases."

This shift has profound implications for infrastructure:

For training: A handful of companies (OpenAI, Meta, xAI, Google, Anthropic) will build foundation models on massive Nvidia clusters. For these workloads, dedicated infrastructure makes sense—when training runs for months at 100% utilization, scheduler optimization matters less.

For inference: The rest of the market will serve these models to users. This is where Flex AI's value proposition shines. Inference workloads are:

  • Spiky: High demand peaks with long idle periods
  • Cost-sensitive: Margins depend on efficient serving
  • Architecture-agnostic: OpenAI API standardization means workloads can run on different hardware
  • Multi-tenant: Many customers can share the same infrastructure

"Companies that are actually doing extremely well on their ARR numbers—you know, raise $200 million revenue and all that—the cost of serving that is extremely high," Tripathi observes, citing the example of one company with $100 million ARR spending $87 million on infrastructure costs.

Technical Innovations: Beyond Kubernetes

While Kubernetes promised to abstract away infrastructure complexity, Tripathi argues it fell short for AI workloads.

"Theoretically that was the promise made, but unfortunately the dependencies, the libraries, the overall complexity of actually starting from N number of Kubernetes implementations—not everybody offers the exact same abstraction," he explains.

Flex AI addresses this with several technical innovations:

1. Self-Healing Infrastructure

When training on thousands of GPUs, node failures are inevitable. Traditional approaches require restarting from the last checkpoint, potentially losing hours of work.

"We have developed a solution that has zero cost to your training and yet zero impact when something fails because we'll quickly replace a failing node and let you continue on your journey without wasting any cycles," Tripathi explains.

The key is seamless checkpointing in the background with near-zero performance cost, combined with automatic node replacement.
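
To make the idea concrete, here is a minimal sketch of background checkpointing with resume-on-restart, assuming a PyTorch-style training loop. The paths, interval, and helper names are illustrative, not FlexAI's implementation.

```python
import glob
import os
import torch

CKPT_DIR = "checkpoints"  # illustrative path, not FlexAI's layout

def save_checkpoint(model, optimizer, step):
    """Write a checkpoint so a replacement node can resume mid-run."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    tmp = os.path.join(CKPT_DIR, f"step_{step}.pt.tmp")
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, tmp[:-4])  # atomic rename avoids half-written files

def load_latest_checkpoint(model, optimizer):
    """On a fresh or replacement node, resume from the newest checkpoint."""
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")),
                   key=lambda p: int(p.split("_")[-1].split(".")[0]))
    if not ckpts:
        return 0
    state = torch.load(ckpts[-1])
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

def train(model, optimizer, data_loader, ckpt_every=500):
    step = load_latest_checkpoint(model, optimizer)
    for step, batch in enumerate(data_loader, start=step):
        loss = model(batch).mean()  # placeholder loss for the sketch
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % ckpt_every == 0:
            save_checkpoint(model, optimizer, step)
```

If a node dies, the orchestrator only has to restart the same script on a replacement node; the loop picks up from the most recent checkpoint instead of from scratch.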

2. Multi-Tenancy for GPUs

CPUs have supported virtual machines for decades—multiple workloads sharing the same hardware. GPUs traditionally haven't.

Flex AI enables training and inference workloads to coexist on the same infrastructure. "Given where the demand is from inference, the training can actually autoscale to a different size if there's a QoS there that allows it to be downgraded a bit," Tripathi notes.

If you're running fine-tuning over several days, slowing down slightly during peak inference hours doesn't impact your business—but it dramatically improves overall utilization.
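
As a rough illustration of that QoS idea (not FlexAI's scheduler), the sketch below satisfies inference demand first and lets a training job shrink toward a minimum floor during peaks; the workload names and numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    kind: str          # "inference" or "training"
    min_gpus: int      # QoS floor the workload must keep
    desired_gpus: int  # what it would use if capacity were free

def reallocate(workloads, total_gpus):
    """Give latency-sensitive inference its desired share first,
    then let training absorb what is left, never below its floor."""
    allocation, remaining = {}, total_gpus
    for w in (w for w in workloads if w.kind == "inference"):
        gpus = min(w.desired_gpus, remaining)
        allocation[w.name] = gpus
        remaining -= gpus
    for w in (w for w in workloads if w.kind == "training"):
        gpus = max(w.min_gpus, min(w.desired_gpus, remaining))
        allocation[w.name] = gpus
        remaining -= gpus
    return allocation

# Example: an inference spike squeezes fine-tuning from 12 GPUs to 4.
jobs = [Workload("chat-api", "inference", 0, 12),
        Workload("fine-tune", "training", 4, 12)]
print(reallocate(jobs, total_gpus=16))  # {'chat-api': 12, 'fine-tune': 4}
```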

3. Intelligent Caching for Multi-Cloud

One customer had data in a hyperscaler and wanted to use cheaper compute in a neocloud, but egress fees (tens of thousands of dollars) made it uneconomical.

Flex AI's architecture caches data between cloud storage and on-node storage. "The first time we put this in a cache, and the caching was done between a cloud storage and on-node storage. But once the first step was done, there was no more egress fees," Tripathi explains.
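
The general pattern is a read-through cache: pay the egress fee once when data first leaves the hyperscaler, then serve every later read from on-node storage. A minimal sketch, with an illustrative cache path and a caller-supplied download function:

```python
import hashlib
import os
import shutil

CACHE_ROOT = "/mnt/local-cache"  # illustrative on-node cache location

def cached_fetch(remote_uri, download_fn):
    """Return a local path for remote_uri, downloading (and paying
    egress) only the first time; later reads hit the on-node cache."""
    key = hashlib.sha256(remote_uri.encode()).hexdigest()
    local_path = os.path.join(CACHE_ROOT, key)
    if os.path.exists(local_path):
        return local_path                 # cache hit: no egress fee
    os.makedirs(CACHE_ROOT, exist_ok=True)
    tmp_path = local_path + ".partial"
    download_fn(remote_uri, tmp_path)     # first and only transfer
    shutil.move(tmp_path, local_path)     # publish atomically
    return local_path
```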

4. Heterogeneous Compute Orchestration

Different AI workloads benefit from different architectures. Flex AI abstracts this complexity:

  • Training: Primarily Nvidia (where the ecosystem is most mature)
  • High-throughput inference: AMD (better throughput per dollar)
  • Cost-optimized inference: Tenstorrent and emerging architectures (6-10x more cost-effective)
  • Edge deployment: ASICs and small SoCs (power-constrained environments)

"We are moving away from specific architecture optimization, and most of these inference solutions endpoints are being called with a standard OpenAI API," Tripathi notes. "Once we go into that level of abstraction, the implementation details per architecture is actually controlled by us—the developer or the user doesn't have to know which architecture it's going to run on."

Priority-Based Orchestration: Scheduling for AI

At its core, Flex AI applies classic computer architecture concepts to GPU workloads.

"If you have studied computer architecture, there is a whole concept of scheduling where the entire purpose is to make sure every cycle of the CPU is busy," Tripathi explains. "Your entire job in scheduling is to make sure every cycle is being used and used for the right purposes."

Flex AI lets customers define priorities for workloads:

  • Real-time: No interruptions, highest priority (user-facing inference)
  • High priority: Business-critical training that impacts revenue
  • Best effort: Find the cheapest resources, run when capacity is available

The system continuously optimizes, even using checkpoints as opportunities to preempt long-running training jobs if higher-priority workloads need capacity.
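
The underlying idea is a classic priority queue: real-time work always dequeues first, and best-effort jobs wait for spare capacity or yield at their next checkpoint. A toy sketch of that ordering, not FlexAI's scheduler:

```python
import heapq
import itertools

PRIORITY = {"real-time": 0, "high": 1, "best-effort": 2}

class Scheduler:
    """Toy priority scheduler: lower number runs first; ties are FIFO."""
    def __init__(self):
        self._queue = []
        self._counter = itertools.count()

    def submit(self, job_name, tier):
        heapq.heappush(self._queue,
                       (PRIORITY[tier], next(self._counter), job_name))

    def next_job(self):
        return heapq.heappop(self._queue)[2] if self._queue else None

sched = Scheduler()
sched.submit("nightly-fine-tune", "best-effort")
sched.submit("chat-endpoint", "real-time")
sched.submit("revenue-model-train", "high")
print(sched.next_job())  # chat-endpoint: real-time work runs first
```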

"Recently, a few weeks ago, we had what I call this the magical point," Tripathi recalls. "We had multiple customers on our platform, multiple compute clusters from various providers, and multiple workloads—and all of that distributed amongst all these different GPUs consuming 100% of it, everybody happily running and moving forward."

That moment—multiple customers, multiple clouds, 100% utilization—represented proof that the orchestration approach works.

The Customer Experience: From Idea to Production

Flex AI serves customers across the spectrum, from individual developers to enterprises, through multiple interfaces:

For Technical Users: CLI

Experienced developers can launch workloads via command line, pointing the platform to their data and specifying GPU requirements. "You tell us how many GPUs you need, and within two clicks, you're able to now start a training, and you pay for work done and not rental cost for weeks or months," Tripathi says.

For Application Builders: Blueprints

For teams without deep ML infrastructure experience, Flex AI offers pre-built blueprints for common use cases:

  • Smart search
  • Voice transcription
  • Multi-cloud migration
  • Media image playgrounds

"We pick some standard models, open source models from Hugging Face. We have some dummy data or reference data that we have used from the open source world, and we build some of these applications for them to use for their own use case," Tripathi explains.

Users can swap in different models and their own data, creating customized applications without infrastructure expertise.

For Edge Cases: Containers

For the estimated 10% of use cases that don't fit the managed service model, Flex AI accepts custom containers. "You can bring your own containers and make it part of our scheduling system," Tripathi notes.

The container still benefits from the orchestration layer—priority scheduling, multi-cloud distribution, automatic scaling—without requiring the team to manage infrastructure.

Real-World Impact: Customer Stories

The platform's impact shows up in customer experiences:

The YC Demo: One Y Combinator company struggled for weeks to get infrastructure running for their demo day presentation. After reaching out to Flex AI, "we got them up and running within a couple of hours, and they were able to actually set this up and go do their demos."

The Egress Cost Savings: A customer with large datasets in a hyperscaler wanted to use cheaper compute elsewhere. The egress fees made it uneconomical—until Flex AI's caching architecture eliminated repeated transfer costs. "That saved a ton of money for them, and they were like, 'We would actually go with this solution because now we have access to all the compute in the world that is the right size, that is the right price, and that is the right capacity for us.'"

The Utilization Achievement: Teams that previously saw 20-30% GPU utilization are now hitting 70-80% through better orchestration, multi-tenancy, and heterogeneous compute strategies.

What's Out of Scope (and Why That Matters)

Tripathi is refreshingly honest about where Flex AI doesn't compete.

High end: Beyond 10,000 GPUs, companies are likely focused on single workloads (training massive foundation models) and should build dedicated infrastructure teams to extract every ounce of performance.

Infrastructure experts: Teams whose core competency is infrastructure management and who need complete control over micro-optimizations probably don't need the abstraction layer.

"We said there is going to be a catch-all solution, and that's going to be containers," Tripathi explains. "If it becomes important enough, we might actually make it part of our platform managed services offering, but for now you can pack—we can either help you package it in a container, and then now that becomes the workload."

This focus on the 90% use case—teams who want to build AI applications without becoming infrastructure experts—is strategic.

The Road Ahead: Inference Autoscaling and Blueprints

Looking forward, Tripathi is excited about several areas:

Inference Autoscaling: "We have a tool that we have developed that is a simulator for the size or the capacity of the cluster needed for a specific throughput required for a given model. We're now applying that to start autoscaling the capacity for a given user or a given workload."

The system will deploy new nodes within seconds of detecting increased traffic, providing the most cost-optimized solution for spiky inference workloads.
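
The sizing logic behind such a simulator can be approximated with simple arithmetic: divide the target throughput, plus headroom for spikes, by what one node sustains. A back-of-envelope sketch with made-up numbers:

```python
import math

def nodes_needed(target_tokens_per_sec, per_node_tokens_per_sec,
                 headroom=0.2):
    """Rough cluster sizing: nodes required to sustain a target
    throughput, with spare headroom for traffic spikes."""
    required = target_tokens_per_sec * (1 + headroom)
    return math.ceil(required / per_node_tokens_per_sec)

# Example: serve 50,000 tokens/s when one node sustains 6,000 tokens/s.
print(nodes_needed(50_000, 6_000))  # 10 nodes with 20% headroom
```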

More Blueprints: Expanding the library of pre-built application templates to enable even faster time-to-deployment. "You have an idea or a concept, you come on our platform and you should be able to just use our widgets and platform and just deploy it within hours or a day."

New Architectures: Continuing to integrate emerging compute options—TPUs, AMD, Tenstorrent, Trainium, and others—giving customers access to the best price-performance for each workload.

The Bigger Picture: Democratizing AI Infrastructure

Beneath the technical innovations runs a deeper mission: making AI infrastructure accessible.

"If your cost of delivering AI solutions is so high that only certain segments of society can reach it, that means you reduce access," Tripathi argues. "While there is a business case to be made in terms of reducing cost so you can improve your bottom line, there's also a really important task here, which is reducing the cost of accessing AI."

He points to Airtel in India offering Perplexity free to all subscribers as an example. "That is something that makes sense. It is the right thing to do. However, if it is not affordable and it continues to actually cost tons and tons of money to deliver that value, it's not a sustainable model."

The cost must come down—not just for business sustainability, but for democratizing access to AI capabilities globally.

The Success Metric That Matters

When asked about success metrics, Tripathi focuses on something simple but profound:

"In my mind, the success metric for us is going to be when our users can claim that we haven't dealt with infrastructure issues in the last so many months that we have been working with Flex AI. That's going to be a success for us."

Not revenue milestones. Not GPU counts. Not even technical metrics like utilization percentages.

Success is measured by infrastructure problems that don't happen—by the weeks of DevOps work that talented AI teams never have to do, by the runway that doesn't get burned on idle capacity, by the iterations that happen faster because infrastructure just works.

Conclusion: The Infrastructure Layer That Should Have Always Existed

Tripathi's insight from the Aurora project—that a missing infrastructure management layer was preventing users from accessing compute—has proven prescient.

As AI moves from research to production, from training to inference, from foundation models to specialized applications, the infrastructure complexity only increases. The teams best positioned to succeed won't necessarily be those with the most GPUs or the biggest models.

They'll be the ones who can iterate fastest, deploy efficiently, and focus their talent on solving actual problems rather than managing infrastructure.

"I would like my fellow startup founders to focus on what they started the company for and leave the DevOps and the infrastructure management to us," Tripathi says. "So they can get the most value out of their teams and the money they raised."

In compressing 15 years of cloud computing evolution into 2 years for AI infrastructure, Flex AI isn't just building a platform. They're building the layer that should have existed from the beginning—the one that lets brilliant teams build brilliant things without becoming infrastructure experts first.

Listen to the full conversation between Brijesh Tripathi and Tobias Macy on the AI Engineering Podcast.
