
    From Supercomputers to Serverless: How FlexAI is Solving the GPU Infrastructure Challenge

    March 20, 2026 · 4 min read

    This blog summarizes Brijesh Tripathi's appearance on the AI Chat podcast with host Jaeden Schafer.

    The AI infrastructure landscape is at a critical inflection point. While companies race to build larger models and acquire more GPUs, a fundamental challenge remains largely unsolved: how do we actually make this expensive infrastructure efficient, accessible, and easy to use?

    Brijesh Tripathi, CEO of FlexAI, knows this problem intimately. After spending 20-25 years at companies like Nvidia, Apple, Tesla, and Intel—where he led the development of Aurora, one of the world's most powerful supercomputers—he discovered a surprising truth: hardware isn't the hardest part. Software is.

    The Infrastructure Management Problem

    While leading Intel's Aurora supercomputer project, Tripathi witnessed firsthand how difficult it was for customers to actually leverage massive computing capacity. "The challenge was not just getting it up and running, but really getting customers and users to take advantage of this massive capacity," he explains. "And it came from not having the right infrastructure management tools."

    This insight became the foundation for FlexAI, which launched two years ago with a bold vision: create a universal AI platform that works across any GPU architecture—Nvidia, AMD, Amazon's Trainium, or emerging alternatives.

    Beyond "GPU as a Service"

    Traditional GPU-as-a-service offerings, Tripathi argues, are essentially "bare metal" solutions. You get access to GPUs, but then you need to build an entire software stack on top before you can actually do anything useful. The industry typically struggles for 2-3 months just to get systems operational, even for straightforward use cases.

    FlexAI takes a different approach. The platform offers:

    • Two-click training deployment: Users can start training models immediately, paying only for compute actually used rather than renting GPUs for weeks or months

    • Serverless inference: Automatic scaling and architecture recommendations based on whether you prioritize cost or latency

    • Multi-architecture support: Seamlessly switch between Nvidia, AMD, Tenstorrent, and other hardware options

    Solving the Cold Start Problem

    Perhaps FlexAI's most innovative contribution is their solution to the notorious "cold start" problem in AI inference. When demand drops and GPUs spin down to save costs, traditional systems face significant delays when spinning back up to handle new requests.

    FlexAI's breakthrough? Fractional GPU allocation. Instead of scaling to zero, the system scales down to a fraction of a GPU, keeping the model loaded in memory but consuming minimal compute resources. When demand spikes, the system instantly scales up to full GPUs without any cold start delay.

    "There's no cold start issue," Tripathi emphasizes. "The moment they want higher demand, the data is still there in memory, the model is still loaded, but it's just consuming a very small fraction of compute."
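The scale-to-fraction idea can be sketched as a simple allocation policy. This is an illustrative sketch only, not FlexAI's actual implementation; the constants (`GPU_FLOOR`, `REQS_PER_GPU`, `GPU_CEILING`) and the function name are assumptions for the example:

```python
# Hypothetical sketch of scaling to a GPU fraction instead of zero.
# All names and numbers here are illustrative, not FlexAI's API.

GPU_FLOOR = 0.1       # never release the last slice: model weights stay resident
GPU_CEILING = 8.0     # maximum GPUs this deployment may consume
REQS_PER_GPU = 50.0   # sustainable requests/sec per full GPU (assumed)

def target_allocation(reqs_per_sec: float) -> float:
    """Return the GPU share to allocate for the current load.

    A scale-to-zero policy would return 0.0 when idle, forcing a cold
    start (model reload from disk) on the next request. Holding a small
    fractional floor instead keeps the model in GPU memory, so scaling
    back up to full GPUs is effectively instant.
    """
    needed = reqs_per_sec / REQS_PER_GPU
    return min(GPU_CEILING, max(GPU_FLOOR, needed))

# Idle traffic holds a 0.1-GPU slice; a demand spike claims full GPUs
# immediately, with no model-loading delay.
```

The key design choice is the non-zero floor: the cost of keeping a small memory-resident fraction is traded against eliminating the reload latency entirely.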

    The Cost Crisis in AI

    Tripathi is blunt about the economic challenges facing AI companies: "The cost per token is extremely high. You will just get killed within months if you don't take control of that part of your business."

    This isn't just about business margins—it's about access. If AI infrastructure remains prohibitively expensive, large portions of society won't have access to these transformative technologies. Tripathi points to examples like Airtel in India offering Perplexity free to subscribers, noting that such initiatives only work if the underlying costs decrease dramatically.
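To see why per-token cost dominates the economics, a back-of-envelope calculation helps. All figures below are assumed for illustration, not quoted from the podcast:

```python
# Illustrative cost-per-token arithmetic (every number here is an
# assumption, not data from FlexAI or the podcast).

gpu_hour_cost = 3.00      # $/GPU-hour, assumed on-demand rate
tokens_per_sec = 2_000    # assumed serving throughput of one GPU

tokens_per_hour = tokens_per_sec * 3600
cost_per_million = gpu_hour_cost / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per 1M tokens")

# Doubling throughput (quantization, batching, better scheduling)
# halves the cost per token at the same hardware spend:
print(f"${gpu_hour_cost / (2 * tokens_per_hour) * 1_000_000:.2f} per 1M tokens")
```

Under these assumed numbers, every efficiency gain flows straight into unit economics, which is why infrastructure-level optimization, rather than just cheaper GPU rental, is the lever Tripathi emphasizes.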

    The Future: 90% Inference, 10% Training

    Looking ahead, Tripathi predicts a fundamental shift in how compute resources are allocated: "As this industry settles down, I think the ratio is going to be 90 to 10. 90% compute is going to be spent on inference, 10% on training."

    A handful of companies will continue training massive foundational models, but most organizations will run optimized, fine-tuned, or smaller versions tailored to specific use cases. This shift makes cost optimization and efficient inference even more critical.

    The Death of RAG?

    In perhaps his most contrarian prediction, Tripathi sees retrieval-augmented generation (RAG) as a dying technology. "As the model context windows are increasing, the old hacks of using RAG to do your custom search based on your private data and all that are dying," he says. With context windows growing to millions of tokens, the need for RAG-based workarounds diminishes significantly.

    A Platform Built by Infrastructure Veterans

    FlexAI's approach stems directly from Tripathi's background in computer architecture and GPU design. The team is tackling genuinely hard problems—GPU virtualization, multi-tenancy, efficient workload scheduling—with the goal of compressing what took general-purpose cloud computing 10-15 years to achieve into just two years for AI infrastructure.

    The company serves customers across healthcare, legal, crypto, and other sectors, all facing the same fundamental challenge: how to mature AI solutions while controlling costs. By providing an integrated platform that handles model experimentation, quantization, architecture optimization, and seamless deployment, FlexAI aims to make sophisticated AI infrastructure accessible to companies that couldn't otherwise afford to build it themselves.

    Getting Started: Companies interested in exploring FlexAI can visit flex.ai to start with free credits, or connect with the team on Discord and LinkedIn.

    Listen to the podcast here.

    Get Started Today

    Start building with €100 in free credits for first-time users.