When OpenAI went down in December, one of TrueFoundry's customers faced a crisis that had nothing to do with chatbots or content generation. The company uses large language models to help refill prescriptions. Every second of downtime meant thousands of dollars in lost revenue, and patients who could not access their medications on time.
TrueFoundry, an enterprise AI infrastructure company, announced Wednesday a new product called TrueFailover designed to prevent exactly that scenario. The system automatically detects when AI providers experience outages, slowdowns, or quality degradation, then reroutes traffic to backup models and regions before users notice anything went wrong.
"The challenge is that in the AI world, failover is not that simple," said Nikunj Bajaj, co-founder and chief executive of TrueFoundry, in an exclusive interview with VentureBeat. "When you move from one model to another, you also have to consider things like output quality, latency, and whether the prompt even works the same way. In many cases, the prompt needs to be adjusted in real-time to prevent results from degrading. That's not something most teams are set up to manage manually."
The announcement arrives at a pivotal moment for enterprise AI adoption. Companies have moved far beyond experimentation. AI now powers prescription refills at pharmacies, generates sales proposals, assists software developers, and handles customer support inquiries. When these systems fail, the consequences ripple through entire organizations.
Why enterprise AI systems remain dangerously dependent on single providers
Large language models from OpenAI, Anthropic, Google, and other providers have become essential infrastructure for thousands of businesses. But unlike traditional cloud services from Amazon Web Services or Microsoft Azure, which offer robust uptime guarantees backed by decades of operational experience, AI providers operate complex, resource-intensive systems that remain susceptible to unexpected failures.
"Major LLM providers experience outages, slowdowns, or latency spikes every few weeks or months, and we regularly see the downstream impact on businesses that rely on a single provider," Bajaj told VentureBeat.
The December OpenAI outage that affected TrueFoundry's pharmacy customer illustrates the stakes. "At their scale, even seconds of downtime can translate into thousands of dollars in lost revenue," Bajaj explained. "Beyond the economic impact, there is also a human consequence when patients cannot access prescriptions on time. Because this customer had our failover solution in place, they were able to reroute requests to another model provider within minutes of detecting the outage. Without that setup, recovery would likely have taken hours."
The problem extends beyond complete outages. Partial failures, where a model slows down or produces lower-quality responses without going fully offline, can quietly destroy user experience and violate service-level agreements. These "slow but technically up" scenarios often prove more damaging than dramatic crashes because they evade traditional monitoring systems while steadily eroding performance.
Inside the technology that keeps AI applications online when providers fail
TrueFailover operates as a resilience layer on top of TrueFoundry's AI Gateway, which already processes more than 10 billion requests per month for Fortune 1000 companies. The system combines several interconnected capabilities into a unified safety net for enterprise AI.
At its core, the product enables multi-model failover by allowing enterprises to define primary and backup models across providers. If OpenAI becomes unavailable, traffic automatically shifts to Anthropic, Google's Gemini, Mistral, or self-hosted alternatives. The routing happens transparently, without requiring application teams to rewrite code or manually intervene.
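The primary-and-backup idea can be illustrated with a minimal Python sketch. It is not TrueFoundry's implementation: the `ProviderError` type, the `complete_with_failover` function, and the two stand-in provider functions are hypothetical names invented for this example, and a real gateway would call actual provider SDKs.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each provider in priority order; return (provider_name, response)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Remember why this provider failed, then fall through to the next.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-ins for real provider clients (assumptions for illustration only).
def flaky_primary(prompt: str) -> str:
    raise ProviderError("503 service unavailable")


def healthy_backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


chain = [("primary", flaky_primary), ("backup", healthy_backup)]
```

Calling `complete_with_failover("refill status", chain)` skips the failing primary and returns the backup's response, so the calling application never has to change its own code.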
The system extends this protection across geographic boundaries through multi-region and multi-cloud resilience. By distributing AI endpoints across zones and cloud providers, health-based routing can detect problems in specific regions and divert traffic to healthy alternatives. What would otherwise become a global incident turns into an invisible infrastructure adjustment that users never perceive.
Perhaps most critically, TrueFailover employs degradation-aware routing that continuously monitors latency, error rates, and quality signals. "We look at a combination of signals that together indicate when a model's performance is starting to degrade," Bajaj explained. "Large language models are shared resources. Providers run the same model instance across many customers, so when demand spikes for one user or workload, it can affect everyone else using that model."
The system watches for rising response times, increasing error rates, and patterns suggesting instability. "Individually, none of these signals tell the full story," Bajaj said. "But taken together, they allow us to detect early signs that a model is slowing down or becoming unreliable. Those signals feed into an AI-driven system that can decide when and how to reroute traffic before users experience a noticeable drop in quality."
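To make the idea of combining signals concrete, here is a rough Python sketch of a rolling-window health check. It uses fixed thresholds rather than the AI-driven decision system Bajaj describes, and the `HealthMonitor` class, its parameters, and the threshold values are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import median


class HealthMonitor:
    """Rolling window of recent calls to one model (hypothetical helper)."""

    def __init__(self, window=50, max_median_latency_s=2.0,
                 max_error_rate=0.2, min_samples=10):
        self.samples = deque(maxlen=window)  # each entry: (latency_s, ok)
        self.max_median_latency_s = max_median_latency_s
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, latency_s: float, ok: bool) -> None:
        self.samples.append((latency_s, ok))

    def is_degraded(self) -> bool:
        # Too little data: avoid flapping on a handful of requests.
        if len(self.samples) < self.min_samples:
            return False
        latencies = [latency for latency, _ in self.samples]
        failures = sum(1 for _, ok in self.samples if not ok)
        error_rate = failures / len(self.samples)
        # Either signal alone is enough to flag the model as degraded.
        return (median(latencies) > self.max_median_latency_s
                or error_rate > self.max_error_rate)
```

A router could call `record` after every request and consult `is_degraded` before choosing a model. This catches the "slow but technically up" case, because a model that still answers every request is flagged once its median latency climbs past the threshold.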
Strategic caching rounds out the protection by shielding providers from sudden traffic spikes and preventing rate-limit cascades during high-demand periods. This allows systems to absorb demand surges and stay within provider limits without brownouts or throttling surprises.
The approach represents a fundamental shift in how enterprises should think about AI reliability. "TrueFailover is designed to handle that complexity automatically," Bajaj said. "It continuously monitors how models behave across many customers and use cases, looks for early warning signs like rising latency, and takes action before things break. Most individual enterprises do not have that kind of visibility because they are only able to see their own systems."
The engineering challenge of switching models without sacrificing output quality
One of the thorniest challenges in AI failover involves maintaining consistent output quality when switching between models. A prompt optimized for GPT-5 may produce different results on Claude or Gemini. TrueFoundry addresses this through several mechanisms that balance speed against precision.
"Some teams rely on the fact that large models have become good enough that small differences in prompts don't materially affect the output," Bajaj explained. "In those cases, switching from one provider to another can happen with some visible impact. That's not ideal, but some teams choose to do it."
More sophisticated implementations maintain provider-specific prompts for the same application. "When traffic shifts from one model to another, the prompt shifts with it," Bajaj said. "In that case, failover is not just switching models. It's switching to a configuration that has already been tested."
TrueFailover automates this process. The system dynamically routes requests and adjusts prompts based on which model handles the query, keeping quality within acceptable ranges without manual intervention. The key, Bajaj emphasized, is that "failover is planned, not reactive. The logic, prompts, and guardrails are defined ahead of time, which is why end users typically don't notice when a switch happens."
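One way to picture a prompt that "shifts with" the traffic is a route table in which every model carries its own pre-tested prompt template. The Python sketch below assumes exactly that. The `Route` class, the `build_request` function, and the provider names, model names, and prompt wording are placeholders, not TrueFoundry configuration.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Route:
    provider: str
    model: str
    prompt_template: str  # tuned and tested for this specific model


# Priority-ordered routes; every name and template here is hypothetical.
ROUTES = [
    Route("provider_a", "model-large",
          "You are a pharmacy assistant. Answer briefly.\n\nQuestion: {question}"),
    Route("provider_b", "model-alt",
          "Role: pharmacy assistant\nRules: answer in at most two sentences.\n"
          "Question: {question}"),
]


def build_request(question: str, healthy: Set[str]) -> Dict[str, str]:
    """Pick the first healthy route and render that route's own prompt."""
    for route in ROUTES:
        if route.provider in healthy:
            return {
                "provider": route.provider,
                "model": route.model,
                "prompt": route.prompt_template.format(question=question),
            }
    raise RuntimeError("no healthy route configured")
```

Because each template is written and tested ahead of time, a switch from `provider_a` to `provider_b` changes the model and the prompt together, which matches the planned rather than reactive failover described above.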
Importantly, many failover scenarios don't require changing providers at all. "It can be routing traffic from the same model in one region to another region, such as from the East Coast to the West Coast, where no prompt changes are required," Bajaj noted. This geographic flexibility provides a first line of defense before more complex cross-provider switches become necessary.
How regulated industries can use AI failover without compromising compliance
For enterprises in healthcare, financial services, and other regulated sectors, the prospect of AI traffic automatically routing to different providers raises immediate compliance concerns. Patient data cannot simply flow to whichever model happens to be available. Financial records require strict controls over where they travel. TrueFoundry built explicit guardrails to address these constraints.
"TrueFailover will never route data to a model or provider that an enterprise has not explicitly approved," Bajaj said. "Everything is controlled through an admin configuration layer where teams set clear guardrails upfront."
Enterprises define exactly which models qualify for failover, which providers can receive traffic, and even which regions or model categories (such as closed-source versus open-source) are acceptable. Once those rules take effect, TrueFailover operates only within them.
"If a model isn't on the approved list, it's simply not an option for routing," Bajaj emphasized. "There is no scenario where traffic is automatically sent somewhere unexpected. The idea is to give teams full control over compliance and data boundaries, while still allowing the system to respond quickly when something goes wrong. That way, reliability improves without compromising security or regulatory requirements."
This design reflects lessons learned from TrueFoundry's existing enterprise deployments. A Fortune 50 healthcare company already uses the platform to handle more than 500 million IVR calls annually through an agentic AI system. That customer required the ability to run workloads across both cloud and on-premise infrastructure while maintaining strict data residency controls, exactly the kind of hybrid environment where failover policies must be precisely defined.
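The approved-list guardrail described in this section amounts to filtering failover candidates against an explicit policy before any routing decision is made. The Python sketch below shows that filter under stated assumptions: the `Target` and `Policy` classes and their field names are invented for illustration and do not reflect TrueFoundry's admin configuration format.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Target:
    provider: str
    model: str
    region: str
    open_source: bool


@dataclass(frozen=True)
class Policy:
    """Admin-defined guardrails (hypothetical shape)."""
    approved_providers: FrozenSet[str]
    approved_models: FrozenSet[str]
    approved_regions: FrozenSet[str]
    allow_open_source: bool

    def permits(self, target: Target) -> bool:
        # Every condition must hold; anything not listed is rejected.
        return (target.provider in self.approved_providers
                and target.model in self.approved_models
                and target.region in self.approved_regions
                and (self.allow_open_source or not target.open_source))


def eligible_targets(candidates: Sequence[Target], policy: Policy) -> List[Target]:
    """Keep only the candidates the policy explicitly allows, in order."""
    return [t for t in candidates if policy.permits(t)]
```

The filter is deny-by-default: a target that is healthy and fast but missing from any approved set never appears in the list the router chooses from.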
Where automated failover cannot help and what enterprises must plan for
TrueFoundry acknowledges that TrueFailover cannot solve every reliability problem. The system operates within the guardrails enterprises configure, and those configurations determine what protection is possible.
"If a team allows failover from a large, high-capacity model to a much smaller model without adjusting prompts or expectations, TrueFailover cannot guarantee the same output quality," Bajaj explained. "The system can route traffic, but it cannot make a smaller model behave like a larger one without appropriate configuration."
Infrastructure constraints also limit protection. If an enterprise hosts its own models and all of them run on the same GPU cluster, TrueFailover cannot help when that infrastructure fails. "When there is no alternate infrastructure available, there is nothing to fail over to," Bajaj said.
The question of simultaneous multi-provider failures often surfaces in enterprise risk discussions. Bajaj argues this scenario, while theoretically possible, rarely matches reality. "In practice, 'going down' usually doesn't mean an entire provider is offline across all models and regions," he explained. "What happens much more often is a slowdown or disruption in a specific model or region because of traffic spikes or capacity issues."
When that occurs, failover can happen at multiple levels: from on-premise to cloud, cloud to on-premise, one region to another, one model to another, or even within the same provider before switching providers entirely. "That alone makes it very unlikely that everything fails at once," Bajaj said. "The key point is that reliability is built on layers of redundancy. The more providers, regions, and models that are included in the guardrails, the smaller the chance that users experience a complete outage."
A startup that built its platform inside Fortune 500 AI deployments
TrueFoundry has established itself as infrastructure for some of the world's largest AI deployments, providing crucial context for its failover ambitions. The company raised $19 million in Series A funding in February 2025, led by Intel Capital with participation from Eniac Ventures, Peak XV Partners, and Jump Capital. Angel investors including Gokul Rajaram and Mohit Aron also joined the round, bringing total funding to $21 million.
The San Francisco-based company was founded in 2021 by Bajaj and co-founders Abhishek Choudhary and Anuraag Gutgutia, all former Meta engineers who met as classmates at IIT Kharagpur. Initially focused on accelerating machine learning deployments, TrueFoundry pivoted to support generative AI capabilities as the technology went mainstream in 2023.
The company's customer roster demonstrates enterprise-scale adoption that few AI infrastructure startups can match. Nvidia employs TrueFoundry to build multi-agent systems that optimize GPU cluster utilization across data centers worldwide, a use case where even small improvements in utilization translate into substantial business impact given the insatiable demand for GPU capacity. Adopt AI routes more than 15 million requests and 40 billion input tokens through TrueFoundry's AI Gateway to power its enterprise agentic workflows.
Gaming company Games 24×7 serves machine learning models to more than 100 million users through the platform at scales exceeding 200 requests per second. Digital adoption platform Whatfix migrated to a microservices architecture on TrueFoundry, reducing its release cycle sixfold and cutting testing time by 40 percent.
TrueFoundry currently reports more than 30 paid customers worldwide and has indicated it exceeded $1.5 million in annual recurring revenue last year while quadrupling its customer base. The company manages more than 1,000 clusters for machine learning workloads across its client base.
TrueFailover will be offered as an add-on module on top of the existing TrueFoundry AI Gateway and platform, with pricing following a usage-based model tied to traffic volume along with the number of users, models, providers, and regions involved. An early access program for design partners opens in the coming weeks.
Why traditional cloud uptime guarantees may never apply to AI providers
Enterprise technology buyers have long demanded uptime commitments from infrastructure providers. Amazon Web Services, Microsoft Azure, and Google Cloud all offer service-level agreements with financial penalties for failures. Will AI providers eventually face similar expectations?
Bajaj sees fundamental constraints that make traditional SLAs difficult to achieve in the current generation of AI infrastructure. "Most foundational LLMs today operate as shared resources, which is what enables the standard pricing you see publicly advertised," he explained. "Providers do offer higher uptime commitments, but that usually means dedicated capacity or reserved infrastructure, and the cost increases significantly."
Even with substantial budgets, enterprises face usage quotas that create unexpected exposure. "If traffic spikes beyond those limits, requests can still spill back into shared infrastructure," Bajaj said. "That makes it hard to achieve the kind of hard guarantees enterprises are used to with cloud providers."
The economics of running large language models create additional barriers that may persist for years. "LLMs are still extremely complex and expensive to run," Bajaj said. "They require massive infrastructure and energy, and we don't expect a near-term future where most companies run multiple, fully dedicated model instances just to guarantee uptime."
This reality drives demand for solutions like TrueFailover that provide resilience regardless of what individual providers can promise. "Enterprises are realizing that reliability cannot come from the model provider alone," Bajaj said. "It requires additional layers of protection to handle the realities of how these systems operate today."
The new calculus for companies that built AI into critical business processes
The timing of TrueFoundry's announcement reflects a fundamental shift in how enterprises use AI, and in what they stand to lose when it fails. What began as internal experimentation has evolved into customer-facing applications where disruptions directly affect revenue and reputation.
"Many enterprises experimented with Gen AI and agentic systems in the past, and production use cases were largely internal-facing," Bajaj observed. "There was no immediate impact on their top line or the public perception of the business."
That era has ended. "Now that these enterprises have launched public-facing applications, where both the top line and public perception can be impacted if an outage occurs, the stakes are much higher than they were even six months ago," Bajaj said. "That's why we're seeing more and more attention on this now."
For companies that have woven AI into critical business processes, from prescription refills to customer support to sales operations, the calculus has changed entirely. The question is no longer which model performs best on benchmarks or which provider offers the most compelling features. The question that now keeps technology leaders awake is far simpler and far more urgent: what happens when the AI disappears at the worst possible moment?
Somewhere, a pharmacist is filling a prescription. A customer support agent is resolving a complaint. A sales team is generating a proposal for a deal that closes tomorrow. All of them depend on AI systems that depend on providers that, despite their scale and sophistication, still go dark without warning.
TrueFoundry is betting that enterprises will pay handsomely to ensure those moments of darkness never reach the people who matter most: their customers.