OpenAGI emerges from stealth with an AI agent that it claims crushes OpenAI and Anthropic


A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic, at a fraction of the cost.

OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.

That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.

"Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."

The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.

Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.

Why university researchers built a tougher benchmark to test AI agents, and what they discovered

The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.

Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites, ranging from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.

The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."

When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems, despite heavy investment and marketing fanfare, did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.

"It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper. "However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."

The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.

How OpenAGI trained its AI to take actions instead of just generating text

OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.

Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at producing coherent text but were not designed to take actions in graphical environments.

Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.

"The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."

This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.
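OpenAGI has not published the details of this loop, so the following is only a toy sketch of the general idea, not the company's method. It assumes a hypothetical `ToyModel` whose single `skill` number stands in for model quality, fake "screenshots" that are just strings, and a made-up update rule in which each successful trajectory nudges `skill` upward. All names and parameters here are illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for an action model.

    `skill` is the chance that a single step picks the right action.
    """
    skill: float = 0.5

    def act(self, screenshot: str, rng: random.Random) -> str:
        # A real model would read pixels; in this toy, the "screenshot"
        # string simply names the correct action for that step.
        return screenshot if rng.random() < self.skill else "wrong"


def run_episode(model, rng, steps=3):
    """Observe-act loop: look at the screen, choose an action, repeat.

    Returns the (screenshot, action) trajectory and whether every step
    was correct.
    """
    trajectory = []
    for step in range(steps):
        screenshot = f"step-{step}"
        action = model.act(screenshot, rng)
        trajectory.append((screenshot, action))
        if action != screenshot:
            return trajectory, False
    return trajectory, True


def train(model, successes, lr=0.01):
    """Toy update (an assumption): each successful trajectory raises
    skill by `lr`, capped at 1.0."""
    model.skill = min(1.0, model.skill + lr * len(successes))


def self_evolving_loop(rounds=5, episodes=50, seed=0):
    """Explore with the current model, keep the successful trajectories,
    retrain on them, and repeat."""
    rng = random.Random(seed)
    model = ToyModel()
    history = [model.skill]
    for _ in range(rounds):
        results = [run_episode(model, rng) for _ in range(episodes)]
        successes = [traj for traj, ok in results if ok]
        train(model, successes)
        history.append(model.skill)
    return history


history = self_evolving_loop()
print([round(s, 2) for s in history])
```

In this sketch a more skilled model completes more episodes, which yields more training trajectories, which raises skill further. That feedback is the only property the toy is meant to show. It says nothing about how Lux selects, filters, or learns from real exploration data.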

OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.

Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications

A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.

Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications: spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, and code editing in development environments.

OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a software development kit for developers alongside the model, allowing third parties to build applications on top of Lux.

The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.

"We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.

The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.

What happens when you ask an AI agent to copy your bank details

Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm: transferring money, deleting files, or exfiltrating sensitive information.

OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.

In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.

Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.

The MIT researcher who built two of GitHub's most downloaded AI models

Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.

He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.

Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model whose development he led, demonstrated that a high-performing model could be trained from scratch for less than $100,000, a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.

His earlier open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated roughly 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.

Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.

Inside the billion-dollar race to build AI that controls your computer

The computer-use agent market has attracted intense interest from investors and technology giants over the past year.

OpenAI launched Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.

Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.

OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.

Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.

But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures, and that a small team with the right ideas can outmaneuver the giants.

The technology industry has seen that story before. It rarely stays true for long.
