Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers
Scoopico
Last updated: November 8, 2025 12:26 am
Published: November 8, 2025

The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in containerized environments.

The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a harder and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

Version 2.0 addresses these issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team released Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Full integration with Terminal-Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It’s now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command line interface), a GPT-5 powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
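
To make the percentages above more concrete, the following minimal sketch converts each reported success rate into an approximate task count out of the 89 tasks in the suite. It assumes the success rate can be read as a simple fraction of the task set, which may not match how the leaderboard aggregates repeated runs; the function name `approx_tasks_solved` is illustrative only.

```python
# Success rates as reported in the article; TASK_COUNT is the stated suite size.
TASK_COUNT = 89

leaderboard = [
    ("Codex CLI (GPT-5)", 49.6),
    ("Codex CLI (GPT-5-Codex)", 44.3),
    ("OpenHands (GPT-5)", 43.8),
    ("Terminus 2 (GPT-5-Codex)", 43.4),
    ("Terminus 2 (Claude Sonnet 4.5)", 42.8),
]


def approx_tasks_solved(success_rate_pct: float, task_count: int = TASK_COUNT) -> float:
    """Assumption: treat the success rate as a plain fraction of the task set."""
    return success_rate_pct / 100 * task_count


for name, rate in leaderboard:
    print(f"{name}: {rate}% is roughly {approx_tasks_solved(rate):.1f} of {TASK_COUNT} tasks")

# Gap between the first and fifth entries, in percentage points.
rates = [rate for _, rate in leaderboard]
spread = max(rates) - min(rates)
print(f"Spread across the top five: {spread:.1f} percentage points")
```

Under that assumption, every entry stays below the 44.5-task halfway mark, which is consistent with the observation that no agent solves more than half the suite.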

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.
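
For scripted use, the command above could be assembled programmatically. The sketch below only builds and prints the argument list; it uses just the flags that appear in the article's command, on the assumption that the long options are written with the usual double hyphen, and it has not been checked against Harbor's documentation. The helper name `build_harbor_command` and the values `my-model`, `my-agent`, and `jobs/out` are hypothetical placeholders.

```python
import shlex


def build_harbor_command(model: str, agent: str, jobs_dir: str, attempts: int = 5) -> list[str]:
    """Assemble the `harbor run` invocation quoted in the article.

    Flags are copied from the article's command line (an assumption, not a
    verified API); the default of 5 attempts mirrors the stated leaderboard
    requirement of five benchmark runs.
    """
    return [
        "harbor", "run",
        "-d", "terminal-bench@2.0",   # dataset identifier from the article
        "-m", model,                  # placeholder model name
        "-a", agent,                  # placeholder agent name
        "--n-attempts", str(attempts),
        "--jobs-dir", jobs_dir,       # where job directories are written
    ]


# Hypothetical values, for illustration only.
command = build_harbor_command("my-model", "my-agent", "jobs/out")
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where Harbor is installed; passing a list rather than a single string avoids shell quoting problems in model or path names.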

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
