Fresh off releasing the latest version of its Olmo base model, the Allen Institute for AI (Ai2) launched its open-source video model, Molmo 2, on Tuesday, aiming to show that smaller, open models can be viable options for enterprises focused on video understanding and analysis.
In a press release, the company said Molmo 2 “takes Molmo’s strengths in grounded vision and expands them to video and multi-image understanding,” a capability that has largely been dominated by larger proprietary models.
Ai2 released three variants of Molmo 2:

- Molmo 2 8B, a Qwen 3–based model that Ai2 describes as its “best overall model for video grounding and QA”
- Molmo 2 4B, designed for more efficient deployments
- Molmo 2-O 7B, built on the Olmo model
Molmo 2 supports single-image and multi-image inputs, as well as video clips of varying lengths, enabling tasks such as video grounding, tracking, and question answering.
“One of our core design goals was to close a major gap in open models: grounding,” Ai2 said in its press release.
The company first released the Molmo family of open multimodal models last year, beginning with images. Ai2 said Molmo 2 surpasses earlier versions in accuracy, temporal understanding, and pixel-level grounding, and in some cases performs competitively with larger models such as Google’s Gemini 3.
How Molmo 2 compares
Despite their smaller size, the Molmo 2 models outperformed Gemini 3 Pro and other open-weight competitors on video tracking benchmarks.
For image and multi-image reasoning, Ai2 said Molmo 2 8B “leads all open-weight models, with the 4B variant close behind.” The 8B and 4B models also showed strong performance in the open-weight Elo human preference evaluation, though Ai2 noted that larger proprietary models continue to lead that benchmark overall.
But Molmo 2’s biggest gains are in video grounding and video counting, where it outscores comparable open-weight models.
“These results highlight both progress and remaining headroom — video grounding is still hard, and no model yet reaches 40% accuracy,” Ai2 said, referring to current benchmarks.
Many video models, such as Google's Veo 3.1 and OpenAI's Sora, are often very large. Molmo 2 targets a different tradeoff: smaller, open models optimized for grounding and analysis rather than video generation.