Chinese AI startup Zhipu AI, also known as Z.ai, has launched its GLM-4.6V series, a new generation of open-source vision-language models (VLMs) optimized for multimodal reasoning, frontend automation, and high-efficiency deployment.
The release includes two models, in "large" and "small" sizes:
- GLM-4.6V (106B), a larger 106-billion-parameter model aimed at cloud-scale inference
- GLM-4.6V-Flash (9B), a smaller model of only 9 billion parameters designed for low-latency, local applications
Recall that, generally speaking, models with more parameters (the internal settings, i.e. weights and biases, that govern their behavior) are more powerful, more performant, and capable of operating at a higher standard across a wider variety of tasks.
However, smaller models can offer better efficiency for edge or real-time applications where latency and resource constraints are critical.
The defining innovation in this series is the introduction of native function calling in a vision-language model, enabling direct use of tools such as search, cropping, or chart recognition with visual inputs.
With a 128,000-token context length (roughly a 300-page novel's worth of text exchanged in a single input/output interaction with the user) and state-of-the-art (SoTA) results across more than 20 benchmarks, the GLM-4.6V series positions itself as a highly competitive alternative to both closed and open-source VLMs. It is available in the following formats:
- API access via an OpenAI-compatible interface (a usage sketch follows this list)
- A demo on Zhipu's web interface
- Downloadable weights on Hugging Face
- A desktop assistant app available on Hugging Face Spaces
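For teams evaluating the API route, the snippet below sketches a minimal multimodal request against an OpenAI-compatible endpoint. The base URL, model identifier, and API-key environment variable are illustrative assumptions, not confirmed values from Zhipu's documentation.

```python
# Minimal sketch: calling GLM-4.6V through an OpenAI-compatible endpoint.
# The base_url, model name, and env-var name below are illustrative
# assumptions; check Zhipu AI's official API docs for the exact values.
import base64
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["ZHIPU_API_KEY"],       # assumed env var name
    base_url="https://api.z.ai/api/paas/v4",   # assumed endpoint
)

# Encode a local screenshot as a data URL so it can travel in the request.
with open("dashboard.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.6v",                          # assumed model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": "Summarize the key metrics shown in this dashboard."},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```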
Licensing and Enterprise Use
GLM-4.6V and GLM-4.6V-Flash are distributed under the MIT license, a permissive open-source license that allows free commercial and non-commercial use, modification, redistribution, and local deployment without any obligation to open-source derivative works.
This licensing model makes the series suitable for enterprise adoption, including scenarios that require full control over infrastructure, compliance with internal governance, or air-gapped environments.
Model weights and documentation are publicly hosted on Hugging Face, with supporting code and tooling available on GitHub.
The MIT license ensures maximum flexibility for integration into proprietary systems, including internal tools, production pipelines, and edge deployments.
Architecture and Technical Capabilities
The GLM-4.6V models follow a conventional encoder-decoder architecture with significant adaptations for multimodal input.
Both models incorporate a Vision Transformer (ViT) encoder, based on AIMv2-Large, and an MLP projector to align visual features with a large language model (LLM) decoder.
Video inputs benefit from 3D convolutions and temporal compression, while spatial encoding is handled using 2D-RoPE and bicubic interpolation of absolute positional embeddings.
A key technical feature is the system's support for arbitrary image resolutions and aspect ratios, including extreme panoramic inputs up to 200:1.
In addition to static image and document parsing, GLM-4.6V can ingest temporal sequences of video frames with explicit timestamp tokens, enabling robust temporal reasoning.
On the decoding side, the model supports token generation aligned with function-calling protocols, allowing for structured reasoning across text, image, and tool outputs. This is backed by an extended tokenizer vocabulary and output formatting templates to ensure consistent API and agent compatibility.
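To make the encoder-projector-decoder flow described above more concrete, here is a minimal, runnable PyTorch sketch of the general pattern: a ViT-style encoder whose patch features are projected into an LLM's embedding space. The dimensions, layer counts, and module choices are illustrative assumptions, not the published GLM-4.6V architecture, and the decoder, 2D-RoPE, and video path are omitted.

```python
# Schematic sketch of a ViT-encoder + MLP-projector pipeline, the general
# pattern described above. Sizes and modules are illustrative assumptions,
# not the published GLM-4.6V implementation.
import torch
import torch.nn as nn


class VisionLanguageSketch(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096, patch=14):
        super().__init__()
        # Stand-in for the ViT encoder (AIMv2-Large in the release notes):
        # patchify the image, then run a small Transformer encoder.
        self.patchify = nn.Conv2d(3, vit_dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vit_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # MLP projector aligning visual features with the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, image):
        # image: (batch, 3, H, W), with H and W divisible by the patch size.
        patches = self.patchify(image).flatten(2).transpose(1, 2)  # (B, N, vit_dim)
        visual_tokens = self.encoder(patches)
        # The projected tokens would be interleaved with text embeddings and
        # fed to the LLM decoder; the decoder itself is omitted here.
        return self.projector(visual_tokens)


tokens = VisionLanguageSketch()(torch.randn(1, 3, 336, 336))
print(tokens.shape)  # (1, 576, 4096): one aligned token per image patch
```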
Native Multimodal Tool Use
GLM-4.6V introduces native multimodal function calling, allowing visual assets such as screenshots, photos, and documents to be passed directly as parameters to tools. This eliminates the need for intermediate text-only conversions, which have historically introduced information loss and complexity.
The tool invocation mechanism works in both directions:
- Input tools can be passed images or videos directly (e.g., document pages to crop or analyze).
- Output tools such as chart renderers or web snapshot utilities return visual data, which GLM-4.6V integrates directly into the reasoning chain.
In practice, this means GLM-4.6V can complete tasks such as the following (a sketch of such a tool call appears after this list):
- Generating structured reports from mixed-format documents
- Performing visual audits of candidate images
- Automatically cropping figures from papers during generation
- Conducting visual web searches and answering multimodal queries
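The sketch below illustrates what a visual tool call could look like through an OpenAI-compatible interface: a document page is attached as an image, and the model may choose to invoke a cropping tool on it. The tool name and schema, endpoint, and model id are hypothetical, used only to show the shape of the interaction.

```python
# Hedged sketch of multimodal tool use: the model is offered a "crop_figure"
# tool it can invoke on an attached page image. Tool name, schema, model id,
# and endpoint are illustrative assumptions.
import base64
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["ZHIPU_API_KEY"],
                base_url="https://api.z.ai/api/paas/v4")  # assumed endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "crop_figure",                 # hypothetical tool
        "description": "Crop a figure from the attached page image.",
        "parameters": {
            "type": "object",
            "properties": {
                "bbox": {"type": "array", "items": {"type": "number"},
                         "description": "x1, y1, x2, y2 in pixels"},
            },
            "required": ["bbox"],
        },
    },
}]

with open("paper_page_3.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.6v",                          # assumed model id
    tools=tools,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
            {"type": "text", "text": "Extract Figure 2 from this page."},
        ],
    }],
)

# If the model decided to call the tool, its arguments reference the image
# it was shown directly; no text-only intermediate step is required.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```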
High Benchmark Performance Compared to Other Similarly Sized Models
GLM-4.6V was evaluated across more than 20 public benchmarks covering general VQA, chart understanding, OCR, STEM reasoning, frontend replication, and multimodal agents.
According to the benchmark chart released by Zhipu AI:
- GLM-4.6V (106B) achieves SoTA or near-SoTA scores among open-source models of comparable size on MMBench, MathVista, MMLongBench, ChartQAPro, RefCOCO, TreeBench, and more.
- GLM-4.6V-Flash (9B) outperforms other lightweight models (e.g., Qwen3-VL-8B, GLM-4.1V-9B) across nearly all categories tested.
- The 106B model's 128K-token window allows it to outperform larger models such as Step-3 (321B) and Qwen3-VL-235B on long-context document tasks, video summarization, and structured multimodal reasoning.
Example scores from the leaderboard include:
- MathVista: 88.2 (GLM-4.6V) vs. 84.6 (GLM-4.5V) vs. 81.4 (Qwen3-VL-8B)
- WebVoyager: 81.0 vs. 68.4 (Qwen3-VL-8B)
- Ref-L4-test: 88.9 vs. 89.5 (GLM-4.5V), but with better grounding fidelity at 87.7 (Flash) vs. 86.8
Both models were evaluated using the vLLM inference backend and support SGLang for video-based tasks.
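For self-hosting, the minimal vLLM sketch below shows the general shape of offline multimodal inference. The Hugging Face repository id and the image placeholder in the prompt are assumptions based on Zhipu's naming conventions; the model card and chat template should be treated as the source of truth.

```python
# Hedged sketch of local inference with vLLM's Python API. The repo id
# "zai-org/GLM-4.6V-Flash" and the "<image>" placeholder are assumptions;
# confirm both against the Hugging Face model card before use.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.6V-Flash",   # assumed repo id (9B variant)
    trust_remote_code=True,           # custom multimodal code may be required
    max_model_len=32768,              # trade context length for local memory
)

image = Image.open("chart.png")
params = SamplingParams(temperature=0.2, max_tokens=512)

outputs = llm.generate(
    {
        # The image placeholder token is model-specific; check the chat template.
        "prompt": "<image>\nDescribe the chart in this image.",
        "multi_modal_data": {"image": image},
    },
    params,
)
print(outputs[0].outputs[0].text)
```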
Frontend Automation and Long-Context Workflows
Zhipu AI emphasized GLM-4.6V's ability to support frontend development workflows. The model can:
- Replicate pixel-accurate HTML/CSS/JS from UI screenshots
- Accept natural-language editing commands to modify layouts
- Identify and manipulate specific UI elements visually
This capability is integrated into an end-to-end visual programming interface, where the model iterates on layout, design intent, and output code using its native understanding of screen captures. A sketch of what such a workflow might look like over the API follows.
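The two-turn sketch below shows that loop over an OpenAI-compatible API: the first turn replicates a screenshot as HTML, and the second applies a natural-language layout edit. The endpoint, model id, and file names are assumptions for illustration.

```python
# Hedged sketch of a two-turn frontend workflow: replicate a UI from a
# screenshot, then revise the generated code with a plain-language edit.
# Endpoint, model id, and API-key variable are illustrative assumptions.
import base64
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["ZHIPU_API_KEY"],
                base_url="https://api.z.ai/api/paas/v4")  # assumed endpoint

with open("landing_page.png", "rb") as f:
    shot = base64.b64encode(f.read()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{shot}"}},
        {"type": "text",
         "text": "Replicate this page as a single self-contained HTML file."},
    ],
}]
first = client.chat.completions.create(model="glm-4.6v", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up edit expressed in natural language rather than as a code diff.
messages.append({"role": "user",
                 "content": "Move the signup form above the fold and make the header sticky."})
second = client.chat.completions.create(model="glm-4.6v", messages=messages)
print(second.choices[0].message.content)
```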
In long-document scenarios, GLM-4.6V can process up to 128,000 tokens, enabling a single inference pass across:
- 150 pages of text (input)
- 200-slide decks
- 1-hour videos
Zhipu AI reported successful use of the model in financial analysis across multi-document corpora and in summarizing full-length sports broadcasts with timestamped event detection.
Training and Reinforcement Learning
The model was trained using multi-stage pre-training followed by supervised fine-tuning (SFT) and reinforcement learning (RL). Key innovations include:
- Reinforcement Learning with Curriculum Sampling (RLCS): dynamically adjusts the difficulty of training samples based on model progress
- Multi-domain reward systems: task-specific verifiers for STEM, chart reasoning, GUI agents, video QA, and spatial grounding
- Function-aware training: uses structured tags (e.g., <think>, <answer>, <|begin_of_box|>) to align reasoning and answer formatting
The reinforcement learning pipeline emphasizes verifiable rewards (RLVR) over human feedback (RLHF) for scalability, and avoids KL/entropy losses to stabilize training across multimodal domains.
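Zhipu has not released training code, but the intuition behind curriculum sampling (serving samples whose difficulty tracks the model's measured ability) can be conveyed with a small generic sketch. The Gaussian weighting scheme below is an illustrative assumption, not Zhipu's RLCS implementation.

```python
# Generic illustration of difficulty-aware curriculum sampling: samples whose
# difficulty sits near the model's current success rate are drawn more often.
# This weighting scheme is an illustrative assumption, not Zhipu's RLCS recipe.
import math
import random


def curriculum_weights(difficulties, current_ability, sharpness=8.0):
    """Weight each sample by how close its difficulty (0-1) is to the
    model's current ability estimate (0-1)."""
    return [math.exp(-sharpness * (d - current_ability) ** 2) for d in difficulties]


def sample_batch(pool, current_ability, batch_size=4):
    weights = curriculum_weights([s["difficulty"] for s in pool], current_ability)
    return random.choices(pool, weights=weights, k=batch_size)


pool = [{"id": i, "difficulty": i / 10} for i in range(11)]
# Early in training the model mostly sees easy samples...
print([s["id"] for s in sample_batch(pool, current_ability=0.2)])
# ...and is served harder ones as its measured success rate improves.
print([s["id"] for s in sample_batch(pool, current_ability=0.8)])
```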
Pricing (API)
Zhipu AI offers competitive pricing for the GLM-4.6V series, with both the flagship model and its lightweight variant positioned for high accessibility.
- GLM-4.6V: $0.30 (input) / $0.90 (output) per 1M tokens
- GLM-4.6V-Flash: Free
Compared to leading vision-capable and text-first LLMs, GLM-4.6V is among the most cost-efficient options for multimodal reasoning at scale. Below is a comparative snapshot of pricing across providers:
USD per 1M tokens, sorted from lowest to highest total cost

| Model | Input | Output | Total Cost |
| --- | --- | --- | --- |
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 |
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 |
| deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| GLM-4.6V | $0.30 | $0.90 | $1.20 |
| Qwen 3 Plus | $0.40 | $1.20 | $1.60 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Qwen-Max | $1.60 | $6.40 | $8.00 |
| GPT-5.1 | $1.25 | $10.00 | $11.25 |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 |
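Because billing is a straightforward per-token rate, estimating a workload's cost takes only a few lines. The sketch below uses the published GLM-4.6V rates from the table above; the request volume and per-request token counts are hypothetical.

```python
# Quick cost estimate from the published GLM-4.6V API rates
# ($0.30 / $0.90 per 1M input / output tokens). The workload numbers
# (requests per day, tokens per request) are hypothetical.
INPUT_RATE = 0.30 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.90 / 1_000_000  # USD per output token

requests_per_day = 10_000
input_tokens_per_request = 6_000    # e.g., a screenshot plus instructions
output_tokens_per_request = 1_200   # e.g., generated HTML or a report

daily_cost = requests_per_day * (
    input_tokens_per_request * INPUT_RATE
    + output_tokens_per_request * OUTPUT_RATE
)
print(f"Daily: ${daily_cost:,.2f}  Monthly (30d): ${daily_cost * 30:,.2f}")
# Daily: $28.80  Monthly (30d): $864.00
```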
Previous Releases: The GLM-4.5 Series and Enterprise Applications
Prior to GLM-4.6V, Z.ai released the GLM-4.5 family in mid-2025, establishing the company as a serious contender in open-source LLM development.
The flagship GLM-4.5 and its smaller sibling GLM-4.5-Air both support reasoning, tool use, coding, and agentic behaviors, while offering strong performance across standard benchmarks.
The models introduced dual reasoning modes ("thinking" and "non-thinking") and could automatically generate full PowerPoint presentations from a single prompt, a feature positioned for use in enterprise reporting, education, and internal comms workflows. Z.ai also extended the GLM-4.5 series with additional variants such as GLM-4.5-X, AirX, and Flash, targeting ultra-fast inference and low-cost scenarios.
Together, these features position the GLM-4.5 series as a cost-effective, open, and production-ready alternative for enterprises needing autonomy over model deployment, lifecycle management, and integration pipelines.
Ecosystem Implications
The GLM-4.6V launch represents a notable advance in open-source multimodal AI. While large vision-language models have proliferated over the past year, few offer:
- Integrated visual tool use
- Structured multimodal generation
- Agent-oriented memory and decision logic
Zhipu AI's emphasis on "closing the loop" from perception to action via native function calling marks a step toward agentic multimodal systems.
The model's architecture and training pipeline show a continued evolution of the GLM family, positioning it competitively alongside offerings like OpenAI's GPT-4V and Google DeepMind's Gemini-VL.
Takeaway for Enterprise Leaders
With GLM-4.6V, Zhipu AI introduces an open-source VLM capable of native visual tool use, long-context reasoning, and frontend automation. It sets new performance marks among models of comparable size and provides a scalable platform for building agentic, multimodal AI systems.