GLM 5.2

Type: Generation
Capabilities: reasoning
Cache read: $0.19 per 1M input tokens (0.2× standard input price). See prompt caching.

Overview

Meet GLM-5.2-FP8 - Z.ai’s latest flagship open model for long-horizon agentic work, coding, and complex engineering tasks. GLM-5.2 delivers a major step up from GLM- 5.1, pairing stronger real-world coding performance with a solid 1M-token context window for sustained repository-scale workflows. It performs especially well on agentic engineering benchmarks, including SWE-bench Pro, NL2Repo, DeepSWE, Terminal Bench 2.1, FrontierSWE, and SWE-Marathon, making it well suited for extended coding sessions, terminal use, repository generation, debugging, tool orchestration, and ambiguous multi-step projects.

GLM-5.2-FP8 uses Z.ai’s improved GLM MoE architecture with FP8 quantization, IndexShare sparse-attention optimization, and enhanced speculative decoding via improved MTP, reducing long-context compute while improving throughput.

Best for:

Agentic engineering and complex software development Long-horizon coding, debugging, and terminal workflows Large-context repository understanding and generation Tool-use agents that need sustained iteration over long sessions Ambiguous engineering tasks requiring planning, experiments, and judgment Open-weight deployment where FP8 efficiency matters

Max Total Tokens: 1048576

Sampling Parameters:

We have set the default sampling parameters using the recommended values from the GLM-5.2-FP8 generation configuration:

Temperature=1.0 and TopP=0.95.

You can adjust these on a per-request basis by setting the sampling parameters in the request body.

Thinking Mode:

This model is designed for long-horizon reasoning and supports flexible thinking effort levels to balance latency and quality. It is especially useful when the task requires planning, reading code, running commands, interpreting results, identifying blockers, and iterating toward a working solution.

Reasoning efforts

Supported: none, minimal, low, medium, high, xhigh, max

See the reasoning effort guide for request examples.

Pricing

Priority	Input Tokens (per 1M)	Output Tokens (per 1M)
Realtime¹	$0.93	$3.00
Async	$0.70	$2.25
Batch (24h)	$0.47	$1.50

Playground

Open this model in the Playground.

Realtime availability is limited. Doubleword is primarily a batch API. ↩

GLM 5.2

Overview

Reasoning efforts

Pricing

Playground

Footnotes