Intro to Doubleword Inference | Doubleword Inference API

Doubleword provides three styles of inference, each optimized for different workloads. Async and batch inference offer significant cost savings over real-time pricing by deferring processing from synchronous to asynchronous execution.

All three styles use the same OpenAI-compatible API format and share the same model catalog.

	Realtime	Async	Batch
How it works	Standard request-response	Responses API with `service_tier: "flex"` or Autobatcher	Upload JSONL file, retrieve results later
Latency	Immediate	Minutes (1h SLA)	Hours (24h SLA)
Cost	Standard pricing	Reduced pricing	Lowest pricing
API change	None — drop-in OpenAI replacement	Set `service_tier: "flex"` on Responses API, or swap SDK import for Autobatcher	JSONL file preparation
Best for	Interactive chat, prototyping, prompt iteration	Agentic workflows, background pipelines, production workloads	Dataset processing, evaluations, bulk generation

Realtime Inference

Realtime inference works exactly like the standard OpenAI API — send a request, get an immediate response. It's ideal for interactive use cases, development, and prototyping.

Use the Chat Completions API or the Responses API with service_tier: "priority". Supports background: true for fire-and-forget with polling.

No cost savings, but no latency trade-off either.

Get started with Realtime Inference →

Async Inference

Async inference processes your requests within a 1-hour completion window at reduced cost. Two approaches:

Responses API — Set service_tier: "flex" on the Responses API for native async support with background polling
Autobatcher — The Autobatcher automatically converts existing Chat Completions code into async batches with a single import change

Best suited for:

Multi-step agentic workflows where each call doesn't need an instant response
Background content generation and classification pipelines
Any application code that can tolerate short async delays
Teams migrating from OpenAI who want immediate cost savings with zero refactoring

Get started with Async Inference →

Batch Inference

Batch inference is designed for large-scale data processing workloads that run outside of your application code. You upload requests as JSONL files and retrieve results when processing is complete.

With a 24-hour SLA, batch inference offers the deepest cost savings — ideal for workloads where turnaround time is measured in hours, not seconds.

Best suited for:

Large dataset processing and transformation
Model evaluations and benchmarking
Bulk content generation and classification
Research workflows and data enrichment

Get started with Batch Inference →