Intro to Doubleword Inference
Doubleword provides three styles of inference, each optimized for different workloads. Async and batch inference offer significant cost savings over real-time pricing by deferring processing from synchronous to asynchronous execution.
All three styles use the same OpenAI-compatible API format and share the same model catalog.
| | Realtime | Async | Batch |
|---|---|---|---|
| How it works | Standard request-response | Responses API with service_tier: "flex" or Autobatcher | Upload JSONL file, retrieve results later |
| Latency | Immediate | Minutes (1h SLA) | Hours (24h SLA) |
| Cost | Standard pricing | Reduced pricing | Lowest pricing |
| API change | None — drop-in OpenAI replacement | Set service_tier: "flex" on Responses API, or swap SDK import for Autobatcher | JSONL file preparation |
| Best for | Interactive chat, prototyping, prompt iteration | Agentic workflows, background pipelines, production workloads | Dataset processing, evaluations, bulk generation |
Realtime Inference
Realtime inference works exactly like the standard OpenAI API — send a request, get an immediate response. It's ideal for interactive use cases, development, and prototyping.
Use the Chat Completions API, or the Responses API with service_tier: "priority". The Responses API also supports background: true for fire-and-forget requests with polling.
No cost savings, but no latency trade-off either.
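Because the API is OpenAI-compatible, a realtime call is just a standard SDK request pointed at Doubleword's endpoint. The sketch below uses the OpenAI Python SDK; the base URL and model name are placeholders rather than real Doubleword values, so substitute the ones from your account.

```python
from openai import OpenAI

# Point the standard OpenAI SDK at Doubleword's OpenAI-compatible endpoint.
# The base_url and model name are placeholders; use the values from your
# Doubleword dashboard.
client = OpenAI(
    base_url="https://api.doubleword.example/v1",  # placeholder endpoint
    api_key="YOUR_DOUBLEWORD_API_KEY",
)

# Standard synchronous request-response; nothing beyond the base URL and
# API key changes for realtime inference.
response = client.chat.completions.create(
    model="example-model",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
)
print(response.choices[0].message.content)
```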
Get started with Realtime Inference →
Async Inference
Async inference processes your requests within a 1-hour completion window at reduced cost. Two approaches:
- Responses API — Set service_tier: "flex" on the Responses API for native async support with background polling
- Autobatcher — The Autobatcher automatically converts existing Chat Completions code into async batches with a single import change
Best suited for:
- Multi-step agentic workflows where each call doesn't need an instant response
- Background content generation and classification pipelines
- Any application code that can tolerate short async delays
- Teams migrating from OpenAI who want immediate cost savings with zero refactoring
Get started with Async Inference →
Batch Inference
Batch inference is designed for large-scale data processing workloads that run outside of your application code. You upload requests as JSONL files and retrieve results when processing is complete.
With a 24-hour SLA, batch inference offers the deepest cost savings — ideal for workloads where turnaround time is measured in hours, not seconds.
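As a rough illustration of the workflow, the sketch below writes a small JSONL file and submits it as a batch job. It assumes Doubleword follows OpenAI Batch API conventions (a file upload followed by a batch job with a 24h completion window); the endpoint, model name, and custom_id scheme are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.doubleword.example/v1",  # placeholder endpoint
    api_key="YOUR_DOUBLEWORD_API_KEY",
)

# Each JSONL line is one request, keyed by a custom_id so results can be
# matched back to inputs when the batch completes.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "example-model",  # placeholder model name
            "messages": [{"role": "user", "content": f"Summarize document {i}."}],
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file and create the batch job with a 24-hour completion window;
# this mirrors the OpenAI Batch API, which this sketch assumes Doubleword follows.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later and download the output file when done
```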
Best suited for:
- Large dataset processing and transformation
- Model evaluations and benchmarking
- Bulk content generation and classification
- Research workflows and data enrichment