Intro to Doubleword Inference
Doubleword provides three styles of inference, each optimized for different workloads. Async and batch inference offer significant cost savings over real-time pricing by deferring processing from synchronous to asynchronous execution.
All three styles use the same OpenAI-compatible API format and share the same model catalog.
| | Realtime | Async | Batch |
|---|---|---|---|
| How it works | Standard request-response | Responses API with service_tier: "flex" or Autobatcher | Upload JSONL file, retrieve results later |
| Latency | Immediate | Minutes (1h SLA) | Hours (24h SLA) |
| Cost | Standard pricing | Reduced pricing | Lowest pricing |
| API change | None — drop-in OpenAI replacement | Set service_tier: "flex" on Responses API, or swap SDK import for Autobatcher | JSONL file preparation |
| Best for | Interactive chat, prototyping, prompt iteration | Agentic workflows, background pipelines, production workloads | Dataset processing, evaluations, bulk generation |
Realtime Inference
Realtime inference works exactly like the standard OpenAI API — send a request, get an immediate response. It's ideal for interactive use cases, development, and prototyping.
Use the Chat Completions API, or the Responses API with service_tier: "priority". The Responses API also supports background: true for fire-and-forget requests with polling.
No cost savings, but no latency trade-off either.
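Because the API is OpenAI-compatible, a realtime call is just a standard SDK request pointed at Doubleword's endpoint. The sketch below uses the OpenAI Python SDK; the base URL and model name are placeholders rather than real Doubleword values, so substitute the ones from your account.

```python
from openai import OpenAI

# Point the standard OpenAI SDK at Doubleword's OpenAI-compatible endpoint.
# The base_url and model name are placeholders; use the values from your
# Doubleword dashboard.
client = OpenAI(
    base_url="https://api.doubleword.example/v1",  # placeholder endpoint
    api_key="YOUR_DOUBLEWORD_API_KEY",
)

# Standard synchronous request-response; nothing beyond the base URL and
# API key changes for realtime inference.
response = client.chat.completions.create(
    model="example-model",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
)
print(response.choices[0].message.content)
```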
Get started with Realtime Inference →
Async Inference
Async inference processes your requests within a 1-hour completion window at reduced cost. Two approaches:
- Responses API — Set service_tier: "flex" on the Responses API for native async support with background polling
- Autobatcher — The Autobatcher automatically converts existing Chat Completions code into async batches with a single import change
Best suited for:
- Multi-step agentic workflows where each call doesn't need an instant response
- Background content generation and classification pipelines
- Any application code that can tolerate short async delays
- Teams migrating from OpenAI who want immediate cost savings with zero refactoring
Get started with Async Inference →
Batch Inference
Batch inference is designed for large-scale data processing workloads that run outside of your application code. You upload requests as JSONL files and retrieve results when processing is complete.
With a 24-hour SLA, batch inference offers the deepest cost savings — ideal for workloads where turnaround time is measured in hours, not seconds.
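As a rough illustration of the workflow, the sketch below writes a small JSONL file and submits it as a batch job. It assumes Doubleword follows OpenAI Batch API conventions (a file upload followed by a batch job with a 24h completion window); the endpoint, model name, and custom_id scheme are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.doubleword.example/v1",  # placeholder endpoint
    api_key="YOUR_DOUBLEWORD_API_KEY",
)

# Each JSONL line is one request, keyed by a custom_id so results can be
# matched back to inputs when the batch completes.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "example-model",  # placeholder model name
            "messages": [{"role": "user", "content": f"Summarize document {i}."}],
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file and create the batch job with a 24-hour completion window;
# this mirrors the OpenAI Batch API, which this sketch assumes Doubleword follows.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later and download the output file when done
```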
Best suited for:
- Large dataset processing and transformation
- Model evaluations and benchmarking
- Bulk content generation and classification
- Research workflows and data enrichment