Intro to Doubleword Inference
Doubleword provides three styles of inference, each optimized for different workloads. Async and batch inference offer significant cost savings over real-time pricing by deferring processing from synchronous to asynchronous execution.
All three styles use the same OpenAI-compatible API format and share the same model catalog.
| Realtime | Async | Batch | |
|---|---|---|---|
| How it works | Standard request-response | Responses API with service_tier: "flex" or Autobatcher | Upload JSONL file, or use Autobatcher |
| Latency | Immediate | Minutes | Hours (24h SLA) |
| Cost | Standard pricing | Reduced pricing | Lowest pricing |
| API change | None — drop-in OpenAI replacement | Set service_tier: "flex" on Responses API, or swap SDK import for Autobatcher | JSONL file preparation |
| Best for | Interactive chat, prototyping, prompt iteration | Agentic workflows, background pipelines, production workloads | Dataset processing, evaluations, bulk generation |
Realtime Inference
Realtime inference works exactly like the standard OpenAI API — send a request, get an immediate response. It's ideal for interactive use cases, development, and prototyping.
Use the Chat Completions API or the Responses API with service_tier: "priority". Supports background: true for fire-and-forget with polling.
No cost savings, but no latency trade-off either.
Get started with Realtime Inference →
Async Inference
Async inference strikes the balance between realtime and batch — a faster time-to-first-token than batch, with higher throughput than realtime — all at reduced cost. Two approaches:
- Responses API — Set
service_tier: "flex"on the Responses API for native async support with background polling - Autobatcher — The Autobatcher's
AsyncOpenAIclient automatically converts existing Chat Completions code into async batches with a single import change
Best suited for:
- Multi-step agentic workflows where each call doesn't need an instant response
- Background content generation and classification pipelines
- Any application code that can tolerate short async delays
- Teams migrating from OpenAI who want immediate cost savings with zero refactoring
Get started with Async Inference →
Batch Inference
Batch inference is designed for large-scale data processing workloads that run outside of your application code. You upload requests as JSONL files and retrieve results when processing is complete.
With a 24-hour SLA, batch inference offers the deepest cost savings — ideal for workloads where turnaround time is measured in hours, not seconds.
You can prepare requests as JSONL files directly, or use the Autobatcher's BatchOpenAI client to get batch pricing from existing Chat Completions code without writing JSONL files yourself.
Best suited for:
- Large dataset processing and transformation
- Model evaluations and benchmarking
- Bulk content generation and classification
- Research workflows and data enrichment