Doubleword

Intro to Doubleword Inference

Doubleword provides three styles of inference, each optimized for different workloads. Async and batch inference trade immediacy for significant cost savings over realtime pricing by deferring processing to asynchronous execution.

All three styles use the same OpenAI-compatible API format and share the same model catalog.

|              | Realtime | Async | Batch |
|--------------|----------|-------|-------|
| How it works | Standard request-response | Responses API with service_tier: "flex" or Autobatcher | Upload JSONL file, retrieve results later |
| Latency      | Immediate | Minutes (1h SLA) | Hours (24h SLA) |
| Cost         | Standard pricing | Reduced pricing | Lowest pricing |
| API change   | None (drop-in OpenAI replacement) | Set service_tier: "flex" on Responses API, or swap SDK import for Autobatcher | JSONL file preparation |
| Best for     | Interactive chat, prototyping, prompt iteration | Agentic workflows, background pipelines, production workloads | Dataset processing, evaluations, bulk generation |

Realtime Inference

Realtime inference works exactly like the standard OpenAI API — send a request, get an immediate response. It's ideal for interactive use cases, development, and prototyping.

Use the Chat Completions API, or the Responses API with service_tier: "priority". The Responses API also supports background: true for fire-and-forget requests that you poll for results.

No cost savings, but no latency trade-off either.
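As a minimal sketch, a realtime call is just a standard OpenAI-style POST to /chat/completions. The base URL and model name below are placeholders, not Doubleword's actual endpoint or catalog:

```python
import json
from urllib import request

API_BASE = "https://api.example-doubleword.invalid/v1"  # placeholder base URL


def build_chat_request(prompt: str, model: str = "your-model") -> dict:
    """Assemble a Chat Completions payload; realtime needs no extra parameters."""
    return {
        "model": model,  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }


def send(payload: dict, api_key: str) -> dict:
    """POST the payload to the OpenAI-compatible /chat/completions endpoint."""
    req = request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# response = send(build_chat_request("Hello!"), "YOUR_API_KEY")
# print(response["choices"][0]["message"]["content"])
```

Because the request shape is unchanged, existing OpenAI SDK code works by pointing the client's base URL at Doubleword instead.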

Get started with Realtime Inference →


Async Inference

Async inference processes your requests within a 1-hour completion window at reduced cost. Two approaches:

  • Responses API — Set service_tier: "flex" on the Responses API for native async support with background polling
  • Autobatcher — automatically converts existing Chat Completions code into async batches with a single import change
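The flex approach can be sketched as follows, assuming an OpenAI-compatible /responses endpoint; the base URL and model name are placeholders:

```python
import json
from urllib import request

API_BASE = "https://api.example-doubleword.invalid/v1"  # placeholder base URL


def build_flex_request(prompt: str, model: str = "your-model") -> dict:
    """Responses API payload opting into async processing."""
    return {
        "model": model,            # placeholder model name
        "input": prompt,
        "service_tier": "flex",    # defer to the async (1h SLA) tier
        "background": True,        # return immediately; poll for the result
    }


def submit(payload: dict, api_key: str) -> dict:
    """POST to /responses; the returned object carries an id to poll with."""
    req = request.Request(
        f"{API_BASE}/responses",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# resp = submit(build_flex_request("Summarize this ticket"), "YOUR_API_KEY")
# Then poll GET {API_BASE}/responses/{resp["id"]} until its status is "completed".
```

With background: true the submit call returns right away, so application code can queue many requests and collect results as they finish within the completion window.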

Best suited for:

  • Multi-step agentic workflows where each call doesn't need an instant response
  • Background content generation and classification pipelines
  • Any application code that can tolerate short async delays
  • Teams migrating from OpenAI who want immediate cost savings with zero refactoring

Get started with Async Inference →


Batch Inference

Batch inference is designed for large-scale data processing workloads that run outside of your application code. You upload requests as JSONL files and retrieve results when processing is complete.

With a 24-hour SLA, batch inference offers the deepest cost savings — ideal for workloads where turnaround time is measured in hours, not seconds.
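Preparing the input file can be sketched as below, assuming Doubleword follows the OpenAI-style batch JSONL format (one request object per line, with a custom_id for matching results back to inputs):

```python
import json


def build_batch_line(custom_id: str, prompt: str, model: str = "your-model") -> str:
    """One JSONL line in the OpenAI-style batch request format (assumed here)."""
    return json.dumps({
        "custom_id": custom_id,            # used to match results to requests
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,                # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
        },
    })


prompts = ["Classify: great product", "Classify: terrible support"]
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        f.write(build_batch_line(f"req-{i}", prompt) + "\n")
```

The resulting file is uploaded once, and the output file returned after processing contains one result line per custom_id.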

Best suited for:

  • Large dataset processing and transformation
  • Model evaluations and benchmarking
  • Bulk content generation and classification
  • Research workflows and data enrichment

Get started with Batch Inference →