Async Inference | Doubleword Inference API

Async inference lets you make LLM requests at reduced cost by deferring processing instead of running it in real time. Requests are queued and processed within a 1-hour completion window. You retrieve results by polling, or by holding the connection open until the response is ready.

Why Async Inference?

OpenAI-compatible — Uses the standard openai SDK with the Responses API
Lower cost — Async requests are priced below realtime, above batch
No JSONL files — Unlike batch inference, you make standard API calls
Background or blocking — Return immediately with a response ID, or hold the connection until complete

When to use Async Inference

Async inference is the right choice when your application makes LLM calls that don't need to resolve instantly. Common use cases include:

Agentic workflows — Multi-step agent systems where individual steps can be processed asynchronously
Background processing — Content generation, summarization, or classification running behind a queue
Development and testing — Running evaluations or prompt iterations where you don't need instant feedback
Cost optimization — Any workload where a 1-hour completion window is acceptable

Quick Start

1. Create an API Key

Generate a key from the Doubleword Console. In the examples below, replace {{apiKey}} with your key and {{selectedModel.id}} with the ID of the model you want to call.

2. Submit a request with `service_tier: "flex"`

from openai import OpenAI
from time import sleep

client = OpenAI(
    base_url="https://api.doubleword.ai/v1",
    api_key="{{apiKey}}"
)

# Submit an async request — returns immediately with status "queued"
resp = client.responses.create(
    model="{{selectedModel.id}}",
    input="Explain the theory of relativity in detail.",
    service_tier="flex",
    background=True,
)

print(f"Queued: {resp.id} (status: {resp.status})")

# Poll until the daemon completes it
while resp.status in ("queued", "in_progress"):
    sleep(2)
    resp = client.responses.retrieve(resp.id)
    print(f"Status: {resp.status}")

print(f"\nOutput:\n{resp.output_text}")

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.doubleword.ai/v1',
  apiKey: '{{apiKey}}'
});

// Submit an async request
const resp = await client.responses.create({
  model: '{{selectedModel.id}}',
  input: 'Explain the theory of relativity in detail.',
  service_tier: 'flex',
  background: true,
});

console.log(`Queued: ${resp.id} (status: ${resp.status})`);

// Poll until complete
let result = resp;
while (['queued', 'in_progress'].includes(result.status)) {
  await new Promise(r => setTimeout(r, 2000));
  result = await client.responses.retrieve(result.id);
  console.log(`Status: ${result.status}`);
}

console.log(`\nOutput:\n${result.output_text}`);
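
The polling loops above run until the status changes, with no upper bound and a fixed 2-second interval. A minimal sketch of a bounded alternative follows. The helper name wait_for_response and its parameters are hypothetical, and the set of terminal statuses is taken from the OpenAI Responses API rather than from Doubleword's documentation, so treat it as an assumption.

```python
import time

# Statuses after which a response no longer changes.
# Assumption: these match the OpenAI Responses API status values.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "incomplete"}


def wait_for_response(retrieve, response_id, timeout=3600.0, interval=2.0,
                      max_interval=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll retrieve(response_id) until a terminal status or the deadline.

    retrieve is any callable that returns an object with a .status attribute,
    for example client.responses.retrieve.
    """
    deadline = clock() + timeout
    while True:
        resp = retrieve(response_id)
        if resp.status in TERMINAL_STATUSES:
            return resp
        if clock() >= deadline:
            raise TimeoutError(
                f"{response_id} still {resp.status!r} after {timeout}s"
            )
        sleep(interval)
        # Back off gradually so long-running requests are polled less often.
        interval = min(interval * 1.5, max_interval)
```

With the client from the example above, this could be called as wait_for_response(client.responses.retrieve, resp.id). Check that the returned status is "completed" before reading output_text, since a failed or cancelled response also ends the loop.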

Blocking mode

To hold the connection open until the result is ready, omit the background parameter:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.doubleword.ai/v1",
    api_key="{{apiKey}}"
)

# Blocks until the async request completes (up to 1 hour)
resp = client.responses.create(
    model="{{selectedModel.id}}",
    input="Summarize the history of artificial intelligence.",
    service_tier="flex",
)

print(resp.output_text)
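
Because a blocking call can stay open for up to an hour, the client's own request timeout matters. At the time of writing, the openai Python SDK defaults to a 10-minute timeout and retries failed requests twice, so a long-running blocking call may be cut off or resubmitted unless those settings are changed. The sketch below builds suitable constructor options. The helper name blocking_client_options and the 60-second margin are hypothetical choices, not Doubleword recommendations.

```python
ASYNC_WINDOW_SECONDS = 60 * 60  # the 1-hour completion window described above


def blocking_client_options(window_seconds=ASYNC_WINDOW_SECONDS, margin_seconds=60):
    """Keyword arguments for an OpenAI(...) client used in blocking mode."""
    return {
        "base_url": "https://api.doubleword.ai/v1",
        # Wait slightly longer than the completion window before giving up.
        "timeout": float(window_seconds + margin_seconds),
        # Assumption: a request that timed out should not be resubmitted
        # automatically, so SDK retries are switched off.
        "max_retries": 0,
    }


# Usage (requires the openai package and a valid key in API_KEY):
# from openai import OpenAI
# client = OpenAI(api_key=API_KEY, **blocking_client_options())
```

Whether intermediate proxies or load balancers keep an idle connection open that long is outside the client's control. Background mode with polling avoids the question.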

How It Works

1. You submit a request with service_tier: "flex" via the Responses API.
2. Doubleword creates a batch of 1 with a 1-hour completion window.
3. The request is queued and processed by the inference daemon.
4. In background mode, you poll GET /v1/responses/{id} until the request finishes. In blocking mode, the original call returns once it does.
5. Your code receives a standard Open Responses API response object.
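
For readers who want to see the flow without the SDK, here is a sketch of the two HTTP requests involved, built with the Python standard library. It assumes the endpoint accepts the same Bearer authorization header and JSON body that the openai SDK sends. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.doubleword.ai/v1"


def build_submit_request(api_key, model, prompt):
    """POST /v1/responses with service_tier "flex" and background enabled."""
    body = {
        "model": model,
        "input": prompt,
        "service_tier": "flex",
        "background": True,
    }
    return urllib.request.Request(
        f"{BASE_URL}/responses",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_poll_request(api_key, response_id):
    """GET /v1/responses/{id} to read the current status and, once done, the output."""
    return urllib.request.Request(
        f"{BASE_URL}/responses/{response_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def send(request):
    """Send a prepared request and decode the JSON reply (performs network I/O)."""
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Submitting corresponds to send(build_submit_request(key, model, prompt)), and each poll to send(build_poll_request(key, response_id)), repeated until the returned status is no longer "queued" or "in_progress".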

Using Autobatcher

For existing Chat Completions code, the Autobatcher can automatically convert your realtime calls into async batches. As the example shows, the only change is importing the client from autobatcher instead of openai and pointing it at the Doubleword base URL.

import asyncio

from autobatcher import AsyncOpenAI


async def main():
    client = AsyncOpenAI(
        api_key="{{apiKey}}",
        base_url="https://api.doubleword.ai/v1"
    )

    # Looks like a normal OpenAI call, but runs asynchronously
    response = await client.chat.completions.create(
        model="{{selectedModel.id}}",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
    )

    print(response.choices[0].message.content)


# await is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())
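
A batching client is most useful when several calls are in flight at once. The sketch below issues one chat completion per prompt concurrently with asyncio.gather. It relies only on the chat.completions.create coroutine shown above. The helper name ask_all is hypothetical, and how the Autobatcher groups concurrent calls is not specified here, so that part is an assumption.

```python
import asyncio


async def ask_all(client, model, prompts):
    """Send one chat completion per prompt concurrently; return reply texts in order."""

    async def ask(prompt):
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # gather() schedules every call before waiting on any of them
    # and returns results in the same order as the inputs.
    return await asyncio.gather(*(ask(p) for p in prompts))
```

Inside a coroutine this could be called as answers = await ask_all(client, model_id, prompts), where client is the AsyncOpenAI instance from the example above.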

Next Steps

Realtime Inference — instant responses with service_tier: "priority"
Batch Inference — lowest cost for bulk workloads
Autobatcher reference — drop-in async for existing code