# Async Inference
Async inference lets you make LLM requests at reduced cost by deferring processing instead of handling it in real time. Requests are queued and processed within a 1-hour completion window, with results available via polling.
## Why Async Inference?
- **OpenAI-compatible**: uses the standard `openai` SDK with the Responses API
- **Lower cost**: async requests are priced below realtime and above batch
- **No JSONL files**: unlike batch inference, you make standard API calls
- **Background or blocking**: return immediately with a response ID, or hold the connection until complete
## When to use Async Inference
Async inference is the right choice when your application makes LLM calls that don't need to resolve instantly. Common use cases include:
- **Agentic workflows**: multi-step agent systems where individual steps can be processed asynchronously
- **Background processing**: content generation, summarization, or classification running behind a queue
- **Development and testing**: running evaluations or prompt iterations where you don't need instant feedback
- **Cost optimization**: any workload where a 1-hour completion window is acceptable
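For the background-processing case, the submission side can be sketched as a small dispatcher that files one flex request per queued item and keeps the response IDs for later polling. This is an illustrative helper, not part of any SDK; `submit` stands in for `client.responses.create` so the sketch stays self-contained:

```python
def submit_queue(submit, items, model):
    """Submit one background flex request per item.

    `submit` is any callable with the keyword signature of
    client.responses.create; injecting it keeps the sketch testable.
    Returns a mapping of item -> response ID for later polling.
    """
    ids = {}
    for item in items:
        resp = submit(
            model=model,
            input=item,
            service_tier="flex",
            background=True,
        )
        ids[item] = resp.id
    return ids
```

With a real client, `submit_queue(client.responses.create, documents, model)` would queue every document and return the IDs to poll.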
## Quick Start
### 1. Create an API Key
Generate a key from the Doubleword Console, or sign in above to auto-populate the code examples.
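The examples below inline the key via the `{{apiKey}}` template for readability; in real code it is common to read it from the environment instead. A minimal sketch, assuming an environment variable named `DOUBLEWORD_API_KEY` (a convention, not an SDK requirement):

```python
import os

def load_api_key(env_var="DOUBLEWORD_API_KEY"):
    """Fetch the API key from the environment; fail early if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before creating the client")
    return key
```

The result can then be passed as `api_key=load_api_key()` when constructing the client, keeping the key out of source control.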
### 2. Submit a request with `service_tier: "flex"`
```python
from openai import OpenAI
from time import sleep

client = OpenAI(
    base_url="https://api.doubleword.ai/v1",
    api_key="{{apiKey}}"
)

# Submit an async request — returns immediately with status "queued"
resp = client.responses.create(
    model="{{selectedModel.id}}",
    input="Explain the theory of relativity in detail.",
    service_tier="flex",
    background=True,
)
print(f"Queued: {resp.id} (status: {resp.status})")

# Poll until the daemon completes it
while resp.status in ("queued", "in_progress"):
    sleep(2)
    resp = client.responses.retrieve(resp.id)
    print(f"Status: {resp.status}")

print(f"\nOutput:\n{resp.output_text}")
```

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.doubleword.ai/v1',
  apiKey: '{{apiKey}}'
});

// Submit an async request
const resp = await client.responses.create({
  model: '{{selectedModel.id}}',
  input: 'Explain the theory of relativity in detail.',
  service_tier: 'flex',
  background: true,
});
console.log(`Queued: ${resp.id} (status: ${resp.status})`);

// Poll until complete
let result = resp;
while (['queued', 'in_progress'].includes(result.status)) {
  await new Promise(r => setTimeout(r, 2000));
  result = await client.responses.retrieve(result.id);
  console.log(`Status: ${result.status}`);
}
console.log(`\nOutput:\n${result.output_text}`);
```

## Blocking mode
If you prefer to hold the connection until the result is ready, omit `background`:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.doubleword.ai/v1",
    api_key="{{apiKey}}"
)

# Blocks until the async request completes (up to 1 hour)
resp = client.responses.create(
    model="{{selectedModel.id}}",
    input="Summarize the history of artificial intelligence.",
    service_tier="flex",
)
print(resp.output_text)
```

## How It Works
1. You submit a request with `service_tier: "flex"` via the Responses API
2. Doubleword creates a batch of 1 with a 1-hour completion window
3. The request is queued and processed by the inference daemon
4. Results are available via `GET /v1/responses/{id}` or by polling
5. Your code receives a standard Open Responses API response object
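The quick-start polling loop implements steps 3 and 4 above; it can be factored into a reusable helper with exponential backoff. This is a sketch, not part of the SDK: `retrieve` stands in for `client.responses.retrieve`, and the delay parameters are illustrative choices, not service limits.

```python
import time

def wait_for_response(retrieve, response_id, timeout=3600.0,
                      initial_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Poll `retrieve(response_id)` until the response leaves the queue.

    `retrieve` is any callable returning an object with a `.status`
    attribute. The delay doubles on each poll up to `max_delay`, and
    `timeout` bounds the total wait (matching the 1-hour window).
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    resp = retrieve(response_id)
    while resp.status in ("queued", "in_progress"):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"response {response_id} still {resp.status}")
        sleep(delay)
        delay = min(delay * 2, max_delay)
        resp = retrieve(response_id)
    return resp
```

With the real client this would be called as `wait_for_response(client.responses.retrieve, resp.id)`.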
## Using Autobatcher
For existing Chat Completions code, the Autobatcher can automatically convert your realtime calls into async batches — no code changes required beyond configuration.
```python
from autobatcher import AsyncOpenAI

client = AsyncOpenAI(
    api_key="{{apiKey}}",
    base_url="https://api.doubleword.ai/v1"
)

# Looks like a normal OpenAI call, but runs asynchronously
response = await client.chat.completions.create(
    model="{{selectedModel.id}}",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```

## Next Steps
- **Realtime Inference**: instant responses with `service_tier: "priority"`
- **Batch Inference**: lowest cost for bulk workloads
- **Autobatcher reference**: drop-in async for existing code