Async Inference
Async inference lets you make LLM requests at reduced cost by relaxing latency requirements. It balances latency and throughput — faster turnaround than batch, higher throughput than realtime — and results are available via polling.
The request flow mirrors realtime background mode; the one difference is service_tier: "flex", which accepts minutes-scale latency in exchange for a lower price.
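To make the "one difference" concrete, here is a minimal sketch that builds the keyword arguments for a background request in both tiers and compares them. It assumes realtime requests use service_tier: "priority", as the Next Steps section suggests; the helper name background_request and the model ID "example-model" are placeholders, not part of any API.

```python
from typing import Any, Dict


def background_request(model: str, prompt: str, service_tier: str) -> Dict[str, Any]:
    """Keyword arguments for client.responses.create() in background mode."""
    return {
        "model": model,
        "input": prompt,
        "service_tier": service_tier,
        "background": True,
    }


# "example-model" is a placeholder model ID (assumption, not a real model name)
realtime = background_request("example-model", "Hello", "priority")
flex = background_request("example-model", "Hello", "flex")

# Collect the keys whose values differ between the two requests
differing = sorted(key for key in flex if flex[key] != realtime[key])
print(differing)  # ['service_tier']
```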
Why Async Inference?
- OpenAI-compatible: uses the standard openai SDK with the Open Responses API
- Lower cost: async requests are priced below realtime, above batch
- No JSONL files: unlike batch inference, you make standard API calls
- Background or blocking: return immediately with a response ID, or hold the connection until complete
When to use Async Inference
Async inference is the right choice when your application makes LLM calls that don't need to resolve instantly. Common use cases include:
- Agentic workflows — Multi-step agent systems where individual steps can be processed asynchronously
- Background processing — Content generation, summarization, or classification running behind a queue
- Development and testing — Running evaluations or prompt iterations where you don't need instant feedback
- Cost optimization — Any workload that can tolerate a short asynchronous delay in exchange for lower cost
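As an illustration of the background-processing case, the sketch below queues several requests in background mode and then polls them as a group. It assumes the client.responses.create and client.responses.retrieve calls shown in the Quick Start below; the helper names submit_all and collect and the 2-second poll interval are illustrative choices, not part of any Doubleword API.

```python
import time
from typing import Any, Callable, Dict, List

# Non-terminal statuses, as polled for in the Quick Start
PENDING = ("queued", "in_progress")


def submit_all(client: Any, model: str, prompts: List[str]) -> List[str]:
    """Queue one background flex request per prompt and return the response IDs."""
    ids = []
    for prompt in prompts:
        resp = client.responses.create(
            model=model,
            input=prompt,
            service_tier="flex",
            background=True,
        )
        ids.append(resp.id)
    return ids


def collect(
    client: Any,
    ids: List[str],
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll every outstanding ID until each one leaves the pending statuses."""
    done: Dict[str, Any] = {}
    outstanding = list(ids)
    while outstanding:
        still_pending = []
        for rid in outstanding:
            resp = client.responses.retrieve(rid)
            if resp.status in PENDING:
                still_pending.append(rid)
            else:
                done[rid] = resp
        outstanding = still_pending
        if outstanding:
            sleep(poll_s)  # wait once per round, not once per request
    return done
```

With a real client this would be called as ids = submit_all(client, model, prompts) followed by results = collect(client, ids). The sketch has no overall timeout and treats any status other than "queued" or "in_progress" as finished, so callers should still check each response's status before reading output_text.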
Quick Start
1. Create an API Key
Generate a key from the Doubleword Console, then replace the {{apiKey}} and {{selectedModel.id}} placeholders in the examples below with your key and a model ID.
2. Submit a request with service_tier: "flex"
from openai import OpenAI
from time import sleep

client = OpenAI(
    base_url="https://api.doubleword.ai/v1",
    api_key="{{apiKey}}"
)

# Submit an async request; it returns immediately with status "queued"
resp = client.responses.create(
    model="{{selectedModel.id}}",
    input="Explain the theory of relativity in detail.",
    service_tier="flex",
    background=True,
)
print(f"Queued: {resp.id} (status: {resp.status})")

# Poll until the inference service completes it
while resp.status in ("queued", "in_progress"):
    sleep(2)
    resp = client.responses.retrieve(resp.id)
    print(f"Status: {resp.status}")

print(f"\nOutput:\n{resp.output_text}")

import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'https://api.doubleword.ai/v1',
  apiKey: '{{apiKey}}'
});

// Submit an async request
const resp = await client.responses.create({
  model: '{{selectedModel.id}}',
  input: 'Explain the theory of relativity in detail.',
  service_tier: 'flex',
  background: true,
});
console.log(`Queued: ${resp.id} (status: ${resp.status})`);

// Poll until complete
let result = resp;
while (['queued', 'in_progress'].includes(result.status)) {
  await new Promise(r => setTimeout(r, 2000));
  result = await client.responses.retrieve(result.id);
  console.log(`Status: ${result.status}`);
}

console.log(`\nOutput:\n${result.output_text}`);

Blocking mode
If you prefer to hold the connection until the result is ready, omit background. This is best for short waits — if a request may take minutes, prefer background mode to avoid connection timeouts.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.doubleword.ai/v1",
    api_key="{{apiKey}}"
)

# Blocks until the async request completes
resp = client.responses.create(
    model="{{selectedModel.id}}",
    input="Summarize the history of artificial intelligence.",
    service_tier="flex",
)
print(resp.output_text)

How It Works
- You submit a request with service_tier: "flex" via the Responses API
- Doubleword queues it for asynchronous processing
- The inference service picks the request up from the queue and processes it
- Results are available via GET /v1/responses/{id} or by polling
- Your code receives a standard Open Responses API response object
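The steps above can be sketched as a small polling helper with a time budget and exponential backoff, which avoids polling a slow request every two seconds indefinitely. This is a minimal sketch, not Doubleword's recommended implementation: the function name wait_for_response, the default timeout, and the delay values are all assumptions. The retrieve argument is any callable that takes a response ID and returns an object with a status attribute, such as client.responses.retrieve.

```python
import time
from typing import Any, Callable

# Non-terminal statuses, as polled for in the Quick Start
PENDING = ("queued", "in_progress")


def wait_for_response(
    retrieve: Callable[[str], Any],
    response_id: str,
    timeout_s: float = 600.0,    # assumed budget; tune to your workload
    first_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll retrieve(response_id) until the status leaves PENDING or the budget is spent."""
    waited = 0.0  # counts only time spent sleeping, not time spent in retrieve()
    delay = first_delay_s
    resp = retrieve(response_id)
    while resp.status in PENDING:
        if waited >= timeout_s:
            raise TimeoutError(f"{response_id} still {resp.status} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)  # back off: 1s, 2s, 4s, ... up to the cap
        resp = retrieve(response_id)
    return resp
```

With the client from the Quick Start, usage would look like resp = wait_for_response(client.responses.retrieve, resp.id). The helper returns as soon as the status is anything other than "queued" or "in_progress", so the caller should check resp.status before reading resp.output_text.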
Using Autobatcher
For existing Chat Completions code, the Autobatcher can automatically run your realtime calls asynchronously. The only change is importing the client from autobatcher and configuring it; the calls themselves stay the same.
import asyncio
from autobatcher import AsyncOpenAI

async def main():
    client = AsyncOpenAI(
        api_key="{{apiKey}}",
        base_url="https://api.doubleword.ai/v1"
    )
    # Looks like a normal OpenAI call, but runs asynchronously
    response = await client.chat.completions.create(
        model="{{selectedModel.id}}",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
    )
    print(response.choices[0].message.content)

# await is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())

Next Steps
- Realtime Inference: instant responses with service_tier: "priority"
- Batch Inference: lowest cost for bulk workloads
- Autobatcher reference: drop-in async for existing code