Doubleword

Async Inference

Async inference lets you use the familiar OpenAI-compatible API to make LLM requests that are automatically routed from real-time to high-priority asynchronous processing. The result is significant cost savings with minimal changes to your workflow.

This is powered by the Autobatcher — a Python client that collects your individual API calls and submits them as optimized async requests behind the scenes.

Why Async Inference?

  • OpenAI-compatible — Uses the same openai SDK and API format you already know
  • Drop-in cost savings — Switch your base URL and API key; your existing code works as-is
  • Priority processing — Requests use a 1-hour SLA, balancing cost and speed
  • No JSONL files — Unlike batch inference, you don't need to prepare input files

When to use Async Inference

Async inference is the right choice when your application makes LLM calls that don't need to resolve in real-time. Common use cases include:

  • Agentic workflows — Multi-step agent systems where individual steps can be processed asynchronously
  • Background processing — Content generation, summarization, or classification that runs behind a queue
  • Development and testing — Running evaluations or prompt iterations where you don't need instant feedback
  • Cost optimization — Any existing OpenAI integration where you want to reduce spend without refactoring

Quick Start

1. Install the Autobatcher

pip install autobatcher

2. Create an API Key

Generate an API key from the Doubleword Console.

3. Use it like OpenAI

Python

import asyncio

from autobatcher import AsyncOpenAI

client = AsyncOpenAI(api_key="{{apiKey}}")

async def main():
    response = await client.chat.completions.create(
        model="{{selectedModel.id}}",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())

JavaScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.doubleword.ai/v1',
  apiKey: '{{apiKey}}'
});

const response = await client.chat.completions.create({
  model: '{{selectedModel.id}}',
  messages: [
    { role: 'user', content: 'Explain quantum computing' }
  ]
});

console.log(response.choices[0].message.content);

cURL

curl https://api.doubleword.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer {{apiKey}}" \
  -d '{
    "model": "{{selectedModel.id}}",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'

The Autobatcher automatically collects requests and submits them in optimized batches. Your code receives standard ChatCompletion responses — no changes needed to downstream logic.
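Because the Autobatcher batches whatever is in flight during its collection window, you get the most benefit by issuing requests concurrently rather than one at a time. Here is a minimal sketch of that fan-out pattern; the `ask` helper below is an illustrative stand-in for your real `client.chat.completions.create` call, not part of the Autobatcher API:

```python
import asyncio

# Illustrative stand-in for an Autobatcher-backed call such as
# client.chat.completions.create(...); replace with your real request.
async def ask(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"answer to: {prompt}"

async def main() -> list[str]:
    prompts = [
        "Explain quantum computing",
        "Summarize this article",
        "Classify this ticket",
    ]
    # Launch all requests at once so they land in the same collection
    # window and can be submitted together as one async batch.
    return await asyncio.gather(*(ask(p) for p in prompts))

results = asyncio.run(main())
print(results[0])
```

Sequential `await`s would put each request in its own window; `asyncio.gather` (or equivalent concurrency) is what lets the Autobatcher do its job.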

How It Works

  1. You make API calls using the familiar OpenAI interface
  2. The Autobatcher collects requests over a short time window (default: 1 second)
  3. Collected requests are submitted as a high-priority async batch
  4. Results are polled and returned to your waiting callers as they complete
  5. Your code receives standard ChatCompletion responses
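The collection step above can be pictured as a small buffer that holds requests for a short window, then flushes them all at once and resolves each waiting caller. The following is an illustrative toy sketch of that idea, not the Autobatcher's actual implementation:

```python
import asyncio

class WindowBatcher:
    """Toy batcher: buffers requests for `window` seconds, then
    resolves them all at once, mimicking steps 2-4 above."""

    def __init__(self, window: float = 0.05):
        self.window = window
        self._pending: list[tuple[str, asyncio.Future]] = []
        self._flush_task: asyncio.Task | None = None

    async def submit(self, prompt: str) -> str:
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        self._pending.append((prompt, fut))
        if self._flush_task is None:  # first request opens the window
            self._flush_task = asyncio.create_task(self._flush_later())
        return await fut  # each caller waits only for its own result

    async def _flush_later(self):
        await asyncio.sleep(self.window)           # collection window
        batch, self._pending = self._pending, []
        self._flush_task = None
        # One "batch call" stands in for the high-priority async submission.
        answers = [f"answer to: {p}" for p, _ in batch]
        for (_, fut), ans in zip(batch, answers):
            fut.set_result(ans)                    # wake the waiting caller

async def demo():
    b = WindowBatcher()
    return await asyncio.gather(b.submit("a"), b.submit("b"), b.submit("c"))

results = asyncio.run(demo())
print(results)
```

The key design point is that callers keep the one-request, one-response interface while the batcher amortizes submission cost across everything that arrived in the same window.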

For full configuration options, see the Autobatcher reference.