Supermemory
An AI agent with no memory meets every user as a stranger, but fixing that is incredibly compute-intensive. Every time a user chats, the memory layer must fire off heavy LLM calls to extract facts, update profiles, and consolidate history.
This extraction is purely background work. Nobody is waiting on it. Paying premium, real-time API rates for offline data extraction is money spent on latency you don't need.
By routing Supermemory's extraction engine through Doubleword's asynchronous infrastructure, you can run this heavy background workload for ~99% less. Real-time retrieval stays fast, but background extraction and bulk historical ingestion drop to pennies.
Note: This guide uses Supermemory's open-source, self-hosted server, running as a single local binary on localhost.
In a hurry? Jump to the configuration code.
Architecture: What Stays Local vs. What Goes External
Supermemory's open-source offering splits the workload efficiently:
- Storage and Search: Runs against an embedded vector database locally on disk.
- Embeddings: Computed locally by a quantized bge-base-en-v1.5, a lightweight CPU-only model. While you are bound to this specific embedder, it is a strong model with low overhead.
- Extraction: This is the one step that reaches out to an external API. While you could run an extraction model locally, capable reasoning models are slow on CPUs and expensive to host on dedicated GPUs.
Because extraction is an external background step, it slots perfectly onto Doubleword's less expensive async tier for immediate, massive cost savings.
The Cost of Scaling Supermemory: Real-Time vs. Async Tier
Storing a single conversation isn't a single LLM call; the engine makes several calls to extract facts and reconcile them against the user's profile. In a real run that we logged, each conversation required roughly 7,000 to 13,000 input tokens and 1,000 to 1,500 output tokens. If you are launching an agent and need to ingest 10,000 historical conversations, the cost gap between closed-source real-time models and Doubleword's async infrastructure is stark:
| Extraction LLM | Execution Tier | Cost per 10k Conversations |
|---|---|---|
| Anthropic Claude 3.5 Sonnet | Real-time | ~$370.00 |
| Doubleword (gpt-oss-20b) | Async (Flex) | ~$4.00 |
| Doubleword (gpt-oss-20b) | Batch (24h) | ~$3.00 |
Memory extraction is a mechanical job that produces short summary sentences. It doesn't require a frontier reasoning model. Capable open-source models on Doubleword are a drop-in replacement that runs the same workload for roughly 99% less.
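The arithmetic behind a table like this can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the $3 and $15 per million token prices are assumed placeholder values for a real-time model, not figures quoted in this guide. Substitute your provider's current rates.

```python
def extraction_cost_usd(conversations, input_tokens, output_tokens,
                        input_price_per_m, output_price_per_m):
    """Total USD cost of extracting memories from `conversations` chats."""
    tokens_cost = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return conversations * tokens_cost / 1_000_000


# ASSUMPTION: illustrative real-time prices in USD per million tokens.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00

# Token ranges per conversation come from the logged run described above.
low = extraction_cost_usd(10_000, 7_000, 1_000, INPUT_PRICE, OUTPUT_PRICE)
high = extraction_cost_usd(10_000, 13_000, 1_500, INPUT_PRICE, OUTPUT_PRICE)
print(f"10k conversations: ${low:,.2f} to ${high:,.2f}")
```

Under those assumed prices the range is $360 to $615, so the table's real-time figure sits inside it. Plugging in your async-tier prices gives the corresponding Flex and Batch estimates.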
Prerequisites
- Python 3.10 or newer.
- Install dependencies (httpx is imported by the gateway script below):
  pip install supermemory python-dotenv openai httpx
- Node (for npx), to run the Supermemory server.
- A Doubleword API key. Create one at app.doubleword.ai/api-keys.
Copy .env.example to .env and fill in DOUBLEWORD_API_KEY. (The SUPERMEMORY_API_KEY is printed when the server boots).
Pointing Supermemory at Doubleword
Supermemory configures its LLM with three standard environment variables: OPENAI_BASE_URL, OPENAI_API_KEY, OPENAI_MODEL. Pre-setting them also skips the first-boot setup wizard. Because Supermemory's extraction calls /chat/completions with function-calling, you should use a tools-capable Doubleword model like openai/gpt-oss-20b.
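Since a missing variable changes the server's first-boot behavior, a small pre-flight check can help before launching. The sketch below is illustrative only: the helper name `missing_llm_vars` is hypothetical and not part of Supermemory. It only checks that the three variables named above are present and non-blank.

```python
import os

REQUIRED_LLM_VARS = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL")


def missing_llm_vars(env=None):
    """Return the names of required LLM variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_LLM_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example = {
    "OPENAI_BASE_URL": "http://localhost:8088/v1",
    "OPENAI_MODEL": "openai/gpt-oss-20b",
}
print(missing_llm_vars(example))  # ['OPENAI_API_KEY']
```

Calling `missing_llm_vars()` with no argument inspects the current process environment.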
Step 1: The API Gateway Pattern (Async Flex Tier)
Flex is Doubleword's less expensive background tier. Supermemory talks plain /chat/completions, so we will deploy a lightweight local service in front of Doubleword, namely an API Gateway that stamps service_tier="flex" onto each request before forwarding it. (Note: This gateway is also the natural architectural place to add routing, retries, or logging as your app scales).
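The core of that gateway is a single body transformation. Isolated as a pure function, it might look like the sketch below. The name `stamp_flex` is hypothetical, and the extra guard for non-object JSON is an addition for illustration.

```python
import json


def stamp_flex(path, body):
    """Return `body` with service_tier="flex" added for chat/completions calls."""
    if not (path.endswith("/chat/completions") and body):
        return body  # other endpoints and empty bodies pass through untouched
    try:
        payload = json.loads(body)
    except ValueError:
        return body  # not JSON: forward as-is
    if not isinstance(payload, dict):
        return body  # JSON, but not an object we can tag
    payload.setdefault("service_tier", "flex")  # an explicit tier from the caller wins
    return json.dumps(payload).encode()


request = json.dumps({"model": "openai/gpt-oss-20b", "messages": []}).encode()
print(json.loads(stamp_flex("/v1/chat/completions", request))["service_tier"])  # flex
```

Using `setdefault` rather than plain assignment means a caller that already sets `service_tier` keeps its own value, which leaves room for per-request routing later.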
Save this as doubleword_gateway.py:
# doubleword_gateway.py
import json
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import httpx
from dotenv import load_dotenv
load_dotenv()
DW = os.environ.get("DOUBLEWORD_BASE_URL", "https://api.doubleword.ai/v1").rstrip("/")
KEY = os.environ["DOUBLEWORD_API_KEY"]
PORT = int(os.environ.get("GATEWAY_PORT", "8088"))
class Handler(BaseHTTPRequestHandler):
    def _proxy(self, method):
        n = int(self.headers.get("content-length", 0))
        body = self.rfile.read(n) if n else b""
        # Add the flex tier to chat/completions requests; pass everything else through.
        if method == "POST" and self.path.endswith("/chat/completions") and body:
            try:
                payload = json.loads(body)
                payload.setdefault("service_tier", "flex")
                body = json.dumps(payload).encode()
            except ValueError:
                pass
        url = DW + self.path[len("/v1"):] if self.path.startswith("/v1") else DW + self.path
        r = httpx.request(method, url, content=body, timeout=900,
                          headers={"Authorization": f"Bearer {KEY}",
                                   "Content-Type": "application/json"})
        self.send_response(r.status_code)
        self.send_header("Content-Type", r.headers.get("content-type", "application/json"))
        self.end_headers()
        self.wfile.write(r.content)
    def do_POST(self):
        self._proxy("POST")
    def do_GET(self):
        self._proxy("GET")
    def log_message(self, *args):
        pass
if __name__ == "__main__":
    print(f"Doubleword Gateway on http://localhost:{PORT} -> {DW} (service_tier=flex)")
    ThreadingHTTPServer(("127.0.0.1", PORT), Handler).serve_forever()
Step 2: Executing the Memory Pipeline
Run the gateway in the background, then start the Supermemory server, pointing it at the new local gateway instead of directly at Doubleword:
# Start the gateway
python doubleword_gateway.py & # Runs on http://localhost:8088
# Start Supermemory routed through the gateway
export OPENAI_BASE_URL="http://localhost:8088/v1"
export OPENAI_API_KEY="$DOUBLEWORD_API_KEY"
export OPENAI_MODEL="openai/gpt-oss-20b"
npx supermemory local
The server will print its URL (http://localhost:6767) and an API key starting with sm_. Put that key in your .env as SUPERMEMORY_API_KEY. Now, add and search a memory using a standard client script:
# demo.py
import os
import time
from dotenv import load_dotenv
from supermemory import Supermemory
load_dotenv()
client = Supermemory(
    api_key=os.environ["SUPERMEMORY_API_KEY"],
    base_url=os.environ.get("SUPERMEMORY_BASE_URL", "http://localhost:6767"),
)
USER = "user_123"
# Add a memory. Extraction runs via the gateway on Doubleword's Flex tier.
client.add(
    content="Hi, I'm Alex. I love basketball and live in Tokyo.",
    container_tags=[USER],
    dreaming="instant",
)
print("Memory added; waiting for background extraction...")
# Extraction is asynchronous, so poll search until facts appear.
for _ in range(20):
    time.sleep(3)
    results = client.search.memories(q="what does he like and where does he live?",
                                     container_tag=USER).results or []
    if results:
        break
print("\nExtracted memories:")
for m in results:
    print(f" - {m.memory}")
Run python demo.py. Supermemory extracts the facts via Doubleword's Flex tier, dropping your inference costs by ~99%. Session-based chat contexts persist instantly on the hot path, while the heavy extraction logic happens cheaply in the background.
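If you poll for extracted memories in more than one place, the loop in demo.py can be factored into a small reusable helper. This is a sketch: `poll_until` is a hypothetical name, and the injectable `sleep` parameter exists only so the helper can be exercised without real delays. Unlike demo.py, it checks once before the first wait.

```python
import time


def poll_until(fetch, attempts=20, delay=3.0, sleep=time.sleep):
    """Call `fetch` until it returns a non-empty result or attempts run out."""
    for _ in range(attempts):
        results = fetch()
        if results:
            return results
        sleep(delay)
    return []  # nothing appeared within attempts * delay seconds


# Simulated search that finds a memory on the third try.
calls = iter([[], [], ["Alex lives in Tokyo"]])
print(poll_until(lambda: next(calls), sleep=lambda _seconds: None))  # ['Alex lives in Tokyo']
```

With the client from demo.py, `fetch` could be `lambda: client.search.memories(q="...", container_tag=USER).results`. Because Flex queues work, a generous `attempts * delay` budget is a safer default than a tight one.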
Solving the Cold-Start Problem: Batch Backfilling
If you are launching a memory-enabled agent today, you likely already have thousands of historical Zendesk tickets, Slack logs, or CRM notes. Processing that massive archive into a long-term user profile is the perfect use case for a 24-hour batch job.
Supermemory expects each extraction request to return its result in the same /chat/completions call, so it cannot drive a 24h batch natively. To work around this, run the bulk extraction directly against Doubleword's Batch API (our lowest-cost tier), then load the clean JSON output directly into Supermemory's database.
Use this script to submit the batch job:
# batch.py
import json
import os
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI(api_key=os.environ["DOUBLEWORD_API_KEY"],
base_url=os.environ.get("DOUBLEWORD_BASE_URL", "https://api.doubleword.ai/v1"))
MODEL = os.environ.get("DW_MODEL", "openai/gpt-oss-20b")
INPUT_FILE = "batch_input.jsonl"
SYSTEM = "Extract durable facts about the user as a JSON array of short strings."
documents = [
"I'm migrating our backend from Node to Go. We use AWS.",
"I hate dark mode, please keep my dashboard light.",
"My Q3 campaign budget is capped at $50k.",
]
# Package for Batch API: one JSON request per line
with open(INPUT_FILE, "w") as f:
    for i, text in enumerate(documents):
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": MODEL,
                "messages": [{"role": "system", "content": SYSTEM},
                             {"role": "user", "content": text}],
            },
        }) + "\n")
# Submit Batch Job
with open(INPUT_FILE, "rb") as fh:
    batch_file = client.files.create(file=fh, purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"description": "supermemory bulk pre-extraction"},
)
print(f"Created batch {batch.id} -> {batch.status}")
Run python batch.py. You can monitor the job in the Doubleword console. Once complete, pull the results and inject the raw facts directly into Supermemory.
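The result-handling step could be sketched as follows. It assumes the batch output file follows the OpenAI-style JSONL layout (one record per line with `custom_id`, `response.status_code`, `response.body`, and `error`), which this guide does not confirm for Doubleword. The function name `parse_batch_output` is hypothetical.

```python
import json


def parse_batch_output(jsonl_text):
    """Map each custom_id to the list of facts the model returned for it."""
    facts_by_id = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            continue  # skip failed requests; retry them separately
        content = response["body"]["choices"][0]["message"]["content"]
        try:
            facts = json.loads(content)
        except (TypeError, ValueError):
            facts = None
        if not isinstance(facts, list):
            facts = [content] if content else []  # fall back to the raw text
        facts_by_id[record["custom_id"]] = [str(fact) for fact in facts]
    return facts_by_id


# ASSUMPTION: a hand-written record in the OpenAI-style batch output shape.
sample = json.dumps({
    "custom_id": "doc-1",
    "response": {"status_code": 200, "body": {"choices": [
        {"message": {"content": '["Prefers light mode"]'}}]}},
    "error": None,
})
print(parse_batch_output(sample))  # {'doc-1': ['Prefers light mode']}
```

With the OpenAI client from batch.py, `client.batches.retrieve(batch_id)` exposes `output_file_id` once the job has completed, and `client.files.content(output_file_id).text` returns the JSONL to pass in. How the parsed facts are then written into Supermemory depends on the loader you choose. The `client.add(content=..., container_tags=[...])` call from demo.py is one candidate, though it may route the facts through extraction again.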
Limitations & Considerations
- Flex Latency: The Flex tier queues work, so extracted facts appear after a short delay rather than instantly. Keep your generation on real-time; use Flex strictly for background extraction.
- Billing Visibility: Doubleword's API doesn't echo the applied tier in the standard OpenAI-compatible response, so verify your Flex billing savings directly in the Doubleword console.
- Local Embeddings Only: Supermemory's self-hosted version hard-codes embeddings to the local bge-base model. The OPENAI_* environment variables currently only route the heavy LLM extraction workload to Doubleword.
Ready to scale your agent's memory for cents on the dollar? Grab a Doubleword API key and integrate it today.