mem0
Adding long-term memory to your AI agents fundamentally changes the user experience. Instead of starting from scratch every session, memory-enabled agents recall past preferences, ongoing projects, and specific user quirks.
The leading framework for this is Mem0. However, building and maintaining that memory layer is incredibly compute-intensive. Mem0 relies on an LLM to extract facts from unstructured text, deduplicate overlapping concepts, and synthesize historical context into a clean user profile.
If you are just launching a memory-enabled agent, you likely have thousands of historical customer support logs or chat transcripts you want to ingest so user profiles aren't empty on day one. Running that bulk extraction workload synchronously against a premium real-time endpoint (like GPT-4o or Claude 3.5 Sonnet) introduces two massive bottlenecks:
- Crippling Rate Limits: Firing thousands of heavy extraction prompts simultaneously will inevitably trigger 429 Too Many Requests errors.
- Burned Inference Budget: You are paying a massive premium for real-time latency on a background data-processing task nobody is waiting on.
The Solution: By routing Mem0's extraction engine through Doubleword's asynchronous Flex tier, you can process your entire historical backlog for about half the cost. Furthermore, Doubleword's Flex tier uses native server-side queuing, meaning it absorbs the concurrency and you don't have to write a single line of backoff or retry logic.
Agent Memory Architecture: Splitting the Hot and Cold Paths
The secret to cost-effective agent memory is decoupling your read and write operations:
- The Read (Hot Path): When a user sends a message, your agent needs to retrieve relevant memories instantly to formulate a response. This stays fast and real-time.
- The Write & Consolidate (Cold Path): Extracting facts from the user's message, updating their profile, and running nightly deduplication sweeps are offline tasks. The user is not waiting for this to finish. This belongs on Doubleword's Flex tier.
This guide covers the cold path: bulk-ingesting your history safely and cheaply.
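The hot/cold split described above can be sketched in a few lines of asyncio. This is a minimal illustration, not Mem0's API: the in-memory MEMORIES dict, search_memories, extract_and_store, and cold_worker are all hypothetical stand-ins, and the sleep is a placeholder for a slow extraction call.

```python
import asyncio

# Hypothetical in-memory stand-in for a memory store; not Mem0's API.
MEMORIES: dict[str, list[str]] = {}


async def search_memories(user_id: str) -> list[str]:
    """Hot path: a fast read the user waits on."""
    return list(MEMORIES.get(user_id, []))


async def extract_and_store(user_id: str, text: str) -> None:
    """Cold path: slow extraction (simulated) that nobody waits on."""
    await asyncio.sleep(0.01)  # placeholder for an LLM extraction call
    MEMORIES.setdefault(user_id, []).append(text)


async def cold_worker(queue: asyncio.Queue) -> None:
    """Drain queued writes in the background."""
    while True:
        user_id, text = await queue.get()
        try:
            await extract_and_store(user_id, text)
        finally:
            queue.task_done()


async def handle_message(queue: asyncio.Queue, user_id: str, text: str) -> list[str]:
    """Read memories now, defer the write, and return immediately."""
    recalled = await search_memories(user_id)
    queue.put_nowait((user_id, text))
    return recalled


async def demo() -> tuple[list[str], list[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(cold_worker(queue))
    first = await handle_message(queue, "usr_123", "I prefer light mode.")
    await queue.join()  # only the demo waits; a real agent would not
    second = await handle_message(queue, "usr_123", "We deploy on AWS.")
    await queue.join()
    worker.cancel()
    return first, second
```

Running asyncio.run(demo()) returns an empty recall for the first message and the first message's text for the second, because the write only lands after the reply path has already returned. In a real deployment the queue would more likely be a durable job queue than an in-process one.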
The Cost of Scaling Mem0: Real-Time vs. Flex Tier
We measured a real Mem0 extraction running on Doubleword, then extrapolated to 10,000 transcripts. Two things drive the cost: Mem0's extraction prompt is large (about 8,000 input tokens per transcript), and DeepSeek-V4-Pro is a reasoning model that returns roughly 5,400 output tokens per extraction.
| Extraction LLM | Execution Tier | Cost per 10k Transcripts | Rate Limit Risk |
|---|---|---|---|
| Anthropic Claude 3.5 Sonnet | Real-time Synchronous | ~$345.00 | High (requires backoff) |
| OpenAI GPT-4o | Real-time Synchronous | ~$270.00 | High (requires backoff) |
| Doubleword (DeepSeek-V4-Pro) | Async (Flex) | ~$127.00 | None (Native queuing) |
On the Flex tier, DeepSeek-V4-Pro runs the exact same extraction for less than half the cost of GPT-4o and roughly a third of Claude 3.5 Sonnet, with superior extraction quality.
(Note: Need it even cheaper? Swap the model to DeepSeek-V4-Flash. It extracted the same transcript cleanly using fewer tokens and runs at about $7 per 10,000 transcripts, trading some nuance for a 97% cost reduction vs GPT-4o.)
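To sanity-check figures like these against your own workload, the arithmetic is simple enough to script. The sketch below is illustrative only: the per-million-token prices and the 700-token output length for the non-reasoning models are assumptions, not measurements from this guide. With those assumed inputs the result happens to land on the table's two real-time rows.

```python
def cost_per_batch(transcripts, input_tokens, output_tokens,
                   usd_per_m_in, usd_per_m_out):
    """Estimated USD cost of extracting `transcripts` documents."""
    total_in_m = transcripts * input_tokens / 1_000_000    # millions of input tokens
    total_out_m = transcripts * output_tokens / 1_000_000  # millions of output tokens
    return round(total_in_m * usd_per_m_in + total_out_m * usd_per_m_out, 2)


# Assumed prices (USD per million tokens) and an assumed 700 output
# tokens per extraction for non-reasoning models. Check current pricing.
gpt4o = cost_per_batch(10_000, 8_000, 700, usd_per_m_in=2.50, usd_per_m_out=10.00)
sonnet = cost_per_batch(10_000, 8_000, 700, usd_per_m_in=3.00, usd_per_m_out=15.00)
print(gpt4o, sonnet)
```

Swap in your own average transcript size and your provider's current prices; the input side dominates for short outputs, while a reasoning model's longer outputs shift the balance toward the output price.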
Prerequisites
- Python 3.10 or newer.
- The uv package manager.
- A Doubleword API key. Create one at app.doubleword.ai/api-keys.
Export it so the script can read it:
export DOUBLEWORD_API_KEY="sk-..."
(New accounts get free credits when you sign up, so the run below should cost you almost nothing to try.)
How to Configure Mem0 for Doubleword's Async Flex Tier
Mem0's standard extraction calls the chat-style LLM interface, but Doubleword's massive cost savings live on the Responses API (responses.create(..., service_tier="flex")).
Because Mem0 currently uses a strict internal list of allowed providers, we don't rewrite the core library. Instead, we add a lightweight adapter that translates Mem0's extraction call into a Flex Responses call, registering it under the standard openai provider name.
Step 1: Install dependencies
uv pip install mem0ai openai qdrant-client python-dotenv
Step 2: The Doubleword Flex Adapter
Save this as doubleword_mem0.py. It subclasses Mem0's OpenAI LLM so extraction runs on the flex tier.
# doubleword_mem0.py
from mem0.llms.openai import OpenAILLM
from mem0.utils.factory import LlmFactory

FLEX_TIMEOUT = 900.0      # flex queues work; block, don't poll
MAX_OUTPUT_TOKENS = 8192  # headroom for reasoning tokens


class FlexResponsesLLM(OpenAILLM):
    """Runs Mem0's fact extraction on Doubleword's flex tier (Responses API)."""

    def __init__(self, config=None):
        super().__init__(config)
        # Reuse the configured client, but allow long waits while requests are queued.
        self.client = self.client.with_options(timeout=FLEX_TIMEOUT)

    def generate_response(self, messages, response_format=None, tools=None,
                          tool_choice="auto", **kwargs):
        # response_format, tools, and tool_choice are accepted to match Mem0's
        # call signature but are not forwarded to the Responses API.
        # Flatten the chat messages into a single prompt string.
        prompt = "\n\n".join(f"{m['role']}: {m['content']}" for m in messages)
        resp = self.client.responses.create(
            model=self.config.model,
            input=prompt,
            service_tier="flex",
            max_output_tokens=MAX_OUTPUT_TOKENS,
        )
        return resp.output_text


def use_flex():
    """Swap Mem0's 'openai' LLM provider to the flex tier (this process)."""
    # Keep the existing config class; replace only the LLM class path.
    _, config_cls = LlmFactory.provider_to_class["openai"]
    LlmFactory.provider_to_class["openai"] = ("doubleword_mem0.FlexResponsesLLM", config_cls)
Step 3: Create the Ingestion Script
Save this as ingest.py, next to doubleword_mem0.py. We point the LLM and embedder at Doubleword's endpoint, call use_flex() so extraction runs on the Flex tier, and fan out the writes with asyncio.gather. A semaphore caps in-flight requests; Doubleword natively queues the rest.
# ingest.py
import os
import asyncio

from mem0 import AsyncMemory
from doubleword_mem0 import use_flex

os.environ["OPENAI_API_KEY"] = os.environ["DOUBLEWORD_API_KEY"]  # Mem0 reads OPENAI_API_KEY
use_flex()  # extraction -> Doubleword flex tier

config = {
    "llm": {
        "provider": "openai",
        "config": {
            "openai_base_url": "https://api.doubleword.ai/v1",
            "model": "deepseek-ai/DeepSeek-V4-Pro",
            "max_tokens": 8192,  # reasoning models think first, give headroom
        },
    },
    "embedder": {  # embeddings run realtime (realtime + batch available; no flex)
        "provider": "openai",
        "config": {
            "openai_base_url": "https://api.doubleword.ai/v1",
            "model": "Qwen/Qwen3-Embedding-8B",
        },
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "path": "/tmp/qdrant_agent_memory",
            "collection_name": "agent_memory",
            "embedding_model_dims": 4096,  # Qwen3-Embedding-8B is 4096-dim
        },
    },
}

m = AsyncMemory.from_config(config)

# Your historical data (load from a CSV, database, etc.)
historical_logs = [
    {"user_id": "usr_123", "text": "I'm migrating our backend from Node to Go. We use AWS."},
    {"user_id": "usr_123", "text": "I hate dark mode, please ensure my dashboard stays light."},
    {"user_id": "usr_456", "text": "My budget for the Q3 campaign is strictly capped at $50k."},
    # ... thousands of historical rows ...
]

CONCURRENCY = 32  # in-flight extractions; Doubleword queues the rest
sem = asyncio.Semaphore(CONCURRENCY)


async def ingest(log):
    async with sem:
        await m.add([{"role": "user", "content": log["text"]}], user_id=log["user_id"])


async def main():
    print(f"Queueing {len(historical_logs)} logs for memory extraction via Doubleword...")
    await asyncio.gather(*(ingest(log) for log in historical_logs))
    print("Bulk memory extraction complete!\n")
    stored = await m.get_all(filters={"user_id": "usr_123"})
    for memory in stored["results"]:
        print(f"Stored Fact: {memory['memory']}")


if __name__ == "__main__":
    asyncio.run(main())
Executing the Async Memory Pipeline
Run the ingestion script from your terminal:
python ingest.py
You will see the facts Mem0 extracted and stored, each produced reliably on Doubleword's Flex tier without a single rate-limit failure. Output looks something like this (because the model intelligently synthesizes and consolidates facts, exact wording varies):
Queueing 3 logs for memory extraction via Doubleword...
Bulk memory extraction complete!
Stored Fact: User is migrating their backend from Node.js to Go and uses AWS
Stored Fact: User prefers light mode and dislikes dark mode for their dashboard
Next Steps
- Nightly Consolidation: As users interact with your agent, memory fragments inevitably pile up. Run a nightly cron job with this exact config to consolidate overlapping memories during off-peak hours, relying on the Flex tier rather than burning your real-time budget.
- Scale Your Graph: Because your extraction costs are now a fraction of a cent per log, you no longer need to strictly filter which conversations get vectorized. You can afford to pass your entire organizational chat history into Mem0.
- Pure Offline Batching: For one-shot transforms that don't need Mem0's synchronous add/update logic at all, Doubleword's traditional Batch API (24h SLA) is even cheaper. Pre-extract facts via batch, then load the clean JSON directly into your store.
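For the pure offline batching route, the first step is usually turning your logs into a batch input file. The sketch below builds JSONL lines in the OpenAI-style batch input shape (custom_id, method, url, body). Whether Doubleword's Batch API accepts exactly this shape is an assumption to verify against its documentation, and EXTRACTION_PROMPT is a hypothetical placeholder rather than Mem0's actual extraction prompt.

```python
import json

# Hypothetical placeholder; Mem0's real extraction prompt is much longer.
EXTRACTION_PROMPT = "Extract durable facts about the user as a JSON list of strings."


def build_batch_lines(logs, model):
    """Return one JSON string per log, in OpenAI-style batch input format."""
    lines = []
    for i, log in enumerate(logs):
        request = {
            "custom_id": f"{log['user_id']}-{i}",  # lets you match results to rows
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": EXTRACTION_PROMPT},
                    {"role": "user", "content": log["text"]},
                ],
            },
        }
        lines.append(json.dumps(request))
    return lines


sample_logs = [
    {"user_id": "usr_123", "text": "I'm migrating our backend from Node to Go. We use AWS."},
    {"user_id": "usr_456", "text": "My budget for the Q3 campaign is strictly capped at $50k."},
]
batch_lines = build_batch_lines(sample_logs, "deepseek-ai/DeepSeek-V4-Pro")
```

Writing "\n".join(batch_lines) to a .jsonl file gives you something to upload. The custom_id is what lets you join the returned facts back to a user before loading them into your store.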
Ready to scale your agent's memory? Grab a Doubleword API key and drop it into your infrastructure today.