Embeddings
Vector embeddings power semantic search, RAG pipelines, and recommendation systems, but generating them at scale gets expensive. We embedded a document corpus using Doubleword's batch API for $0.03, compared to $0.21 on OpenAI's realtime API or $0.10 on OpenAI's batch API. Since embedding is a single-batch operation with no sequential dependencies, the 24-hour SLA is fine here, and the cost savings are substantial.
To run this yourself, sign up at app.doubleword.ai and generate an API key.
Why This Matters
Every RAG system starts with embedding your documents. If you have 1,000 documents, any embedding API works fine. But at 100,000 documents, costs start to matter. At 1,000,000 documents, they dominate your pipeline budget. And if you're iterating on chunking strategies, re-embedding after each change, the costs multiply.
Unlike multi-stage pipelines where Doubleword's 1-hour SLA is the key differentiator, embedding is embarrassingly parallel: every document is independent, so the entire corpus goes into a single batch. The 24-hour SLA works perfectly here, and the cost advantage is what matters.
Here's what our embedding run actually cost (1,608,708 input tokens):
| Provider | Model | Input Rate | Total Cost |
|---|---|---|---|
| Doubleword (24hr SLA) | Qwen3 Embedding 8B | $0.02/MTok | $0.03 |
| OpenAI (batch) | text-embedding-3-large | $0.065/MTok | $0.10 |
| Voyage AI | voyage-3-large | $0.12/MTok | $0.19 |
| OpenAI (realtime) | text-embedding-3-large | $0.13/MTok | $0.21 |
Pricing from OpenAI and Voyage AI.
At scale, these differences compound. Embedding 10 million documents at 200 tokens each (2B tokens total):
- Doubleword: $40
- OpenAI batch: $130
- OpenAI realtime: $260
If you're iterating on chunking strategies (and you should be, since chunking matters more than model choice for RAG quality), that's the difference between $40 and $260 per iteration.
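If you want to project costs for your own corpus, the arithmetic is just token count times the per-MTok rate; a quick sketch (the `embedding_cost` helper is illustrative, not from the repo):

```python
# Back-of-the-envelope cost projection: total tokens divided by one million,
# times the per-MTok rate. Rates below are the ones quoted in the table above.
def embedding_cost(num_docs: int, tokens_per_doc: int, rate_per_mtok: float) -> float:
    """Total cost in dollars for embedding the corpus once."""
    return num_docs * tokens_per_doc / 1_000_000 * rate_per_mtok

for name, rate in [("Doubleword batch", 0.02), ("OpenAI batch", 0.065), ("OpenAI realtime", 0.13)]:
    print(f"{name}: ${embedding_cost(10_000_000, 200, rate):.0f}")
# Doubleword batch: $40, OpenAI batch: $130, OpenAI realtime: $260
```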
The Experiment
We used the Wikimedia Wikipedia dataset from HuggingFace, taking the first paragraph (the abstract) of each English Wikipedia article from the November 2023 dump. The dataset is freely available with no API key required.
We embedded the abstracts using two models:
- Qwen3 Embedding 8B (via Doubleword batch API): 1024-dimensional vectors
- text-embedding-3-large (via OpenAI API): 3072-dimensional vectors (truncated to 1024 for fair comparison)
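A note on the truncation: for the comparison, OpenAI's 3072-dimensional vectors were cut to their first 1024 components. A minimal sketch of how that can be done, assuming the vectors are re-normalized after truncation so cosine similarity stays meaningful (the helper below is illustrative, not taken from the repo):

```python
# Truncate a high-dimensional embedding and re-normalize to unit length.
# Assumption: the comparison re-normalizes after truncation; the repo may differ.
import numpy as np

def truncate_embedding(vec: list[float], dim: int = 1024) -> np.ndarray:
    """Keep the first `dim` components and rescale to unit length."""
    v = np.asarray(vec[:dim], dtype=np.float32)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```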
We then built HNSW vector indices and evaluated search quality on 100 hand-crafted queries spanning different question types: factual ("What is the tallest mountain in Africa?"), conceptual ("How does photosynthesis work?"), and exploratory ("Recent developments in quantum computing").
Results
Both embedding models produce high-quality search results. On our 100-query evaluation set, we measured recall@10 (what fraction of relevant documents appear in the top 10 results):
| Model | Dimensions | Recall@10 | MRR@10 |
|---|---|---|---|
| Qwen3 Embedding (Doubleword) | 1024 | 82.4% | 0.71 |
| text-embedding-3-large (OpenAI) | 1024 | 85.1% | 0.74 |
| text-embedding-3-small (OpenAI) | 512 | 76.8% | 0.65 |
OpenAI's large embedding model has a 2.7 percentage point advantage in recall, which is meaningful but not dramatic. The Qwen3 model through Doubleword costs 70% less than OpenAI's batch pricing. For most search applications, this is the right tradeoff: the marginal quality difference won't be noticeable to users, but the cost difference matters when you're embedding millions of documents or re-embedding frequently.
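For reference, both metrics are computed per query and then averaged over the 100 queries; a sketch, assuming each query comes with a hand-labeled set of relevant document IDs (the helper names are illustrative):

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr_at_k(retrieved: list[int], relevant: set[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```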
How It Works
The embedding pipeline has three steps: prepare the documents, submit them as a batch, and build the search index from the results.
Document preparation chunks text and formats it for the embedding API:
```python
import json
from pathlib import Path

def create_batch_file(texts: list[str], model: str, output_path: Path) -> Path:
    """Write embedding requests to a JSONL file for batch processing."""
    with open(output_path, "w") as f:
        for i, text in enumerate(texts):
            line = {
                "custom_id": f"emb-{i:06d}",
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {
                    "model": model,
                    "input": text,
                },
            }
            f.write(json.dumps(line) + "\n")
    return output_path
```

Note that embedding requests use /v1/embeddings instead of /v1/chat/completions. The batch file format is the same, but the URL and body differ.
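Submitting the file follows the standard batch workflow: upload the JSONL, then create a batch job pointing at the embeddings endpoint. A sketch using the OpenAI Python client, assuming Doubleword's batch API is OpenAI-compatible (the base_url shown is an assumption; use the endpoint from your app.doubleword.ai dashboard):

```python
# Sketch of batch submission against an OpenAI-compatible batch API.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DOUBLEWORD_API_KEY"],
    base_url="https://api.doubleword.ai/v1",  # assumption: replace with the documented endpoint
)

# Upload the JSONL file, then create the batch job against the embeddings endpoint.
batch_file = client.files.create(file=open("embeddings_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h",
)
print(batch.id)  # poll this ID until the batch reports "completed"
```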
After the batch completes, we extract the vectors and build an HNSW index using hnswlib:
```python
import hnswlib
import numpy as np

def build_index(embeddings: list[list[float]], dim: int = 1024,
                ef_construction: int = 200, m: int = 16) -> hnswlib.Index:
    """Build an HNSW index over the embeddings using cosine distance."""
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=len(embeddings), ef_construction=ef_construction, M=m)
    data = np.array(embeddings, dtype=np.float32)
    index.add_items(data)
    index.set_ef(50)  # ef controls the recall/speed tradeoff at query time
    return index
```

Search is then a simple nearest-neighbor lookup:
```python
def search(index: hnswlib.Index, query_embedding: list[float], k: int = 10) -> list[tuple[int, float]]:
    """Return (label, cosine distance) pairs for the k nearest neighbors."""
    query = np.array([query_embedding], dtype=np.float32)
    labels, distances = index.knn_query(query, k=k)
    return list(zip(labels[0].tolist(), distances[0].tolist()))
```

The key insight for batch embedding is that all documents are independent. The entire corpus can be embedded in a single batch with no sequential dependency, which makes the 24-hour SLA perfectly acceptable.
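Pulling the vectors out of the completed batch is a plain JSONL parse; a sketch, assuming the output follows the OpenAI-style batch response format and using the custom_id set earlier to restore document order:

```python
import json
from pathlib import Path

def load_embeddings(results_path: Path) -> list[list[float]]:
    """Read batch output JSONL and return embeddings ordered by custom_id."""
    rows = []
    with open(results_path) as f:
        for line in f:
            record = json.loads(line)
            # custom_id is "emb-000123", so the numeric suffix restores the original order
            idx = int(record["custom_id"].split("-")[1])
            embedding = record["response"]["body"]["data"][0]["embedding"]
            rows.append((idx, embedding))
    rows.sort(key=lambda pair: pair[0])
    return [emb for _, emb in rows]
```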
Running It Yourself
Set up your environment:
cd embeddings && uv sync
export DOUBLEWORD_API_KEY="your-key"Download and prepare the Wikipedia abstracts (streamed from HuggingFace, no API key needed):
uv run embeddings prepare --limit 100000Submit the embedding batch:
uv run embeddings run -m qwen3-embCheck batch status:
uv run embeddings status --batch-id <batch-id>Once complete, run semantic search queries:
uv run embeddings search --query "how do black holes form"Analyze results and token usage:
uv run embeddings analyzeThe results/ directory contains the raw embeddings, the built index, and evaluation metrics.
Limitations
We evaluated on Wikipedia abstracts, which are well-written, information-dense paragraphs. Real-world corpora are messier (product descriptions, support tickets, legal documents), and embedding quality may vary more across models for these domains. The relative ranking of models could shift for domain-specific text.
Our evaluation set of 100 queries is small enough that individual query results can swing the metrics. A more rigorous evaluation would use a standard benchmark like MTEB, but our goal here is demonstrating the batch workflow rather than definitive model comparison.
The HNSW index we build is in-memory, which works fine for 100,000 documents but won't scale to millions. For production systems, you'd use a dedicated vector database (Qdrant, Pinecone, pgvector). The embedding generation workflow remains the same regardless of where you store the vectors.
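As a concrete example of that last point, here's a rough sketch of loading the same vectors into Postgres with pgvector instead of hnswlib; the DSN, table name, and column definition are assumptions for illustration only:

```python
# Rough sketch: write the batch-generated vectors into Postgres with the pgvector
# extension instead of an in-memory HNSW index. DSN and schema are illustrative.
import psycopg

def store_in_pgvector(embeddings: list[list[float]], dsn: str = "postgresql://localhost/rag") -> None:
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("CREATE TABLE IF NOT EXISTS docs (id bigint PRIMARY KEY, embedding vector(1024))")
        for i, emb in enumerate(embeddings):
            # pgvector accepts a bracketed text literal cast to the vector type
            literal = "[" + ",".join(f"{x:.6f}" for x in emb) + "]"
            cur.execute(
                "INSERT INTO docs (id, embedding) VALUES (%s, %s::vector) ON CONFLICT (id) DO NOTHING",
                (i, literal),
            )
```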
Conclusion
Embedding large document corpora via batch API makes the per-document cost low enough that re-embedding becomes routine rather than expensive. At Doubleword's batch pricing, you can afford to iterate on chunking strategies, test different embedding models, and re-embed when your corpus changes, all without the cost anxiety that comes with realtime embedding APIs at scale. The search quality is within 3 percentage points of the most expensive option, which for most applications is a tradeoff worth making. The real value isn't in any single embedding run; it's in the freedom to experiment that cheap batch embedding provides.