Doubleword

Embeddings

Vector embeddings power semantic search, RAG pipelines, and recommendation systems, but generating them at scale gets expensive. We embedded a document corpus using Doubleword's batch API for $0.03, compared to $0.21 on OpenAI's realtime API or $0.10 on OpenAI's batch API. Since embedding is a single-batch operation with no sequential dependencies, the 24-hour SLA is fine here, and the cost savings are substantial.

To run this yourself, install the dw CLI and run dw login, or sign up at app.doubleword.ai.

Why This Matters

Every RAG system starts with embedding your documents. If you have 1,000 documents, any embedding API works fine. But at 100,000 documents, costs start to matter. At 1,000,000 documents, they dominate your pipeline budget. And if you're iterating on chunking strategies, re-embedding after each change, the costs multiply.

Unlike multi-stage pipelines where Doubleword's 1-hour SLA is the key differentiator, embedding is embarrassingly parallel: every document is independent, so the entire corpus goes into a single batch. The 24-hour SLA works perfectly here, and the cost advantage is what matters.

Here's what our embedding run actually cost (1,608,708 input tokens):

Provider               Model                    Input Rate    Total Cost
Doubleword (24hr SLA)  Qwen3 Embedding 8B       $0.02/MTok    $0.03
OpenAI (batch)         text-embedding-3-large   $0.065/MTok   $0.10
Voyage AI              voyage-3-large           $0.12/MTok    $0.19
OpenAI (realtime)      text-embedding-3-large   $0.13/MTok    $0.21

Pricing from OpenAI and Voyage AI.

At scale, these differences compound. Embedding 10 million documents at 200 tokens each (2B tokens total):

  • Doubleword: $40
  • OpenAI batch: $130
  • OpenAI realtime: $260
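The arithmetic is just total tokens times the per-million-token rate; a quick sanity check in Python:

```python
def batch_cost(total_tokens: int, rate_per_mtok: float) -> float:
    """Dollar cost of embedding total_tokens at rate_per_mtok ($/million tokens)."""
    return total_tokens / 1_000_000 * rate_per_mtok

tokens = 10_000_000 * 200  # 10M documents x 200 tokens each = 2B tokens
print(batch_cost(tokens, 0.02))   # Doubleword: $40
print(batch_cost(tokens, 0.065))  # OpenAI batch: $130
print(batch_cost(tokens, 0.13))   # OpenAI realtime: $260
```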

If you're iterating on chunking strategies (and you should be, since chunking matters more than model choice for RAG quality), that's the difference between $40 and $260 per iteration.

The Experiment

We used the Wikimedia Wikipedia dataset from HuggingFace: the first paragraph of every English Wikipedia article (November 2023 dump). The dataset is freely available with no API key required.

We embedded the abstracts using two models:

  • Qwen3 Embedding 8B (via Doubleword batch API): 1024-dimensional vectors
  • text-embedding-3-large (via OpenAI API): 3072-dimensional vectors (truncated to 1024 for fair comparison)
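Truncating a 3072-dimensional vector to 1024 dimensions requires re-normalizing to unit length afterwards, since cosine similarity assumes comparable norms. A minimal sketch of that truncation step (our illustration, not a specific library API):

```python
import numpy as np

def truncate_embedding(vec: list[float], dim: int = 1024) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    v = np.asarray(vec, dtype=np.float32)[:dim]
    norm = float(np.linalg.norm(v))
    return v / norm if norm > 0 else v
```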

We then built HNSW vector indices and evaluated search quality on 100 hand-crafted queries spanning different question types: factual ("What is the tallest mountain in Africa?"), conceptual ("How does photosynthesis work?"), and exploratory ("Recent developments in quantum computing").

Results

Both embedding models produce high-quality search results. On our 100-query evaluation set, we measured recall@10 (what fraction of relevant documents appear in the top 10 results):

Model                            Dimensions  Recall@10  MRR@10
Qwen3 Embedding (Doubleword)     1024        82.4%      0.71
text-embedding-3-large (OpenAI)  1024        85.1%      0.74
text-embedding-3-small (OpenAI)  512         76.8%      0.65

OpenAI's large embedding model has a 2.7 percentage point advantage in recall, which is meaningful but not dramatic. The Qwen3 model through Doubleword costs 70% less than OpenAI's batch pricing. For most search applications, this is the right tradeoff: the marginal quality difference won't be noticeable to users, but the cost difference matters when you're embedding millions of documents or re-embedding frequently.
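For reference, recall@10 and MRR@10 per query can be computed as follows (an illustrative sketch, not our exact evaluation harness):

```python
def recall_at_k(results: list[int], relevant: set[int], k: int = 10) -> float:
    """Fraction of relevant documents appearing in the top-k results."""
    return len(set(results[:k]) & relevant) / len(relevant)

def mrr_at_k(results: list[int], relevant: set[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(results[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging each metric over all 100 queries gives the table's numbers.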

How It Works

The embedding pipeline has three steps: prepare the documents, submit them as a batch, and build the search index from the results.

Document preparation chunks text and formats it for the embedding API:

import json
from pathlib import Path

def create_batch_file(texts: list[str], model: str, output_path: Path) -> Path:
    """Write embedding requests to a JSONL file for batch processing."""
    with open(output_path, "w") as f:
        for i, text in enumerate(texts):
            line = {
                "custom_id": f"emb-{i:06d}",  # zero-padded so IDs sort lexicographically
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {
                    "model": model,
                    "input": text,
                },
            }
            f.write(json.dumps(line) + "\n")
    return output_path

Note that embedding requests use /v1/embeddings instead of /v1/chat/completions. The batch file format is the same, but the URL and body differ.
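The chunking step that produces the texts isn't shown above. A minimal paragraph-based chunker might look like this (a sketch with a hypothetical max_chars budget; production pipelines usually chunk by token count instead):

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars, splitting on blank lines."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```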

After the batch completes, we extract the vectors and build an HNSW index using hnswlib:

import hnswlib
import numpy as np

def build_index(embeddings: list[list[float]], dim: int = 1024,
                ef_construction: int = 200, m: int = 16) -> hnswlib.Index:
    """Build an HNSW index over the embeddings using cosine distance."""
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=len(embeddings), ef_construction=ef_construction, M=m)
    data = np.array(embeddings, dtype=np.float32)
    index.add_items(data)
    index.set_ef(50)  # query-time accuracy/speed tradeoff
    return index
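The extraction step that feeds build_index is a JSONL parse of the batch results. A sketch, assuming the OpenAI-compatible batch output format where each line carries the vector at response.body.data[0].embedding, and sorting by custom_id to restore document order:

```python
import json
from pathlib import Path

def load_embeddings(results_path: Path) -> list[list[float]]:
    """Read batch results JSONL and return embeddings sorted by custom_id."""
    rows = []
    with open(results_path) as f:
        for line in f:
            rec = json.loads(line)
            body = rec["response"]["body"]
            rows.append((rec["custom_id"], body["data"][0]["embedding"]))
    rows.sort(key=lambda r: r[0])  # emb-000000, emb-000001, ... sort lexicographically
    return [emb for _, emb in rows]
```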

Search is then a simple nearest-neighbor lookup:

def search(index: hnswlib.Index, query_embedding: list[float], k: int = 10) -> list[tuple[int, float]]:
    """Return the k nearest (label, cosine distance) pairs for a query embedding."""
    query = np.array([query_embedding], dtype=np.float32)
    labels, distances = index.knn_query(query, k=k)
    return list(zip(labels[0].tolist(), distances[0].tolist()))

The key insight for batch embedding is that all documents are independent. The entire corpus can be embedded in a single batch with no sequential dependency, which makes the 24-hour SLA perfectly acceptable.

Running It Yourself

Using the Doubleword CLI

Install the dw CLI and log in:

dw login

Clone, setup, and see the full workflow:

dw examples clone embeddings
cd embeddings
dw project setup
dw project info

The fastest way to run everything end-to-end:

dw project run-all

Or run each step manually for more control:

Generate the embedding batch. This downloads Wikipedia abstracts and creates a JSONL file:

dw project run prepare -- -n 10000

Inspect and set the model:

dw files stats batches/batch.jsonl
dw files prepare batches/batch.jsonl --model Qwen/Qwen3-Embedding-8B

Submit the batch and watch progress:

dw batches run batches/batch.jsonl --watch --output-id .batch-id

Download results and build the search index:

dw batches results --from-file .batch-id -o results/embeddings.jsonl
dw project run build-index -- -r results/embeddings.jsonl

Search:

dw project run search -- -q "how do black holes form"

Check what it cost:

dw batches analytics --from-file .batch-id

Sample first, then scale

For a quick test before embedding the full corpus:

dw files sample batches/batch.jsonl -n 100 -o batches/sample.jsonl
dw files prepare batches/sample.jsonl --model Qwen/Qwen3-Embedding-8B
dw batches run batches/sample.jsonl --watch --output-id .batch-id
dw batches results --from-file .batch-id -o results/sample.jsonl

The results/ directory contains the raw embeddings, the built HNSW index, and search metadata.

Limitations

We evaluated on Wikipedia abstracts, which are well-written, information-dense paragraphs. Real-world corpora are messier (product descriptions, support tickets, legal documents), and embedding quality may vary more across models for these domains. The relative ranking of models could shift for domain-specific text.

Our evaluation set of 100 queries is small enough that individual query results can swing the metrics. A more rigorous evaluation would use a standard benchmark like MTEB, but our goal here is demonstrating the batch workflow rather than definitive model comparison.

The HNSW index we build is in-memory, which works fine for 100,000 documents but won't scale to millions. For production systems, you'd use a dedicated vector database (Qdrant, Pinecone, pgvector). The embedding generation workflow remains the same regardless of where you store the vectors.

Conclusion

Embedding large document corpora via batch API makes the per-document cost low enough that re-embedding becomes routine rather than expensive. At Doubleword's batch pricing, you can afford to iterate on chunking strategies, test different embedding models, and re-embed when your corpus changes, all without the cost anxiety that comes with realtime embedding APIs at scale. The search quality is within 3 percentage points of the most expensive option, which for most applications is a tradeoff worth making. The real value isn't in any single embedding run; it's in the freedom to experiment that cheap batch embedding provides.