
14 posts tagged with "ai-infrastructure"


Why Batch Inference Matters: Moving from AI Assistants to Autonomous Agents

· 5 min read
Amanda Milberg
Principal Solutions Engineer, Doubleword

The initial wave of Generative AI adoption focused on augmenting human work - chatbots that help developers write cleaner code, assistants that polish our emails, and tools that speed up content creation. These productivity enhancements have proven their value many times over; for many knowledge workers, a ChatGPT-style assistant is open throughout the working day. But they represent just the beginning of what's possible with AI.

The next evolution isn't about making humans faster - it's about trusting AI to work independently. As organizations move from interactive chatbots to autonomous agents, the shift requires a fundamental change in mindset. Can we let an LLM handle entire workflows without constant human supervision? The most significant business transformation happens when AI becomes a trusted teammate that takes ownership of repetitive, high-volume tasks and delivers results on its own timeline.

Understanding Batch Workloads

Not every AI task needs an answer in milliseconds. There's an entire category of work where immediacy isn't the priority - processing happens in the background, often overnight or on a schedule, with results delivered when they are needed rather than instantly. These are what we call batch workloads: "fire-and-forget" AI tasks where no one is waiting at their screen, no human is in the loop requiring immediate feedback, and the work completes on its own timeline without the constraints of real-time interaction.

Common examples include:

  • Nightly content moderation sweeps across millions of user posts, flagging policy violations for human review the next morning
  • Daily research literature analysis that processes hundreds of new papers, extracting key findings and relevance scores for researchers
  • Weekly customer feedback analysis that categorizes and summarizes thousands of support tickets, identifying emerging issues and sentiment trends
  • Monthly document processing for compliance teams, where contracts or reports are analyzed for specific clauses, risks, or anomalies
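To make the fire-and-forget pattern concrete, here is a minimal sketch of the first example as an overnight job. The endpoint, model name, file names, and prompt are illustrative assumptions, not a specific Doubleword API; any OpenAI-compatible chat endpoint would look similar.

```python
# Minimal sketch of a "fire-and-forget" nightly moderation sweep.
# The endpoint, model name, prompt, and file names are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://example-inference.internal/v1", api_key="...")

with open("posts.jsonl") as src, open("flags.jsonl", "w") as dst:
    for line in src:
        post = json.loads(line)
        resp = client.chat.completions.create(
            model="moderation-llm",  # placeholder model name
            messages=[
                {"role": "system", "content": "Answer VIOLATION or OK for the content policy."},
                {"role": "user", "content": post["text"]},
            ],
        )
        dst.write(json.dumps({"id": post["id"],
                              "verdict": resp.choices[0].message.content}) + "\n")
# Run from a scheduler such as cron overnight; reviewers pick up flags.jsonl in the morning.
```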

To unlock greater business value from AI, organizations need to move beyond real-time, human-driven workloads. As illustrated in the diagram below, workloads that operate with more autonomy - processing data independently on their own schedule - will deliver higher value to the organization.

[Diagram: business value increases as workloads move from interactive, human-driven assistance toward autonomous, scheduled processing]

The Economics of Batch Inference

These autonomous workloads - often called batch inference - represent a fundamental shift in how AI delivers value. Instead of processing requests one at a time as users wait, batch workloads process thousands or millions of inputs in parallel. This approach unlocks significant cost advantages: providers such as Doubleword can offer discounts of up to 75% for batch processing.

More importantly, batch inference enables AI to tackle problems that would be economically impossible in real-time - like analyzing every customer interaction from the past month, processing an entire claims backlog, or screening thousands of medical images overnight. The economics are compelling - the same AI capability that costs $X per real-time query costs a fraction of that when run as a scheduled batch job, making previously cost-prohibitive use cases suddenly viable at scale.
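As a back-of-the-envelope illustration of that claim (the per-token price below is a placeholder, not a published rate), the saving compounds quickly at volume:

```python
# Illustrative cost arithmetic only; the per-token price is a placeholder.
requests_per_month = 1_000_000
tokens_per_request = 2_000          # prompt + completion
price_per_1k_tokens = 0.002         # hypothetical real-time price, USD
batch_discount = 0.75               # "up to 75%" cheaper when batched

realtime_cost = requests_per_month * tokens_per_request / 1_000 * price_per_1k_tokens
batch_cost = realtime_cost * (1 - batch_discount)

print(f"real-time: ${realtime_cost:,.0f}/month, batch: ${batch_cost:,.0f}/month")
# real-time: $4,000/month, batch: $1,000/month
```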

Let's walk through a practical example to see how a single use case evolves from interactive assistance to autonomous operation.

Illustrative Use Case: AI Powered Radiology Assistance

  1. Stage 1 - Interactive Assistant: Radiologists use an AI chatbot while reviewing scans, asking questions and receiving real-time suggestions. This accelerates individual reviews but remains constrained by human speed and attention.
  2. Stage 2 - Scheduled Triage: The department shifts to overnight batch processing. The AI analyzes hundreds of scans, flags critical findings, and prioritizes the morning worklist. Urgent cases are surfaced first and potential abnormalities are pre-identified, making their reviews more efficient.
  3. Stage 3 - Draft Generation: As trust in the system grows, the AI generates complete draft reports - findings, measurements, preliminary diagnoses. The AI handles 80% of the analytical work; humans review and approve the AI-generated work, ensuring accuracy and addressing edge cases.
  4. Stage 4 - Autonomous Screening: Eventually, after robust evaluation and with proper guardrails, the AI operates fully autonomously - processing thousands of scans weekly, automatically clearing normal cases, and only surfacing abnormal findings to specialists.

Taking Action: Building for Autonomous Workloads

The path to transformational AI value requires a strategic shift in how organizations architect their AI systems. If your team is still primarily using AI through interactive chatbots and real-time assistants, you're likely leaving significant business value on the table. The question isn't whether to move toward autonomous workloads - it's how quickly you can make that transition while maintaining quality and trust.

As organizations begin deploying these autonomous AI pipelines, the technical requirements become critical. You need consistent performance across large job queues, transparent monitoring and error handling, and pricing models that make high-volume processing economically viable. Doubleword is built specifically to meet these requirements - offering batch inference infrastructure designed for reliability at scale.
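In practice that means treating job status, failure, and retry as first-class concerns. The polling loop below is an illustrative sketch assuming an OpenAI-style batches interface; it is not a documented Doubleword endpoint.

```python
# Illustrative monitoring loop for a long-running batch job.
# Assumes an OpenAI-style Batch API; alerting and retries are sketched only.
import time
from openai import OpenAI

client = OpenAI(base_url="https://example-inference.internal/v1", api_key="...")

def wait_for_batch(batch_id: str, poll_seconds: int = 300) -> str:
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            return batch.output_file_id          # results ready for download
        if batch.status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"batch {batch_id} ended with status {batch.status}")
        time.sleep(poll_seconds)                 # still validating or in progress
```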

Ready to explore autonomous AI workloads for your organization?

We're opening a private preview for companies looking to move beyond real-time inference and unlock the full potential of batch processing. Sign up here to learn how to architect reliable, cost-effective AI systems that operate at scale.

Benchmarking the Doubleword Control Layer

· 14 min read
Fergus Finn
Founder & Member of Technical Staff, Doubleword

Control Layer Benchmarking

Benchmarking is hard.

We think our Control Layer (dwctl) is the fastest AI gateway around. We believe this because it's written in Rust¹, and because we thought about performance a lot while we were building it. We put it in production in our self-hosted inference stack, and we knew that it was fast because we didn't notice it.

It's so good that we are open sourcing it. And once it's out there, it can be used in lots of different places, in lots of different ways. And so, to prove that it will be fast everywhere, we have to do benchmarking².

Footnotes

  1. And therefore blazing fast.

  2. The usual caveats about general case benchmarks apply: the only realistic benchmarks are built by you, the user, since only you know what your application looks like. Every highly technical business for whom performance is a proof point eventually releases a weary blog post talking about how performance is multifaceted and can't be captured by simple benchmarks. See here, here, here, here for interesting content.

Understanding Chargeback in the Context of Self-Hosted Systems

· 7 min read
Amanda Milberg
Principal Solutions Engineer, Doubleword

Introduction

When technology infrastructure—such as GPUs and servers—is owned and managed by a central IT team, the need to allocate costs back to the business units that benefit from these resources becomes a critical consideration. This is particularly relevant in the context of self-hosting AI models, where the initial investment in high-performance GPUs, servers, and supporting infrastructure can be substantial. Without a clear chargeback mechanism, it becomes difficult to ensure accountability, optimize resource usage, and justify the ROI of such investments.

So, how do you design a chargeback system that is scalable, transparent, and easy to manage as your organization grows from supporting a handful of users to thousands of downstream business units? In this guide, we’ll explore how to architect and implement a chargeback system that not only integrates seamlessly with your existing AI infrastructure but also provides clear visibility into costs and benefits. By doing so, you can ensure that the value of your AI investments is both measurable and aligned with business goals.
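The core mechanism can stay simple: meter usage per business unit and apply an allocation rule. The log fields, cost figure, and proportional rule in the sketch below are illustrative assumptions, not a prescribed design.

```python
# Minimal chargeback sketch: meter token usage per business unit, then
# allocate a fixed monthly infrastructure cost in proportion to usage.
# The log format, cost figure, and proportional rule are illustrative assumptions.
from collections import defaultdict

MONTHLY_INFRA_COST = 50_000.0  # e.g. GPU servers, power, support (USD)

def chargeback(usage_records):
    """usage_records: iterable of dicts like
    {"business_unit": "claims", "prompt_tokens": 1200, "completion_tokens": 300}"""
    tokens_by_unit = defaultdict(int)
    for rec in usage_records:
        tokens_by_unit[rec["business_unit"]] += rec["prompt_tokens"] + rec["completion_tokens"]
    total = sum(tokens_by_unit.values()) or 1
    return {unit: MONTHLY_INFRA_COST * t / total for unit, t in tokens_by_unit.items()}
```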

Choosing the Right Model for the Use Case

· 6 min read
Amanda Milberg
Principal Solutions Engineer, Doubleword

Introduction

Selecting the right AI model for deployment is a critical decision that can significantly impact the performance, cost, and user experience of your application. With a wide variety of models available—each with unique strengths and trade-offs—it’s essential to evaluate them carefully based on relevant criteria. In this post, we’ll explore the three key factors to consider when comparing models for deployment: quality, cost, and speed. Understanding how these factors interact and influence your application will help you make informed choices that align with your technical requirements and business goals.
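One lightweight way to make the interaction between these factors explicit is to score candidate models on each axis and weight the axes by what the use case values. The weights and scores in the sketch below are purely illustrative.

```python
# Illustrative weighted scoring across the three axes discussed above: quality,
# cost, and speed. Scores are normalised 0-1; weights reflect the use case.
def rank_models(models, weights):
    return sorted(models,
                  key=lambda m: sum(weights[axis] * m[axis] for axis in weights),
                  reverse=True)

candidates = [  # hypothetical, normalised scores
    {"name": "model-a", "quality": 0.9, "cost": 0.4, "speed": 0.5},
    {"name": "model-b", "quality": 0.7, "cost": 0.9, "speed": 0.8},
]
print(rank_models(candidates, {"quality": 0.5, "cost": 0.3, "speed": 0.2}))
```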

Behind the Stack, Ep 10: Batched Endpoints

· 4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Cost Challenge in LLM Workloads

Running LLMs at scale can be expensive. Whether you’re building customer-facing chatbots, document extraction pipelines, or research tools, token usage can balloon into thousands of dollars quickly. While infrastructure teams often focus on throughput optimizations (batching requests on the GPU, prefix caching, etc.), there’s another lever to pull: endpoint design. One of the most powerful - and under-discussed - endpoint types is the batched endpoint. Instead of prioritizing instant responses, batched endpoints trade latency for cost, cutting your LLM bill in half (or more in some cases).

In this blog, we’ll cover:

  • What batched endpoints are and how they differ from standard APIs
  • How providers reduce costs behind the scenes
  • Advanced optimization strategies (spot instances, prefix caching, request reordering)
  • How to self-host your own batched endpoint
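As a preview of what calling one looks like, here is a minimal submission sketch in the style of the OpenAI-compatible Batch API; the file contents and model name are illustrative, and your provider's exact request shape may differ.

```python
# Minimal sketch of submitting work to an OpenAI-style batched endpoint.
# Each line of the JSONL file is one self-contained chat completion request.
import json
from openai import OpenAI

client = OpenAI()  # or point base_url at a self-hosted, OpenAI-compatible stack

with open("requests.jsonl", "w") as f:
    for i, doc in enumerate(["doc one ...", "doc two ..."]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",  # illustrative model name
                     "messages": [{"role": "user", "content": f"Summarise: {doc}"}]},
        }) + "\n")

input_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=input_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)  # poll later; results arrive as an output file
```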

Behind the Stack, Ep 9: How to Evaluate Open Source LLMs

· 4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Hidden Challenge in LLM Selection

Choosing the right LLM for your workload isn’t just about picking the latest open-source release or switching to a cheaper closed model. If you’re self-hosting language models - whether for RAG pipelines, agents, or fine-tuned data tasks - knowing how good a model is (and compared to what) is a critical part of the decision.

Most teams rely on academic benchmarks like MMLU, ARC, or HumanEval. But these don’t always reflect real-world usage. Benchmark scores may go up while actual task performance stays flat.

The only way to evaluate models with complete confidence would be to build an in-house evaluation pipeline tailored to your exact use case. That means defining your task - whether it's data extraction, question answering, or multi-step reasoning - then collecting example documents, crafting queries, running each model in a controlled environment, and comparing results against a gold standard set you’ve manually verified.

This lets you directly compare open and closed-source models on your terms. But there's a catch: it’s incredibly time-consuming, complex, and expensive to do well.
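To show the shape of such a pipeline rather than prescribe one, a stripped-down harness can be as small as the sketch below; the gold set, scoring rule (exact match), endpoint, and model names are placeholder assumptions.

```python
# Stripped-down evaluation harness: run every candidate model over a gold set
# and score with a simple exact-match rule. All names here are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-inference.internal/v1", api_key="...")

gold_set = [  # manually verified examples for *your* task
    {"query": "Extract the invoice total from: ... Total: $1,200 ...", "answer": "$1,200"},
]

def accuracy(model: str) -> float:
    hits = 0
    for ex in gold_set:
        resp = client.chat.completions.create(
            model=model, temperature=0,
            messages=[{"role": "user", "content": ex["query"]}])
        hits += ex["answer"] in resp.choices[0].message.content
    return hits / len(gold_set)

for model in ["open-model-a", "closed-model-b"]:
    print(model, accuracy(model))
```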

Behind the Stack, Ep 8: Choosing the Right Inference Engine for Your LLM Deployment

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Hidden Cost of Choosing the Wrong Inference Engine

Inference engines are the backbone of self-hosted LLM stacks. They’re responsible for turning model weights into real-time, token-by-token output.

But here's the trap: most people choose one based on benchmark scores - and completely miss the bigger picture.

In reality, the best inference engine for your deployment depends on who’s using it, where it’s running, and how often it’s being called. That means the trade-offs between engines like Llama.cpp and vLLM go far beyond just speed. While the Doubleword Stack supports all major inference engines, selecting the best one still depends on your specific workload characteristics.

In this guide, we break down:

  • The two major deployment patterns for LLM inference
  • What each pattern demands from your engine
  • Which open-source projects are optimized for each
  • And how to choose the right engine for your stack

Behind the Stack, Ep 7: Choosing the Right Quantization for Self-Hosted LLMs

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: Quantization Isn’t Just About Memory - It’s About Making LLMs Practical

Large Language Models (LLMs) are incredibly powerful but also incredibly resource-hungry. Running them efficiently, especially on self-hosted infrastructure, requires squeezing every bit of performance out of limited compute and memory. That’s where quantization comes in.

At its core, quantization is the process of reducing the precision of numerical values in a model - from 16-bit floats to 8-bit, 4-bit, or even lower. This seemingly simple change has huge implications: lower memory usage, faster inference, and reduced costs.

It typically applies to two things:

  • Weights: the learned, static parameters of the model
  • Activations: the dynamic, intermediate values produced at each layer as the model processes input

Activations vary with every inference and can consume significant memory - especially for long prompts - while weights remain fixed. Compressing either (or both) can bring efficiency gains, but with different trade-offs.
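The weight side of the saving is easy to estimate up front. Here is a rough sizing sketch; the parameter count and precision formats are illustrative.

```python
# Rough weight-memory estimate at different precisions (weights only;
# activations and KV cache are extra and depend on sequence length and batch size).
params = 70e9  # e.g. a 70B-parameter model

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")
# fp16: ~140 GB, int8: ~70 GB, int4: ~35 GB
```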

And here’s the catch: not all quantization methods benefit all workloads equally. Choosing between weight-only quantization and full weight+activation quantization isn’t just a technical decision - it’s a strategic one that depends on your model architecture, input/output patterns, and the hardware you’re running on.

This blog walks through how to choose the right quantization strategy for your specific use case - so you can cut costs and improve performance without falling into common traps.

Behind the Stack, Ep 6: How to Speed up the Inference of AI Agents

· 6 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction: The Latency Problem in AI Agents

AI agents are transforming everything from customer support to autonomous workflows. But under the hood, most AI agent architectures suffer from one major problem: growing latency and cost at scale.

Each reasoning step adds more tokens to the input, and because most systems (especially API-based or naive self-hosted setups) resend the entire prompt history on every call, AI agents end up:

  • Repeating compute from earlier steps
  • Wasting GPU cycles
  • Scaling inference cost and latency quadratically

Even modern caching APIs fall short - they don’t cache intermediate thoughts, tool results, or agent memory effectively.

The Solution? Prefix Caching for AI Agents

Prefix caching is a feature available in advanced self-hosted AI inference engines like vLLM, SGLang, and TGI. It allows your AI agents to reuse previously computed context efficiently, cutting down latency and cost - without changing the logic of your agent.
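As one concrete example, vLLM exposes prefix caching as an engine option; the model name and agent prompts below are illustrative, and the flag reflects vLLM's offline LLM API at the time of writing.

```python
# Sketch: reusing a shared, growing prefix across agent steps with vLLM prefix caching.
# Model name and prompts are illustrative; enable_prefix_caching is a vLLM engine flag.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=256)

history = "SYSTEM: You are a research agent with access to tools...\n"
for step in range(3):
    out = llm.generate([history + f"STEP {step}: decide the next action."], params)
    history += out[0].outputs[0].text + "\n"
    # Each call shares the growing prefix; its KV cache is reused instead of recomputed.
```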

In this post, you’ll learn:

  • Why traditional AI agent chains are inefficient
  • How prefix caching works inside LLM inference
  • When and how to deploy it
  • What infrastructure patterns support it best

If you're running multi-step AI agents, this is a foundational optimization strategy.

Behind the Stack, Ep 5: Making RAG Work for Multimodal Documents

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:

  • Tables with nested headers, merged cells, or embedded footnotes
  • Charts and images that convey critical insights
  • Layout-heavy formats like invoices, reports, or scanned documents

When such content passes through standard RAG pipelines, the results are often poor - irrelevant retrieval and hallucinated outputs during generation.

This post explores practical strategies to enable accurate retrieval and grounded generation from messy, multimodal documents. We focus on two key stages:

  1. Retrieval – How to index and surface relevant content that isn’t just plain text
  2. Generation – How to present structured or visual content to an LLM for high-quality answers

We’ll cover proven architectures, model recommendations, and implementation details used in real-world production systems.
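As one example of the generation half, an OpenAI-style multimodal chat request can carry a retrieved page image alongside the question; the endpoint, model name, and upstream retrieval step here are assumptions rather than a specific recommendation.

```python
# Sketch of the generation stage for a retrieved page image (multimodal RAG).
# Assumes an OpenAI-compatible vision model; retrieval of page.png happens upstream.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://example-inference.internal/v1", api_key="...")

with open("page.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="vision-llm",  # placeholder model name
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What is the total liability stated in this table?"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
    ]}],
)
print(resp.choices[0].message.content)
```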