Skip to main content

Efficient Batch and Interactive LLM Inference at Scale with llm-d

ยท 5 min read
Lior Aronovich
Lior Aronovich
Senior Principal Software Engineer, Red Hat
Raymond Zhao
Raymond Zhao
Principal Software Engineer, Red Hat
Jooyeon Mok
Jooyeon Mok
Software Engineer, Red Hat
Nili Guy
Nili Guy
R&D Manager, AI Infrastructure, IBM

As organizations deploy AI applications in production, their inference infrastructure must serve two fundamentally different workloads simultaneously: interactive requests requiring immediate replies, and batch inference jobs that process thousands of requests with a time tolerance of hours for receiving results. Use cases for batch inference include autonomous background agents performing multi-step reasoning and deep research, as well as user-initiated workloads like offline evaluations, dataset processing, and embedding generation.

In batch inference, the goal is to maximize throughput across a large volume of requests while meeting defined completion time targets, without interfering with interactive traffic. Non-urgent inference can fill GPU capacity during periods of lower interactive traffic, increasing infrastructure utilization. Users can also take advantage of differential billing between batch and interactive workloads for cost-optimized processing.

Batch Gateway brings first-class batch inference capabilities to llm-d. It provides an OpenAI-compatible API for submitting, tracking, and managing large-scale batch jobs, running efficiently alongside interactive inference workloads on shared infrastructure. With OpenAI API compatibility, users can migrate existing OpenAI batch scripts with minimal changes.

The challenge: batch and interactive workloads on shared infrastructureโ€‹

When batch and interactive inference workloads compete for the same GPU resources without purpose-built tools, the outcomes are typically poor:

  • Letting batch workloads degrade interactive performance is unacceptable for production services.
  • Batch requests evict KV-cache entries needed by interactive workloads, forcing costly prefill reconstruction.
  • Dedicating separate GPU pools for batch workloads is expensive and wasteful.
  • Manually throttling batch workloads is operationally burdensome.

Batch Gateway is designed to solve this through built-in adaptive concurrency control and integration with llm-d's routing and scheduling components. Together, these mechanisms dynamically adjust batch flow based on available capacity, protect interactive traffic under load, and prioritize jobs based on SLO targets. The result is that batch jobs make steady progress toward their completion targets without interfering with interactive traffic.

Batch Gateway is production-grade, designed for shared multi-tenant environments where security, reliability, observability, and SLO compliance are essential.

Batch Gateway is part of llm-d, a CNCF Sandbox project and open source Kubernetes-native framework for high-performance distributed LLM inference. Batch Gateway integrates with llm-d's components, which means that batch workloads automatically benefit from llm-d's efficient inference capabilities, such as intelligent request routing, flow-control, and KV-cache reuse.

How Batch Gateway worksโ€‹

Batch Gateway is a Kubernetes-native system composed of several components:

Batch Gateway architecture diagram

API Serverโ€‹

The API server exposes OpenAI-compatible /v1/batches and /v1/files endpoints, providing the same interface that users and applications already use for batch processing.

Data layerโ€‹

Batch Gateway uses pluggable storage backends for different functions. Each function is backed by a single plug-in, chosen at deployment time.

FunctionAvailable plug-ins
Jobs and files metadata storagePostgreSQL (for production), Redis, Valkey
Priority queue for jobsRedis, Valkey
Event channelsRedis, Valkey
Jobs status updatesRedis, Valkey
File storage for input and output filesS3, Filesystem

Expired batch jobs and their associated files are periodically cleaned up.

Batch Processorโ€‹

The batch processor pulls jobs from a priority queue, retrieves the input files, builds execution plans, and dispatches individual inference requests concurrently for downstream processing. As inference results come back, the processor writes them to an output file, and continuously updates the job's status.

The processor sorts requests by system-prompt hash so that identical-prefix requests hit the inference engine contiguously, keeping cached prefix blocks hot and avoiding eviction-triggered prefill reconstruction. Combined with llm-d's prefix-cache-aware routing, cache reuse extends across the entire serving pool.

The processor listens for job events such as cancellation, enabling real-time control over in-flight work. In addition, the system handles recovery from crashes and failures during processing.

Getting startedโ€‹

To learn more about Batch Gateway, check out the following resources:

Get involved with llm-dโ€‹

Batch Gateway is developed in the open as part of the llm-d ecosystem. If you're running LLM inference at scale and need batch processing capabilities, we'd love to have you involved.