Skip to main content

One post tagged with "Batch Inference"

Batch inference processing for LLM workloads

View All Tags

Efficient Batch and Interactive LLM Inference at Scale with llm-d

· 5 min read
Lior Aronovich
Lior Aronovich
Senior Principal Software Engineer, Red Hat
Raymond Zhao
Raymond Zhao
Principal Software Engineer, Red Hat
Jooyeon Mok
Jooyeon Mok
Software Engineer, Red Hat
Nili Guy
Nili Guy
R&D Manager, AI Infrastructure, IBM

As organizations deploy AI applications in production, their inference infrastructure must serve two fundamentally different workloads simultaneously: interactive requests requiring immediate replies, and batch inference jobs that process thousands of requests with a time tolerance of hours for receiving results. Use cases for batch inference include autonomous background agents performing multi-step reasoning and deep research, as well as user-initiated workloads like offline evaluations, dataset processing, and embedding generation.

In batch inference, the goal is to maximize throughput across a large volume of requests while meeting defined completion time targets, without interfering with interactive traffic. Non-urgent inference can fill GPU capacity during periods of lower interactive traffic, increasing infrastructure utilization. Users can also take advantage of differential billing between batch and interactive workloads for cost-optimized processing.

Batch Gateway brings first-class batch inference capabilities to llm-d. It provides an OpenAI-compatible API for submitting, tracking, and managing large-scale batch jobs, running efficiently alongside interactive inference workloads on shared infrastructure. With OpenAI API compatibility, users can migrate existing OpenAI batch scripts with minimal changes.