2 posts tagged with "Inference"

Efficient Batch and Interactive LLM Inference at Scale with llm-d

June 12, 2026 · 5 min read

Lior Aronovich

Senior Principal Software Engineer, Red Hat

Raymond Zhao

Principal Software Engineer, Red Hat

Jooyeon Mok

Software Engineer, Red Hat

Nili Guy

R&D Manager, AI Infrastructure, IBM

As organizations deploy AI applications in production, their inference infrastructure must serve two fundamentally different workloads simultaneously: interactive requests requiring immediate replies, and batch inference jobs that process thousands of requests with a time tolerance of hours for receiving results. Use cases for batch inference include autonomous background agents performing multi-step reasoning and deep research, as well as user-initiated workloads like offline evaluations, dataset processing, and embedding generation.

In batch inference, the goal is to maximize throughput across a large volume of requests while meeting defined completion time targets, without interfering with interactive traffic. Non-urgent inference can fill GPU capacity during periods of lower interactive traffic, increasing infrastructure utilization. Users can also take advantage of differential billing between batch and interactive workloads for cost-optimized processing.

Batch Gateway brings first-class batch inference capabilities to llm-d. It provides an OpenAI-compatible API for submitting, tracking, and managing large-scale batch jobs, running efficiently alongside interactive inference workloads on shared infrastructure. With OpenAI API compatibility, users can migrate existing OpenAI batch scripts with minimal changes.

Predicted-Latency Based Scheduling for LLMs

March 13, 2026 · 28 min read

Kaushik Mitra

Software Engineer, Google

Benjamin Braun

Software Engineer, Google

Abdullah Gharaibeh

Senior Staff Software Engineer, Google

Clayton Coleman

Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.