Skip to main content

2 posts tagged with "Inference"

LLM inference serving and optimization

View All Tags

Efficient Batch and Interactive LLM Inference at Scale with llm-d

· 5 min read
Lior Aronovich
Lior Aronovich
Senior Principal Software Engineer, Red Hat
Raymond Zhao
Raymond Zhao
Principal Software Engineer, Red Hat
Jooyeon Mok
Jooyeon Mok
Software Engineer, Red Hat
Nili Guy
Nili Guy
R&D Manager, AI Infrastructure, IBM

As organizations deploy AI applications in production, their inference infrastructure must serve two fundamentally different workloads simultaneously: interactive requests requiring immediate replies, and batch inference jobs that process thousands of requests with a time tolerance of hours for receiving results. Use cases for batch inference include autonomous background agents performing multi-step reasoning and deep research, as well as user-initiated workloads like offline evaluations, dataset processing, and embedding generation.

In batch inference, the goal is to maximize throughput across a large volume of requests while meeting defined completion time targets, without interfering with interactive traffic. Non-urgent inference can fill GPU capacity during periods of lower interactive traffic, increasing infrastructure utilization. Users can also take advantage of differential billing between batch and interactive workloads for cost-optimized processing.

Batch Gateway brings first-class batch inference capabilities to llm-d. It provides an OpenAI-compatible API for submitting, tracking, and managing large-scale batch jobs, running efficiently alongside interactive inference workloads on shared infrastructure. With OpenAI API compatibility, users can migrate existing OpenAI batch scripts with minimal changes.

Predicted-Latency Based Scheduling for LLMs

· 28 min read
Kaushik Mitra
Software Engineer, Google
Benjamin Braun
Software Engineer, Google
Abdullah Gharaibeh
Senior Staff Software Engineer, Google
Clayton Coleman
Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.