Disaggregated LLM Serving Beats Aggregated Deployments in New Tests

New tests show a disaggregated LLM serving setup outperforming an aggregated deployment running on twice the GPUs, delivering lower request latency and higher sustained throughput. The results point to more efficient, more responsive enterprise AI services.

Disaggregated serving, a newer approach to running large language models, is showing promise in delivering more efficient and predictable performance than traditional aggregated deployments. Recent tests on Oracle Cloud Infrastructure (OCI) using AMD MI300X hardware suggest this disaggregated model, powered by llm-d, can deliver a better user experience for enterprise AI services. The core idea is to move beyond basic inference toward a platform built for real-world production demands.

Disaggregation Outperforms Aggregation in Tests

Initial findings indicate that a disaggregated llm-d setup on OCI with AMD MI300X GPUs achieved lower end-to-end request latency than a larger aggregated deployment. Specifically, a two-node, 16-GPU disaggregated configuration sustained higher request rates than a four-node, 32-GPU aggregated setup. This points to more efficient use of resources: the prefill and decode phases of LLM inference are handled separately, avoiding the performance bottlenecks of traditional continuous batching.
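
To make the split concrete, here is a minimal Python sketch of the pattern (the class names and round-robin routing are illustrative assumptions, not the llm-d API): one small pool handles the compute-bound prefill phase, a separate pool handles the memory-bound decode phase, and a KV-cache handle is passed between them.

```python
# Minimal sketch of prefill/decode disaggregation (class names and the
# round-robin routing are illustrative assumptions, not the llm-d API).
# One pool handles the compute-bound prefill phase, a separate pool handles
# the memory-bound decode phase, and a KV-cache handle is passed between them.

from dataclasses import dataclass
import itertools

@dataclass
class KVCacheHandle:
    """Reference to a prefilled KV cache, transferable between workers."""
    request_id: int
    prompt_tokens: int

class PrefillWorker:
    def run(self, request_id: int, prompt: str) -> KVCacheHandle:
        # Compute-heavy: process the entire prompt in one pass.
        return KVCacheHandle(request_id, prompt_tokens=len(prompt.split()))

class DecodeWorker:
    def run(self, handle: KVCacheHandle, max_new_tokens: int) -> str:
        # Bandwidth-heavy: generate tokens one at a time against the
        # transferred KV cache.
        return f"<{max_new_tokens} generated tokens for request {handle.request_id}>"

class DisaggregatedRouter:
    """Routes each phase to its own pool instead of batching both together."""
    def __init__(self, prefill_pool, decode_pool):
        self.prefill = itertools.cycle(prefill_pool)
        self.decode = itertools.cycle(decode_pool)

    def serve(self, request_id: int, prompt: str, max_new_tokens: int) -> str:
        handle = next(self.prefill).run(request_id, prompt)   # phase 1: prefill
        return next(self.decode).run(handle, max_new_tokens)  # phase 2: decode

router = DisaggregatedRouter(
    [PrefillWorker() for _ in range(2)],   # fewer compute-heavy prefill workers
    [DecodeWorker() for _ in range(6)],    # more bandwidth-bound decode workers
)
print(router.serve(1, "Summarize the quarterly report", max_new_tokens=128))
```

Because decode is typically the longer, bandwidth-limited phase, a real deployment can size the two pools independently, which is the kind of resource efficiency the OCI results point to.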

  • llm-d is designed as a Kubernetes-native serving stack, aiming for high-performance, distributed inference.

  • The platform provides orchestration over model servers such as vLLM, with features like prefix-cache aware routing and utilization-based load balancing (sketched after this list).

  • Recent releases, like llm-d 0.3, have focused on making disaggregation support more portable and improving throughput, reaching significant token-per-second rates on multi-node clusters.
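
The sketch below shows how prefix-cache aware routing of this kind can work; the Pod structure, block size, and scoring are hypothetical stand-ins rather than llm-d's actual scheduler. Prompts are hashed at block granularity, and a request goes to the pod already holding the longest matching prefix, falling back to the least-loaded pod (the utilization-based behavior).

```python
# Hedged sketch of prefix-cache aware routing with a utilization-based
# fallback (hypothetical names and data structures, not the llm-d scheduler).

import hashlib

BLOCK = 16  # tokens per cache block (assumed block size)

def block_hashes(tokens: list[str]) -> list[str]:
    """Cumulatively hash the prompt at block granularity."""
    hashes, h = [], hashlib.sha256()
    for i, tok in enumerate(tokens, 1):
        h.update(tok.encode())
        if i % BLOCK == 0:
            hashes.append(h.hexdigest())
    return hashes

class Pod:
    def __init__(self, name: str):
        self.name, self.cached, self.load = name, set(), 0

def route(prompt: str, pods: list[Pod]) -> Pod:
    prefixes = block_hashes(prompt.split())

    def cached_blocks(pod: Pod) -> int:
        # Count leading prompt blocks this pod already holds in its KV cache.
        n = 0
        for h in prefixes:
            if h not in pod.cached:
                break
            n += 1
        return n

    # Prefer the pod with the longest cached prefix; break ties by low load.
    best = max(pods, key=lambda p: (cached_blocks(p), -p.load))
    best.cached.update(prefixes)   # pod now holds these blocks
    best.load += 1
    return best

pods = [Pod("vllm-0"), Pod("vllm-1")]
p1 = route("system prompt " * 20 + "question A", pods)
p2 = route("system prompt " * 20 + "question B", pods)  # same prefix -> same pod
print(p1.name, p2.name)
```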

Caching Strategies Tackle Cost and Latency

Beyond serving architecture, advancements in caching are also key to optimizing LLM inference. Techniques like KV cache reuse, where data from previous computations is stored and reused, can dramatically cut down costs and improve latency.
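
The mechanics are straightforward to sketch. In the toy example below (the attention_kv stand-in and whole-prefix cache keys are illustrative; production systems cache fixed-size token blocks), a request that shares a cached prefix only computes K/V entries for its new suffix:

```python
# Minimal sketch of KV cache reuse (toy assumptions: a stand-in attention_kv
# function and whole-prefix cache keys; real systems store fixed-size blocks).

kv_store: dict[tuple[str, ...], list[str]] = {}

def attention_kv(token: str) -> str:
    """Stand-in for the expensive per-token key/value computation."""
    return f"kv({token})"

def prefill(tokens: list[str]) -> list[str]:
    # Find the longest prefix of this prompt that is already cached.
    kv, start = [], 0
    for end in range(len(tokens), 0, -1):
        if tuple(tokens[:end]) in kv_store:
            kv, start = list(kv_store[tuple(tokens[:end])]), end
            break
    # Compute K/V only for the uncached suffix, caching each new prefix.
    for i, tok in enumerate(tokens[start:], start + 1):
        kv.append(attention_kv(tok))
        kv_store[tuple(tokens[:i])] = list(kv)
    print(f"computed {len(tokens) - start} of {len(tokens)} tokens")
    return kv

prefill("you are a helpful assistant ; summarize this report".split())
prefill("you are a helpful assistant ; translate this report".split())  # reuses prefix
```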

  • IBM Storage Scale is being explored as a backend for storage-backed KV caching, aiming to provide near-DRAM latency benefits without the high cost of in-memory caches. This approach promises a more favorable balance between performance and economics.

  • Other caching strategies include exact match and semantic match, where prompts and their embeddings are stored. This allows systems to retrieve previously generated responses for similar queries, reducing computation and response times.
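
A minimal sketch of both strategies follows, using a toy bag-of-words embedding in place of a real sentence-embedding model; the ResponseCache class and similarity threshold are illustrative assumptions, not a particular product's API.

```python
# Hedged sketch of exact-match and semantic-match response caching. An exact
# hit returns a stored response directly; a semantic hit returns the response
# of a previously seen prompt whose embedding is close enough.

import math

def embed(text: str) -> dict[str, float]:
    # Toy bag-of-words embedding; real systems use a sentence-embedding model.
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ResponseCache:
    def __init__(self, threshold: float = 0.8):
        self.exact: dict[str, str] = {}
        self.semantic: list[tuple[dict[str, float], str]] = []
        self.threshold = threshold

    def get(self, prompt: str) -> str | None:
        if prompt in self.exact:                      # exact match
            return self.exact[prompt]
        query = embed(prompt)
        for vec, response in self.semantic:           # semantic match
            if cosine(query, vec) >= self.threshold:
                return response
        return None

    def put(self, prompt: str, response: str) -> None:
        self.exact[prompt] = response
        self.semantic.append((embed(prompt), response))

cache = ResponseCache()
cache.put("what is the capital of France?", "Paris.")
print(cache.get("what is the capital of France?"))   # exact hit
print(cache.get("what is the capital of France"))    # semantic hit
```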

A Growing Ecosystem for Efficient LLM Serving

The push for optimized LLM inference is fostering a diverse ecosystem of tools and frameworks. llm-d positions itself as a Kubernetes-native solution, integrating with popular model servers like vLLM and offering advanced orchestration for scalable, observable inference.

Other frameworks contributing to this space include:

  • AIBrix: An orchestration and control plane for cloud-native LLM serving.

  • SGLang: Focuses on programmable control for complex LLM workflows.

  • Hugging Face TGI (Text Generation Inference): A production-ready serving platform favored by enterprises.

These developments highlight a broader industry effort to bridge the gap between LLM research and production-ready deployments, addressing challenges like memory bandwidth bottlenecks and the computational intensity of prefill phases.
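
A back-of-the-envelope calculation shows why the two phases hit different limits. Using rough figures for a 70B-parameter FP16 model on MI300X-class hardware (illustrative assumptions, not measurements from these tests), a long prefill pass is compute-bound while single-token decode is memory-bandwidth-bound:

```python
# Rough roofline-style estimate of prefill vs decode bottlenecks. All numbers
# are illustrative assumptions (~70B FP16 model, MI300X-class GPU specs), not
# results from the article's benchmarks.

PARAMS = 70e9          # model parameters (assumed)
BYTES_PER_PARAM = 2    # FP16 weights
PEAK_FLOPS = 1.3e15    # ~1.3 PFLOPS FP16 peak (assumed)
PEAK_BW = 5.3e12       # ~5.3 TB/s HBM bandwidth (assumed)

def phase_time(tokens_in_flight: int) -> tuple[float, float]:
    """Rough compute vs memory time for one forward pass (~2 FLOPs/param/token)."""
    compute = 2 * PARAMS * tokens_in_flight / PEAK_FLOPS
    memory = PARAMS * BYTES_PER_PARAM / PEAK_BW  # weights read once per pass
    return compute, memory

for label, tokens in [("prefill (2048-token prompt)", 2048), ("decode (1 token)", 1)]:
    c, m = phase_time(tokens)
    bound = "compute" if c > m else "memory-bandwidth"
    print(f"{label:>28}: compute {c*1e3:7.2f} ms vs memory {m*1e3:6.2f} ms -> {bound}-bound")
```

This asymmetry is exactly what disaggregation exploits: prefill and decode can be scheduled on pools sized to their respective bottlenecks instead of contending in one batch.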

Frequently Asked Questions

Q: What is the new approach to serving AI models called?
It is called 'disaggregated serving', in contrast to the traditional 'aggregated' approach. Recent tests suggest it can run large language models faster and more predictably.
Q: Why is disaggregated serving better than aggregated serving?
In the OCI tests, it delivered more performance from fewer resources: a disaggregated setup with 16 GPUs sustained higher request rates and lower latency than an aggregated setup with 32 GPUs, so AI services can respond more quickly at lower cost.
Q: How does disaggregated serving work differently?
It handles the prefill and decode phases of inference on separate worker pools, avoiding the bottlenecks of traditional continuous batching. Orchestration stacks like llm-d manage this split.
Q: Are there other ways to make AI models run better?
Yes. Caching techniques that store and reuse data from earlier computations, such as KV cache reuse and semantic response caching, can also cut costs and response times. IBM, for example, is exploring storage-backed KV caching with IBM Storage Scale.
Q: Who is developing these new AI serving methods?
A broad ecosystem of projects is involved, including llm-d, AIBrix, SGLang, and Hugging Face TGI, all aimed at making LLM inference efficient in real-world production use.