Microsoft AI Research Unveils DeepSpeed-FastGen: Elevating LLM Serving Efficiency with Innovative Dynamic SplitFuse Technique
Large language models (LLMs) have revolutionized a wide range of AI-infused applications, from chat models to autonomous driving. This evolution has spurred the need for systems that can deploy and serve these models efficiently, especially as demand grows for handling long-prompt workloads. The major hurdle in this domain has been balancing high throughput and low latency in…
