
MARLIN is a new kernel that substantially speeds up large language model (LLM) inference. On NVIDIA A10 GPUs it achieves a near-optimal 3.87x speedup for batch sizes up to 32, supports 2:4 sparsity, and is integrated with the vLLM project. The work is led by Elias Frantar and Roberto L. Castro, with contributions from Neural Magic. MARLIN's design targets mixed-precision auto-regressive parallel inference, delivering up to 4x acceleration for batch sizes between 16 and 32 and notable improvements for larger batches.
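As a rough illustration of the mixed-precision scheme MARLIN accelerates (FP16 activations multiplied by 4-bit quantized weights with per-group scales), here is a minimal NumPy sketch of the dequantize-then-multiply step. This is a conceptual stand-in, not MARLIN's fused CUDA kernel: the group size of 128 and the simple symmetric quantization scheme are illustrative assumptions.

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Symmetric 4-bit quantization with one FP16 scale per group.

    Illustrative assumptions: group_size=128 and a plain symmetric
    scheme; MARLIN's actual weight layout and kernel differ.
    """
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant_matmul(x, q, scale, w_shape):
    """FP16 x INT4 mixed-precision matmul: dequantize weights, then multiply."""
    w = (q.astype(np.float16) * scale).reshape(w_shape)
    return x @ w

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float16)
x = rng.standard_normal((16, 512)).astype(np.float16)  # a batch of 16 tokens
q, s = quantize_4bit(w)
y = dequant_matmul(x, q, s, w.shape)
print(y.shape)  # (16, 512)
```

In the real kernel, dequantization is fused into the matmul so the 4-bit weights are expanded in registers rather than materialized in memory; that fusion is where the bandwidth savings, and hence the speedup, come from.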
MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models ◼ The MARLIN kernel design achieves impressive speedups for batched LLM inference on GPUs. With up to 4× acceleration for batch sizes 16-32 and significant gains even for larger batches, MARLIN… https://t.co/XU2cYfjmqA
Happy to release the write-up on the MARLIN kernel for fast LLM inference, now supporting 2:4 sparsity! Led by @elias_frantar & @RobertoL_Castro Paper: https://t.co/lT6EtMoyEY Code: https://t.co/r58fIm8zWB MARLIN is integrated with @vllm_project thanks to @neuralmagic!
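The 2:4 sparsity mentioned above is the semi-structured pattern supported in hardware on NVIDIA Ampere and later GPUs: in every contiguous group of four weights, at most two are nonzero. Below is a minimal sketch of enforcing that pattern via one-shot magnitude pruning; the pruning recipe here is an illustrative assumption, not the method used in the paper.

```python
import numpy as np

def prune_2_4(w):
    """Enforce 2:4 semi-structured sparsity by magnitude.

    Zeroes the two smallest-magnitude weights in every contiguous group
    of four. One-shot magnitude pruning is shown for illustration only.
    """
    groups = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.random.default_rng(1).standard_normal((4, 8)).astype(np.float32)
ws = prune_2_4(w)
# Every group of 4 now has at most 2 nonzeros.
assert ((ws.reshape(-1, 4) != 0).sum(axis=1) <= 2).all()
```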