Accelerating Inference: How Brium is Shaping the Future of AI across GPU Architectures
In recent years, the hardware industry has made strides towards providing viable alternatives to NVIDIA hardware for server-side inference. Solutions such as AMD’s Instinct GPUs offer strong performance characteristics, but it remains challenging to harness that performance in practice because workloads are typically tuned extensively with NVIDIA GPUs in mind. At Brium, we intend to enable efficient LLM inference across a range of hardware architectures.
We’re dedicated to enabling ML applications on a diverse set of architectures and unlocking the hardware’s capabilities through engineering choices made at every level of the stack — from model inference systems, through runtime systems and ML frameworks, to compilers.
On the software side, long-context large language model (LLM) inference has become crucial for applications ranging from video understanding to retrieval-augmented generation, code assistance, and even novel chain-of-thought approaches that enhance model accuracy.
Let’s look at Brium through a real-world example. Below we compare Brium’s inference platform with two popular inference serving solutions, vLLM and SGLang, running on AMD’s MI210 and MI300 series GPUs. The comparison shows how Brium’s stack translates into improved throughput as well as reduced latency. In inference tasks, lower latency improves application responsiveness, while higher throughput can reduce the total cost of ownership (TCO) of an inference system. Achieving both may also open the door to new AI applications.
Showcasing Real-World Performance
Here’s an example where we prompt Llama 3.1-8B to come up with novel verses in Shakespearean style by providing a large context of similar work.
On the left, the inference is running through our stack, and on the right, we’re running SGLang.
You can easily spot the difference in responsiveness! Brium’s inference platform finishes serving the request on MI210 in less than half the time while producing identical output.
Brium Inference on MI210 generating tokens:
Comparison of Inference Times:
- Brium
  - TTFT: 1m23s
  - Total Time: 3m12s
- SGLang
  - TTFT: 2m26s
  - Total Time: 7m50s
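If you want to take this kind of measurement against your own deployment, here is a minimal sketch of how TTFT and total generation time can be timed for a single long-context request against an OpenAI-compatible endpoint. The endpoint URL, model identifier, and context file below are placeholders rather than Brium’s actual configuration, and counting streamed chunks is only an approximation of counting tokens.

```python
# Minimal sketch: time TTFT and total generation for one long-context request
# sent to an OpenAI-compatible server. base_url, model name, and the context
# file are illustrative placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("shakespeare_context.txt") as f:  # large set of in-context examples
    context = f.read()

prompt = context + "\n\nWrite three novel verses in Shakespearean style."

start = time.monotonic()
ttft = None
chunks = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model identifier
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        chunks += 1  # roughly one token per streamed chunk

total = time.monotonic() - start
print(f"TTFT: {ttft:.2f}s  Total: {total:.2f}s  ~{chunks / total:.1f} tok/s")
```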
Benchmarking
The Brium team has been working on pushing the boundaries of long context LLM performance on several GPU architectures, optimizing every layer of the inference stack. Here we demonstrate results on AMD’s GPUs.
Brium’s platform is competitive with the best of the alternatives at small sequence lengths. As the sequence length increases, our advantage grows in both Time To First Token (TTFT) and throughput, while vLLM and SGLang trade places with one another depending on the metric. Our advantage is larger on more capable hardware such as the MI300.
Note: We collected the benchmarking data using the scripts in SGLang’s repository (bench_serving.py), with an OpenAI-compatible API server serving the requests in each case. Inputs were submitted deterministically and all at once, measuring offline serving capabilities. To keep comparisons fair, we normalized settings across all frameworks: prefix caching disabled, the same prefill chunk size, no quantization, and no compression of any kind.
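As a rough illustration of this “all at once” offline measurement, the sketch below issues every request immediately, records a per-request TTFT, and aggregates output-token throughput over the whole run. It mirrors the idea behind bench_serving.py rather than reproducing its exact interface; the endpoint, model name, and prompts are assumptions, and server-side options such as prefix caching are configured per framework when each server is launched.

```python
# Minimal sketch of an offline benchmark: fire all requests at once against an
# OpenAI-compatible server, record per-request TTFT, and aggregate throughput.
# Endpoint, model, and prompts are illustrative placeholders.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model identifier

async def run_one(prompt: str, t0: float):
    ttft, chunks = None, 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.monotonic() - t0  # per-request TTFT
            chunks += 1  # roughly one token per streamed chunk
    return ttft, chunks

async def main(prompts: list[str]):
    t0 = time.monotonic()
    results = await asyncio.gather(*(run_one(p, t0) for p in prompts))
    wall = time.monotonic() - t0
    ttfts = [r[0] for r in results if r[0] is not None]
    total_tokens = sum(r[1] for r in results)
    print(f"mean TTFT: {sum(ttfts) / len(ttfts):.2f}s")
    print(f"output throughput: {total_tokens / wall:.1f} tok/s")

if __name__ == "__main__":
    asyncio.run(main(["Write a verse in Shakespearean style."] * 32))
```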
Through our efforts, we have attained cutting-edge performance in both throughput and Time To First Token. This translates to improved return on investment (ROI) on hardware, with the potential to minimize total cost of ownership by reducing the number of GPUs necessary to achieve the same outcome.
Additionally, these advancements could unlock the future of AI applications, opening the door to novel use cases.
Conclusion
At Brium, we’re excited about the future of LLM inference and the possibilities it unlocks. Our commitment to innovation and optimization across diverse hardware architectures is unwavering. We believe that this approach will not only drive efficiency and cost-effectiveness but also pave the way for new AI applications. Stay connected with us as we continue to push the boundaries of what’s possible in the world of AI on a variety of hardware.
For more information, contact us at contact@brium.ai
~Brium Team