Accelerating inference with NVIDIA B200 GPUs
In the past year, AI inference workloads have become substantially more demanding. LLMs are larger than ever, with DeepSeek-R1 tipping the scales at 671 billion parameters and Meta’s upcoming Llama 4 Behemoth model promising to be three times larger.
As inference becomes mission-critical for AI-native products, latency, throughput, and cost-efficiency are essential for running applications like code generation, search, and reasoning agents in production.
To meet this demand, we introduced inference on NVIDIA B200 GPUs on Baseten. With B200s, our earliest users are already seeing:
5x higher throughput for high-traffic endpoints
More than 50% lower cost per token with throughput-optimized deployments
Up to 38% lower latency for serving the largest LLMs like DeepSeek-R1
In this piece, we will detail the technical advantages of B200 GPUs and the new use cases they unlock, from frontier LLMs to demanding workloads like video generation. NVIDIA B200 GPUs are available today on Baseten — contact us to get started running your workloads with B200s!
Performance boosts using B200s
B200 GPUs are based on NVIDIA’s current-generation Blackwell architecture and can replace Hopper GPUs (the H100 and H200) for a wide range of workloads.
Before the B200, AI-native companies had limited options for high-throughput deployments of models like DeepSeek-R1. They could use H200s, which can fit the model on a single node but are more expensive and less available, or stitch together multiple nodes of smaller H100 GPUs using multi-node inference. Now, with B200s, developers have a cost-efficient, high-performance option for large reasoning models.
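A rough back-of-envelope sketch shows why the memory math matters. With DeepSeek-R1's 671 billion parameters stored in FP8 (roughly one byte per parameter), the weights alone exceed the total memory of an 8x H100 node, while 8x H200 and 8x B200 nodes leave headroom for KV cache and activations. The per-GPU memory figures below are published specs; the one-byte-per-parameter assumption ignores KV cache and runtime overhead, so treat this as an illustration rather than a capacity plan.

```python
# Back-of-envelope: do DeepSeek-R1's FP8 weights fit on a single 8-GPU node?
# Assumes ~1 byte per parameter (FP8) and ignores KV cache, activations,
# and framework overhead -- a rough sketch, not a capacity plan.

PARAMS = 671e9                        # DeepSeek-R1 parameter count
weights_gb = PARAMS * 1 / 1e9         # ~671 GB of weights in FP8

node_memory_gb = {
    "8x H100 (80 GB each)": 8 * 80,    # 640 GB  -> weights alone don't fit
    "8x H200 (141 GB each)": 8 * 141,  # 1128 GB -> fits, with headroom
    "8x B200 (192 GB each)": 8 * 192,  # 1536 GB -> fits, with ample headroom
}

for node, total_gb in node_memory_gb.items():
    headroom = total_gb - weights_gb
    status = "fits" if headroom > 0 else "does not fit"
    print(f"{node}: {total_gb} GB total, {headroom:+.0f} GB after weights ({status})")
```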
Llama 4 Scout showcases how B200 GPUs can be a game-changer even for smaller models. While much smaller than DeepSeek-R1 at 109B parameters, Llama 4 Scout has a 10 million token context window, which requires massive amounts of memory to run efficiently. With H100 and H200 GPUs, developers could only serve a fraction of the context window. But with B200, the full context window is supported.
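To see why a 10-million-token context is so memory-hungry, a generic KV cache estimate helps. The formula below is the standard per-token KV cache size for grouped-query attention; the layer count, KV head count, and head dimension are illustrative placeholders, not Llama 4 Scout's actual configuration, so plug in real values from a model's config to use it for planning.

```python
# Generic KV cache size estimate for a long context window.
# kv_bytes_per_token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
# The architecture numbers below are illustrative placeholders, NOT official
# Llama 4 Scout specs.

layers = 48          # hypothetical layer count
kv_heads = 8         # hypothetical grouped-query KV heads
head_dim = 128       # hypothetical head dimension
bytes_per_value = 1  # FP8 KV cache

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value

for context_tokens in (128_000, 1_000_000, 10_000_000):
    total_gb = kv_bytes_per_token * context_tokens / 1e9
    print(f"{context_tokens:>12,} tokens -> ~{total_gb:,.0f} GB of KV cache")
```

Even with these modest placeholder values, a full 10-million-token context consumes hundreds of gigabytes of KV cache on top of the model weights, which is exactly where the B200's larger memory pool pays off.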
Comparing B200 hardware specs
The B200's first advantage is raw specs: 192 GB of HBM3e memory, roughly 8 TB/s of memory bandwidth, and a new generation of Tensor Cores with native FP4 support. Whether you're running compute-bound prefill or bandwidth-bound decode, the B200 offers out-of-the-box improvements over previous-generation options like the NVIDIA H200 GPU.
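For bandwidth-bound decode in particular, a simple ceiling calculation shows what the bandwidth jump buys. The sketch below divides each GPU's peak memory bandwidth by the bytes that must be read per generated token. The bandwidth figures are approximate published peaks; the single-GPU, weights-only, perfectly memory-bound model is a simplification, so read the outputs as theoretical ceilings rather than benchmark predictions.

```python
# Rough decode ceiling: tokens/sec <= memory bandwidth / bytes read per token.
# Assumes a memory-bound, single-GPU decode where each token reads all weights
# once -- a simplification that ignores KV cache reads, batching, and overlap.

peak_bandwidth_tb_s = {   # approximate published peak memory bandwidth
    "H100 SXM": 3.35,
    "H200": 4.8,
    "B200": 8.0,
}

model_weight_gb = 70      # e.g. a 70B-parameter model quantized to ~1 byte/param

for gpu, tb_s in peak_bandwidth_tb_s.items():
    ceiling_tok_s = (tb_s * 1000) / model_weight_gb   # GB/s divided by GB per token
    print(f"{gpu}: <= ~{ceiling_tok_s:.0f} tokens/sec per sequence (theoretical ceiling)")
```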
Using fast inference frameworks with B200s
Fast inference frameworks like TensorRT-LLM, SGLang, and vLLM all support inference on B200 GPUs, compounding the hardware advantage with the latest model performance techniques.
While each framework has its own pros and cons, TensorRT-LLM is particularly well-positioned to take immediate advantage of B200 architectural features. Developed in-house by NVIDIA, TensorRT-LLM compiles inference code to optimized CUDA kernels for maximum performance. With Baseten’s TensorRT-LLM Engine Builder, you can unlock this outstanding performance for popular models with a simple configuration file.
SGLang and vLLM are also excellent options for B200: SGLang excels at serving DeepSeek and Qwen models, while vLLM offers broad support for a wide range of models and architectures.
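As a concrete illustration, here is roughly what loading a large model across an 8-GPU B200 node with vLLM's offline Python API looks like. The model ID and settings are placeholders for illustration; a production deployment would typically sit behind a server and an autoscaling platform rather than this offline API.

```python
# Minimal vLLM sketch: shard a large model across 8 GPUs and generate text.
# The model ID and settings below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # placeholder; any supported checkpoint
    tensor_parallel_size=8,           # spread weights across the 8 GPUs in a node
    max_model_len=32_768,             # cap context length to bound KV cache memory
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain what makes decode bandwidth-bound."], params)
print(outputs[0].outputs[0].text)
```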
Testing FP4 quantization for efficient, accurate inference
For the last two generations of GPU architectures, we've used FP8 quantization for efficient inference with minimal quality loss. With Blackwell, NVIDIA introduced FP4 quantization, which promises further gains in throughput and latency.
Like FP8, FP4 is a floating-point format designed to preserve accuracy during quantization by maintaining dynamic range. Because it reserves bits for an exponent, FP4 can represent a wider dynamic range of values than INT4, a 4-bit integer format with evenly spaced values.
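To make that concrete, the snippet below enumerates the values representable by the E2M1 layout commonly used for FP4 (1 sign bit, 2 exponent bits, 1 mantissa bit) and compares them to plain INT4. Exact encodings and block-scaling details vary across FP4 schemes, so treat this as an illustration of the format family rather than a spec.

```python
# FP4 (E2M1) vs INT4: same 4 bits, different value distributions.
# E2M1: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
# Encoding details vary across FP4 schemes; this enumerates the basic E2M1 grid.

def e2m1_values():
    vals = set()
    for sign in (1, -1):
        for exp in range(4):            # 2 exponent bits
            for man in range(2):        # 1 mantissa bit
                if exp == 0:            # subnormal: no implicit leading 1
                    mag = (man / 2) * 2 ** (1 - 1)
                else:                   # normal: implicit leading 1
                    mag = (1 + man / 2) * 2 ** (exp - 1)
                vals.add(sign * mag)
    return sorted(vals)

print("E2M1 values:", e2m1_values())    # +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}
print("INT4 values:", list(range(-8, 8)))  # evenly spaced integers from -8 to 7
# The floating-point grid is non-uniform: fine steps near zero and coarser steps
# at larger magnitudes, which tends to suit the bell-shaped distributions of
# weights and activations better than an evenly spaced integer grid.
```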
Inference on B200s in production
While the B200 is an outstanding GPU, securing great hardware is only one step in creating a high-performance production inference service. Building a great inference platform on top of B200 GPUs takes four important components:
Model performance optimization: while B200s are powerful out of the box, efficiently leveraging fast inference frameworks and model performance techniques is essential for delivering on the promise of low-latency, cost-effective model serving at scale.
Distributed GPU infrastructure: serving high-volume workloads on B200s requires accessing GPUs from multiple cloud service providers and regions while seamlessly sharing traffic across resources.
Model management tooling: switching to B200s requires careful testing and validation ahead of production. Use detailed logs, performance metrics, and our new instance type per deployment feature to validate changes ahead of swapping production traffic.
AI engineering expertise: getting to value with any new hardware requires expertise to navigate the cutting edge of inference tooling. Baseten’s engineers have extensive experience working with B200 GPUs to create fast and stable deployments.
Get started with B200 GPUs today
Baseten has a massive supply of B200 GPUs ready to serve large-scale production workloads. Whether you’re building code generation with Qwen or reasoning agents with DeepSeek, we can accelerate your inference workloads for huge volumes of traffic. You can view list pricing for B200 GPUs on our pricing page, and we offer competitive volume discounts.
To get started with B200, let us know a bit about your workload and we’ll get you set up right away!