Why GPU utilization matters for model inference
Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.
The best open source large language model
Explore the best open source large language models of 2025 for any budget, license, and use case.
Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT
Double or triple throughput at the same or better latency by switching from A100s to H100s for model inference with TensorRT/TensorRT-LLM.
Introduction to quantizing ML models
Quantizing ML models like LLMs makes it possible to run large models on less expensive GPUs, but it must be done carefully to avoid degrading output quality.
How to benchmark image generation models like Stable Diffusion XL
Benchmarking Stable Diffusion XL performance across latency, throughput, and cost depends on factors from hardware to model variant to inference config.
Understanding performance benchmarks for LLM inference
This guide helps you interpret LLM performance metrics to make direct comparisons on latency, throughput, and cost.
Faster Mixtral inference with TensorRT-LLM and quantization
Mixtral 8x7B's mixture-of-experts architecture gives it faster inference than the similarly powerful Llama 2 70B, but we can make it even faster with TensorRT-LLM and int8 quantization.
Playground v2 vs Stable Diffusion XL 1.0 for text-to-image generation
Playground v2, a new text-to-image model, matches SDXL's speed and quality while offering a distinctive AAA game-style aesthetic. The ideal choice depends on your use case and taste in art.