Why GPU utilization matters for model inference
Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.
The best open source large language model
Explore the best open source large language models of 2025 for any budget, license, and use case.
Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT
Double or triple throughput at the same or better latency by switching from A100s to H100s for model inference with TensorRT/TensorRT-LLM.
Introduction to quantizing ML models
Quantizing ML models like LLMs makes it possible to run large models on less expensive GPUs, but it must be done carefully to avoid degrading output quality.
How to benchmark image generation models like Stable Diffusion XL
Benchmarking Stable Diffusion XL performance across latency, throughput, and cost depends on factors from hardware to model variant to inference config.
Understanding performance benchmarks for LLM inference
This guide helps you interpret LLM performance metrics to make direct comparisons on latency, throughput, and cost.
Faster Mixtral inference with TensorRT-LLM and quantization
Mixtral 8x7B's mixture-of-experts architecture gives it faster inference than the similarly powerful Llama 2 70B, but we can make it even faster with TensorRT-LLM and int8 quantization.
Playground v2 vs Stable Diffusion XL 1.0 for text-to-image generation
Playground v2, a new text-to-image model, matches SDXL's speed and quality while offering a distinctive AAA game-style aesthetic. The ideal choice depends on your use case and taste in art.