PyTorch Conference

We look forward to seeing you at PyTorch Conference!

Experience Baseten’s AI inference platform, offering industry-leading performance, security, and scale for organizations building AI products.

Visit us at Booth P2 for a demo and to get your "Artificially Intelligent" T-shirt!
Join us for lunch on October 23 at our workshop with NVIDIA: Dynamo & Dine. Save your spot here.
And don't miss our talk:
Low-Precision Inference without Quality Loss: Selective Quantization & Microscaling
Speakers: Pankaj Gupta (Co-founder, Baseten) & Philip Kiely (Head of Developer Relations, Baseten)
Wednesday, October 22, 2025
3:50pm - 4:15pm PDT
Room 2005 - 2007
Everyone wants faster inference, but no one wants to compromise the quality of their model outputs. FP8 quantization offers 30-50% lower latencies for inference on large models, but must be applied carefully to maintain quality. Recently, NVIDIA Blackwell GPUs introduced new microscaling number formats (MXFP8, MXFP4, NVFP4) and new kernel options for low-precision inference. In this talk, Baseten inference engineers will cover practical applications of quantization to quality-sensitive inference tasks with a focus on selecting which parts of the inference system to quantize (weights, activations, KV cache, attention) and how microscaling number formats help preserve dynamic range.

Related resources