Baseten Blog | Page 2
Driving model performance optimization: 2024 highlights
Baseten's model performance team works to optimize customer models for latency, throughput, quality, cost, features, and developer efficiency.
New observability features: activity logging, LLM metrics, and metrics dashboard customization
We added three new observability features for improved monitoring and debugging: an activity log, LLM metrics, and customizable metrics dashboards.
How we built production-ready speculative decoding with TensorRT-LLM
Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.
A quick introduction to speculative decoding
Speculative decoding reduces LLM inference latency by using a smaller draft model to propose several tokens ahead, which the larger target model then verifies and accepts whenever they match its own predictions.
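As a rough illustration of the draft-then-verify idea, here is a minimal greedy sketch in Python. It is not Baseten's or TensorRT-LLM's implementation: the `Model` type, `speculative_step`, `generate`, and the two toy models are hypothetical names invented for this example, and the verification loop runs sequentially, whereas a real system would score all draft positions in a single batched forward pass of the target model.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a model maps a token context to its greedy next token.
Model = Callable[[List[Token]], Token]


def speculative_step(target: Model, draft: Model,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposed token in order.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop the round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: Model, draft: Model, prompt: List[Token],
             max_new_tokens: int, k: int = 4) -> List[Token]:
    """Generate tokens by repeating speculative rounds until the budget is met."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


# Toy models for illustration only: the target counts upward modulo 10,
# and the draft agrees except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new_tokens=12, k=3))
```

With greedy verification like this, the output is the same sequence the target model would produce on its own; any speedup comes from accepting several tokens per target pass when the draft model guesses correctly.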
Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference
Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
Generally Available: The fastest, most accurate and cost-efficient Whisper transcription
At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription.
Introducing Custom Servers: Deploy production-ready model servers from Docker images
Deploy production-ready model servers on Baseten directly from any Docker image using just a YAML file.
Create custom environments for deployments on Baseten
Test and deploy ML models reliably with production-ready custom environments, persistent endpoints, and seamless CI/CD.
Introducing canary deployments on Baseten
Our canary deployments feature lets you roll out new model deployments with minimal risk to your end-user experience.