Baseten Blog
Accelerating inference with NVIDIA B200 GPUs
NVIDIA B200 GPUs lower cost and latency while raising throughput for use cases like code generation, search, reasoning, agents, and more.
Building performant embedding workflows with Chroma and Baseten
Integrate Chroma’s open-source vector database with Baseten’s fast inference engine for efficient, real-time embedding inference in your AI-native apps.
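A minimal sketch of the workflow the post describes: compute embeddings on a Baseten-hosted model, then store and query them with Chroma. The Chroma calls use the real chromadb Python client; the Baseten endpoint URL (including the placeholder model ID) and the request/response schema are assumptions that depend on the model you deploy.

```python
# Sketch: embed text via a Baseten-hosted embedding model, index it in Chroma.
import os
import requests
import chromadb

# Hypothetical model ID; the URL format and payload shape depend on your deployment.
BASETEN_URL = "https://model-xxxxxxx.api.baseten.co/production/predict"
API_KEY = os.environ["BASETEN_API_KEY"]

def embed(texts: list[str]) -> list[list[float]]:
    # Assumed I/O contract: {"texts": [...]} in, {"embeddings": [[...], ...]} out.
    resp = requests.post(
        BASETEN_URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"texts": texts},
    )
    resp.raise_for_status()
    return resp.json()["embeddings"]

docs = ["Baseten serves ML models.", "Chroma stores and queries vectors."]

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.get_or_create_collection("docs")
collection.add(
    ids=[f"doc-{i}" for i in range(len(docs))],
    documents=docs,
    embeddings=embed(docs),
)

# Query with an embedded search string.
results = collection.query(query_embeddings=embed(["model serving"]), n_results=1)
print(results["documents"])
```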
The best open-source embedding models
Discover the best open-source embedding models for search, RAG, and recommendations—curated picks for performance, speed, and cost-efficiency.
How we built high-throughput embedding, reranker, and classifier inference with TensorRT-LLM
Discover how we optimized embedding, reranker, and classifier inference using TensorRT-LLM, doubling throughput and achieving ultra-low latency at scale.
Introducing Baseten Embeddings Inference: The fastest embeddings solution available
Baseten Embeddings Inference (BEI) delivers 2x higher throughput and 10% lower latency for production embedding, reranker, and classification models at scale.
Announcing Baseten’s $75M Series C
Baseten raised a $75M Series C to power mission-critical AI inference for leading AI companies.
How multi-node inference works for massive LLMs like DeepSeek-R1
Running DeepSeek-R1 on H100 GPUs requires multi-node inference: the model weights alone need 16 H100s, more than a single eight-GPU node can hold.
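A back-of-envelope check of that GPU count, assuming DeepSeek-R1's 671B parameters in its native FP8 format (roughly one byte per parameter):

```python
# Why DeepSeek-R1 needs two 8xH100 nodes (16 GPUs).
params = 671e9          # DeepSeek-R1 total parameters (MoE)
bytes_per_param = 1     # native FP8 weights, ~1 byte per parameter
weights_gb = params * bytes_per_param / 1e9  # ~671 GB of weights

node_gb = 8 * 80        # one node: 8 x H100 with 80 GB each = 640 GB
print(weights_gb > node_gb)  # True: weights alone overflow a single node,
                             # before counting KV cache and activations

two_nodes_gb = 2 * node_gb   # 16 H100s = 1280 GB: room for weights + KV cache
```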
Testing Llama 3.3 70B inference performance on NVIDIA GH200 in Lambda Cloud
The NVIDIA GH200 Grace Hopper Superchip pairs a Hopper GPU with an Arm-based Grace CPU over a high-bandwidth NVLink-C2C interconnect.
Baseten Chains is now GA for production compound AI systems
Baseten Chains delivers ultra-low-latency compound AI at scale, with custom hardware per model and simplified model orchestration.
Private, secure DeepSeek-R1 in production in US & EU data centers
Dedicated deployments of DeepSeek-R1 and DeepSeek-V3 offer private, secure, high-performance inference that's cheaper than OpenAI.