
Inference Engineering

A book for engineers who want to understand the technologies that power every AI company and application in the world.

Inference is the most valuable category in AI, but inference engineering is still in its infancy.

Inference engineers work across the stack from CUDA to Kubernetes in pursuit of faster, less expensive, more reliable serving of generative AI models in production. While the potential and impact of inference are becoming clear, the space is young. There are relatively few people working on inference, and newcomers can become experts quickly. There are opportunities to solve novel, interesting, and deeply technical problems at all levels of the stack.

Inference Engineering is your guide to becoming an expert in inference. It contains everything that I’ve learned in four years of working at Baseten. This book is based on the hundreds of thousands of words of documentation, blogs, and talks I've written on inference; interviews with dozens of experts from our engineering team; and countless conversations with customers and builders around the world.

Chapter 0: Inference

Inference Engineering presents a map of the technologies and techniques that power inference across all three layers of runtime, infrastructure, and tooling.

Chapter 1, Prerequisites, covers the product thinking and AI engineering work that need to be done before inference engineering comes into play: use case definition, latency and cost budgeting, and selecting and evaluating which generative AI models to optimize and deploy.
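Latency and cost budgeting of the kind Chapter 1 describes often starts as back-of-envelope arithmetic. The sketch below is my own illustration of that exercise; the helper names and all numbers are assumptions, not figures from the book.

```python
# Illustrative latency and cost budgeting helpers. All names and numbers
# here are assumptions for the sketch, not figures from the book.

def request_latency_s(ttft_s, output_tokens, inter_token_s):
    """End-to-end latency: time to first token plus per-token decode time."""
    return ttft_s + output_tokens * inter_token_s

def cost_per_million_tokens(gpu_hour_usd, tokens_per_second):
    """GPU cost amortized over generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000

# Example budget: 200 ms to first token, 500 output tokens at 10 ms each,
# served on a $2/hour GPU sustaining 100 tokens/second.
latency = request_latency_s(0.2, 500, 0.01)   # ~5.2 seconds per request
cost = cost_per_million_tokens(2.0, 100)      # ~$5.56 per million tokens
```

Budgets like these determine which models and optimizations are even candidates before any inference engineering begins.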

Chapter 2, Models, introduces the technical architecture of AI models – from large language models to image and video generation models – and establishes where the bottlenecks exist for inference with a special focus on optimizing attention.
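One concrete way attention becomes a bottleneck is KV cache memory, which grows linearly with sequence length and batch size. The estimate below is a standard back-of-envelope formula; the model shape is a hypothetical example, not one taken from the book.

```python
# Rough KV cache size estimate for a decoder-only transformer: two tensors
# (keys and values) per layer, per token, across the KV attention heads.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Total KV cache size in bytes (bytes_per_value=2 for fp16/bf16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical 7B-class model: 32 layers, 8 KV heads of dimension 128,
# a 4096-token context, batch size 1, fp16 values.
size = kv_cache_bytes(32, 8, 128, 4096, 1)
print(size // (1024 * 1024), "MiB")  # 512 MiB
```

At larger batch sizes and context lengths, this cache, not the model weights, often dominates GPU memory, which is why so much inference optimization targets attention.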

Chapter 3, Hardware, starts at the spec sheet for modern GPUs and breaks down compute and memory, then disambiguates architectures and SKUs within NVIDIA’s datacenter-grade offerings before briefly surveying other accelerators on the market.

Chapter 4, Software, builds up the abstraction stack from CUDA through frameworks like PyTorch, Transformers, and Diffusers to inference engines like vLLM, SGLang, and TensorRT-LLM. It also introduces Dynamo, NVIDIA’s latest system for large-scale distributed model serving.

Chapter 5, Techniques, discusses key model performance optimization techniques adapted from cutting-edge research and applies them in production: quantization, speculative decoding, KV cache reuse, model parallelism, and disaggregation.
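To make one of these techniques concrete, here is a toy sketch of per-tensor symmetric int8 quantization in pure Python. This is an illustration of the core idea only, not an implementation from the book or any production library.

```python
# Toy per-tensor symmetric int8 quantization: scale floats so the largest
# magnitude maps to 127, round to integers, and dequantize to compare.

def quantize_int8(weights):
    """Return int8 values and the per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.63, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored weight is within half a quantization step of the original.
```

Production schemes add per-channel or per-group scales and calibration of activations, but the memory and bandwidth savings come from the same float-to-integer mapping shown here.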

Chapter 6, Modalities, expands inference engineering beyond LLMs to voice and visuals. Many types of generative AI models – vision-language models, embedding models, automatic speech recognition (ASR) models, and speech synthesis models – adapt LLM architectures, meaning inference engineers can run them with the same tools and techniques used with LLMs. Image and video generation models have their own architectures and associated performance optimization techniques.

Chapter 7, Production, concludes the book with a rundown of the important problems to solve when operating infrastructure for optimized model inference services and building performant applications on top of them.

Appendices A and B add a glossary of inference engineering terms and a collection of recommended resources for further reading, respectively.

Meet the author

Philip Kiely

Head of AI Education

Philip Kiely joined Baseten in January 2022 and is a software developer and author. He has spoken on inference at conferences like NVIDIA GTC, PyTorch Conference, AI Engineer World’s Fair, and AWS re:Invent. Outside of work, you'll find Philip practicing martial arts, reading a new book, or cheering for his adopted SF sports teams.
