
Two approaches are widely adopted for speculative decoding:
N-gram speculation, which predicts the next token based on fixed-length patterns from recent context.
Draft model speculation, like EAGLE or multi-token prediction (MTP), which uses a smaller, faster neural network to predict several tokens ahead.
At Baseten, we developed a hybrid method to effectively batch the token verification phase, significantly reducing latency for individual requests. Our method combines a suffix automaton—an advanced form of n-gram lookup—with an MTP/EAGLE draft model. This approach is particularly impactful for applications like code generation, where long, repetitive patterns are common.
We’ve integrated our technique into TensorRT-LLM, one of the fastest inference engines, and achieved these speedups with zero added overhead, making it suitable for production deployments. On production agentic coding workloads, we see up to 40% higher throughput at equal latency and up to 40% lower latency at equal throughput, compared to MTP in TensorRT-LLM alone.
Throughput per single request without speculative decoding, with multi-token prediction (MTP) in TensorRT-LLM, and with Baseten’s hybrid MTP + suffix automaton (SA) approach. Testing with nvidia/DeepSeek-V3.1-NVFP4 on the dataset glaiveai/code-edit-samples, we see 30-33% higher acceptance lengths and throughput across different batch sizes with our hybrid approach than with TRT-LLM MTP alone.
Average acceptance length without speculative decoding, with TensorRT-LLM MTP, and with Baseten’s MTP + SA hybrid approach. Testing with nvidia/DeepSeek-V3.1-NVFP4 on the dataset glaiveai/code-edit-samples, we see 34% higher acceptance lengths with our hybrid approach than with TRT-LLM MTP alone.
Suffix automaton decoding
This approach improves upon n-gram lookup decoding by using a suffix automaton (SA) for prediction lookups.
Unlike the fixed-size pattern matching available in vLLM and TensorRT-LLM’s n-gram speculative decoding, SA decoding identifies arbitrarily long patterns and selects the longest possible match. Additionally, the suffix automaton is updated in real time during generation, resulting in higher acceptance rates on long sequences.
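To make the difference concrete, here is a small, brute-force sketch of the two lookup semantics. It is illustrative only: the longest_suffix_match function below stands in for the suffix automaton, which answers the same query in amortized constant time per generated token.

```python
# Illustrative only: brute-force versions of the two lookup strategies. A real
# suffix automaton answers the longest-match query in amortized O(1) per token.

def ngram_lookup(context: list[int], n: int, k: int) -> list[int]:
    """Fixed-size speculation: key on exactly the last n tokens and propose
    the k tokens that followed their most recent earlier occurrence."""
    key = context[-n:]
    for i in range(len(context) - n - 1, -1, -1):
        if context[i:i + n] == key:
            return context[i + n:i + n + k]
    return []

def longest_suffix_match(context: list[int], k: int) -> tuple[list[int], int]:
    """SA-style speculation: find the longest suffix of the context that also
    occurs earlier, and propose the k tokens that followed that occurrence.
    Returns the draft tokens and the match length (a confidence signal)."""
    for length in range(len(context) - 1, 0, -1):
        suffix = context[-length:]
        for i in range(len(context) - length - 1, -1, -1):
            if context[i:i + length] == suffix:
                return context[i + length:i + length + k], length
    return [], 0

ctx = [9, 1, 2, 3, 4, 5, 6, 1, 2, 3, 7, 7, 9, 1, 2, 3]
print(ngram_lookup(ctx, n=3, k=4))      # [7, 7, 9, 1]: keyed on the last [1, 2, 3]
print(longest_suffix_match(ctx, k=4))   # ([4, 5, 6, 1], 4): matched [9, 1, 2, 3]
```

With n = 3, the fixed lookup keys on the last three tokens and latches onto a different earlier occurrence than the length-4 longest match, so the two methods propose different drafts; preferring the longest available match is what yields higher acceptance rates on long, repetitive sequences.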
Combining MTP and SA Decoding
SA decoding shines at code generation, where acceptance lengths reach 10+ tokens with long context, but it performs poorly on reasoning and other writing tasks, where acceptance lengths drop to near zero. Meanwhile, MTP produces consistent speed-ups across all domains, though its acceptance length is usually only 2-4 tokens per iteration.
Baseten’s Speculation Engine combines both approaches: if the SA finds a match longer than a threshold, the SA match is used as the draft; otherwise, MTP is used.
Testing on DeepSeek-V3.1-NVFP4, this achieves a significant speedup in TensorRT-LLM. The level of speedup depends heavily on the task, with a more pronounced increase on agentic coding and math tasks. We commonly see up to 40% improvements on coding applications for production workloads.
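Here is a minimal sketch of this selection rule. The function, its inputs, and the threshold value are illustrative placeholders for this post, not our production configuration.

```python
# Illustrative only: per-iteration choice between SA and MTP draft tokens.
SA_THRESHOLD = 4  # minimum SA match length before the SA continuation is trusted

def choose_draft(sa_draft: list[int], sa_match_len: int,
                 mtp_draft: list[int]) -> list[int]:
    if sa_match_len >= SA_THRESHOLD:
        # Long repeated suffix found: speculate aggressively with the SA
        # continuation (acceptance lengths of 10+ are common on code).
        return sa_draft
    # No strong repetition signal: fall back to the MTP draft model,
    # which typically yields 2-4 accepted tokens per iteration.
    return mtp_draft
```

In both cases, the target model verifies the proposed tokens in a single batched forward pass, so correctness is preserved no matter which drafter supplied them.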
Integrating with TensorRT-LLM
To integrate with the TRT-LLM runtime, requests are processed as follows:
The suffix automaton for the initial prompt is constructed on the host, overlapping with the KV-cache prefill on the device.
Before the first generation step, the automaton state is transferred to the device.
During generation, the suffix automaton is updated directly on the device, without introducing additional synchronization points.
The suffix automaton itself is a highly efficient data structure, with an amortized runtime complexity of O(1) per update. As a result, by carefully scheduling its construction and updates to avoid new synchronization points, we achieve near-zero overhead.
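For readers unfamiliar with the data structure, here is a minimal, self-contained Python sketch of a suffix automaton over token IDs. It is illustrative only; the production version is the header-only C++/CUDA implementation described below, which additionally tracks end positions so that lookups can be restricted to earlier occurrences and turned into draft continuations.

```python
# Minimal suffix automaton over token IDs, shown only to illustrate the
# structure and its amortized O(1) extend; speculation-specific bookkeeping
# (end positions, earlier-occurrence lookups) is omitted for brevity.

class SuffixAutomaton:
    def __init__(self):
        self.next = [{}]      # per-state transitions: token -> state
        self.link = [-1]      # suffix links
        self.length = [0]     # length of the longest string in each state
        self.last = 0         # state representing the entire stream so far

    def extend(self, token: int) -> None:
        """Append one token to the indexed stream. Amortized O(1) per call."""
        cur = len(self.next)
        self.next.append({})
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.length.append(self.length[p] + 1)
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def longest_prefix_match(self, query: list[int]) -> int:
        """Length of the longest prefix of `query` that occurs anywhere in
        the indexed stream."""
        state, matched = 0, 0
        for tok in query:
            if tok not in self.next[state]:
                break
            state = self.next[state][tok]
            matched += 1
        return matched

sa = SuffixAutomaton()
for tok in [5, 1, 2, 3, 9, 1, 2, 3, 4]:
    sa.extend(tok)
print(sa.longest_prefix_match([1, 2, 3, 4]))  # 4: [1, 2, 3, 4] occurs in the stream
print(sa.longest_prefix_match([2, 3, 7]))     # 2: [2, 3] occurs, [2, 3, 7] does not
```

Each extend() call may walk several suffix links, but the total link-walking over a whole stream grows only linearly with its length, which is what gives the amortized O(1) per update.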
How Baseten’s hybrid MTP + suffix automaton decoding is integrated into TensorRT-LLM across the context and decode loops. During context setup, KV prefill runs on the GPU while the initial suffix automaton state is built on the host, then transferred to the device. In the decode loop, SA states are updated in parallel on the GPU alongside MTP draft sampling and verification, enabling higher throughput with no added synchronization overhead.
To support this design, we built a Python API that exposes three core operations (check out the full API in the repository here):
add_request(request_id: int, prompt: list[int]): builds a suffix automaton state on the host.
prepare(request_ids: list[int]): prepares a GPU batch, copying newly created suffix automaton states to the device.
extend(draft_tokens_out: tensor, accepted_tokens_in: tensor): a CUDA-kernel, CUDA-graph-compatible operation that updates a batch of suffix automaton states on the GPU and returns SA draft tokens and match lengths (i.e., confidence scores). A batch of N requests is updated in parallel by launching a grid with one block per batch slot and one thread each.
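As a rough sketch of how these operations fit into a serving loop, the snippet below walks through one request. The sa_engine handle, the tensor shapes, and the assumption that extend() returns match lengths while writing draft tokens into draft_tokens_out are illustrative simplifications; see the repository for the exact API.

```python
import torch

def speculative_request(sa_engine, request_id: int, prompt: list[int],
                        draft_len: int = 4):
    # Host-side SA construction; in the TRT-LLM integration this overlaps
    # with the KV-cache prefill running on the GPU.
    sa_engine.add_request(request_id, prompt)

    # Copy the newly built SA state to the device before the first decode step.
    sa_engine.prepare([request_id])

    # Fixed-shape device buffers keep the decode step CUDA-graph compatible.
    sa_drafts = torch.zeros((1, draft_len), dtype=torch.int64, device="cuda")
    accepted = torch.zeros((1, draft_len), dtype=torch.int64, device="cuda")

    # One decode iteration: update the on-device SA state with the tokens
    # accepted last step; SA draft tokens are written into `sa_drafts`, and
    # the returned match lengths act as per-request confidence scores.
    match_lens = sa_engine.extend(draft_tokens_out=sa_drafts,
                                  accepted_tokens_in=accepted)
    return sa_drafts, match_lens
```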
extend is called before draft sampling to compute match lengths from the suffix automaton. These match lengths are compared against a threshold to decide how many draft tokens come from suffix-automaton continuations versus multi-token prediction sampling.
To achieve high performance while avoiding platform-specific core logic, suffix automaton states are represented as plain old data (POD) structs, and the core algorithm lives in a header-only implementation that compiles for both C++ and CUDA. This allows the suffix automaton logic to be written once and run on both the CPU and GPU. Additionally, the use of POD structs enables efficient, low-overhead data transfer between host and device.
This implementation demonstrates the high level of interoperability between C++ and CUDA. For example, we implement a POD graph structure along with a dynamic hash map for storing suffix automaton states, both of which are fully C++ and CUDA-compatible. By embracing POD structs, all we need for CUDA support is a C++ smart pointer type with CUDA specializations for malloc and memcpy. Finally, we achieve torch stream capture (CUDA graph) compatibility by specializing the smart pointer’s memcpy and extend() invocations to run on the active torch stream.
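To illustrate the last point, the sketch below shows how an extend() that launches on the active torch stream can be recorded into a CUDA graph with PyTorch’s standard capture API. The sa_engine handle, buffer shapes, and capture scope are assumptions for this sketch, not the exact production code.

```python
import torch

def capture_decode_step(sa_engine, batch_size: int, draft_len: int):
    # Pre-allocated, fixed-shape buffers: CUDA graphs require static memory.
    accepted = torch.zeros((batch_size, draft_len), dtype=torch.int64, device="cuda")
    sa_drafts = torch.zeros((batch_size, draft_len), dtype=torch.int64, device="cuda")

    # Warm-up on a side stream before capture (standard torch.cuda.graph practice).
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        sa_engine.extend(draft_tokens_out=sa_drafts, accepted_tokens_in=accepted)
    torch.cuda.current_stream().wait_stream(side)

    # Capture: because extend() launches its kernel on the active torch stream
    # and performs no host synchronization, it records cleanly into the graph
    # alongside MTP drafting and the target-model forward pass.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        sa_engine.extend(draft_tokens_out=sa_drafts, accepted_tokens_in=accepted)
        # ... draft sampling and the verification forward pass are captured
        # in the same graph in the real decode loop.

    # Per iteration: copy fresh inputs into `accepted`, then call graph.replay().
    return graph, accepted, sa_drafts
```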
As a result, the profile output for the decode phase shows minimal idle GPU time.
NVIDIA Nsight™ Systems profile for 10 generation iterations with DeepSeek v3.1, i.e., the time to generate 30 tokens at an average acceptance length of 3. The orange blocks are individual CUDA graph invocations, each representing a forward pass. Non-optimized systems show gaps between these blocks, e.g., when the CPU is planning while the GPU sits idle. Note that there are no gaps in this graph; the GPU is fully utilized.
We verified the absence of latency overhead by setting the SA threshold to infinity, which disables SA predictions while still executing the computation, and confirming that end-to-end latency matches that of baseline MTP. In other words, the integration itself introduces zero overhead.
Areas for further work
By augmenting draft model speculation with suffix automaton-based lookups, we achieved up to 40% efficiency gains on existing workloads that use speculative decoding. These improvements are orthogonal to other inference optimizations and, in many cases, provide additional speedups on top of existing MTP/EAGLE deployments without requiring any changes to configuration parameters such as draft length.
Looking ahead, we find several directions for future work particularly promising:
Continuous draft model training alongside inference, enabling the draft models to adapt to evolving workloads and further improve acceptance rates.
Dynamic-length speculation, where the draft length is adjusted based on speculation confidence on a per-request, per-micro-batch basis.
For more information on incorporating our hybrid MTP/SA decoding approach into your workloads, get in touch with our engineers.
Related work
Several related efforts were developed independently and explore overlapping ideas. While the implementations and goals differ, they share core concepts with the approach described here and are therefore worth highlighting:
SAM Decoding: Speculative Decoding via Suffix Automaton
A comprehensive academic treatment of suffix automaton-based speculative decoding, providing useful background and formal analysis.
Suffix-tree decoding in vLLM (agentic workflows)
An exploration of suffix-tree-based speculation in vLLM, with a particular focus on repetitive, agent-driven workloads.
DFlash: Block Diffusion for Flash Speculative Decoding
Explores a novel architecture for the draft model.


