Canopy Labs selects Baseten as preferred inference provider for Orpheus TTS models

Canopy Labs, a foundation model company operating in San Francisco, is on a mission to build digital humans that are indistinguishable from real humans. In March, they released Orpheus TTS, an open-source model for real-time lifelike speech synthesis.

Orpheus went viral, racking up over 100K downloads on Hugging Face as a top 5 trending model. As thousands of developers experimented with Orpheus, many began to ask how to run the model in production.

To help developers use Orpheus TTS in production, Canopy Labs has selected Baseten as its preferred inference provider. Baseten and Canopy have collaborated to create the world’s highest-performance Orpheus inference server based on NVIDIA’s TensorRT-LLM.

We’re excited to partner with Baseten to optimize and serve Orpheus TTS for demanding AI applications like real-time voice agents.

Amu Varma, Co-Founder, Canopy Labs

Below, we’ll dive into benchmarks, optimizations, and client code for Orpheus on Baseten. You can deploy the model in a couple of clicks from Baseten’s model library to get started building with Orpheus instantly.

Benchmarking real-time speech synthesis

Benchmarking TTS models is interesting because raw tokens per second matters less than it does for LLMs. For Orpheus, 83 tokens corresponds to roughly one second of audio. Once a stream generates 83 tokens per second, you want to increase batch size rather than token speed, supporting more real-time connections on the same hardware and reducing cost.
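
As a rough illustration (not part of the benchmark itself), the arithmetic for converting aggregate token throughput into concurrent real-time streams looks like this; the throughput figure in the example is hypothetical:

TOKENS_PER_AUDIO_SECOND = 83  # approximate Orpheus rate: ~83 tokens per second of audio

def max_realtime_streams(total_tokens_per_second: float) -> int:
    """Streams that can each be generated at least as fast as real time."""
    return int(total_tokens_per_second // TOKENS_PER_AUDIO_SECOND)

# Hypothetical engine sustaining ~1,350 tokens/s in aggregate:
print(max_realtime_streams(1350))  # 16 concurrent real-time streams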

Our original implementation of Orpheus supported 7 simultaneous real-time streams on an H100 MIG GPU – effectively half of an H100 for cost-efficient performance. After optimizing the model with TensorRT-LLM, we see a base rate of 16 simultaneous streams or up to 24 for applications with stable traffic patterns.

Baseten’s TensorRT-LLM implementation runs up to 3.5x more simultaneous real-time streams

The other key metric to optimize is time to first byte (TTFB). The first byte of streamed output is what gives users the sense of an instant response. Baseten’s Orpheus implementation achieves the gold standard of a 200-millisecond TTFB on an H100 MIG, on par with average human reaction time. On a full H100, we get below 150 milliseconds.

With a full H100 GPU, Baseten's TTFB for Orpheus drops to <150 milliseconds

There are also non-real-time use cases for Orpheus, like generating podcasts, audiobooks, YouTube videos, and more. In these cases, cost-efficient throughput matters more than individual request latency. By increasing the batch size to as many as 128 concurrent requests, you can double your token throughput for latency-insensitive tasks, effectively cutting costs in half again.
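
The cost math behind that claim is simple. As a hedged sketch with a placeholder GPU price (not a Baseten rate), doubling the real-time factor halves the cost per generated audio hour:

def cost_per_audio_hour(gpu_price_per_hour: float, realtime_factor: float) -> float:
    """Dollars per hour of generated audio; doubling the real-time factor halves it."""
    return gpu_price_per_hour / realtime_factor

# Placeholder price of $1.00/hr purely for illustration:
print(cost_per_audio_hour(1.00, 30))  # ~$0.033 per audio hour
print(cost_per_audio_hour(1.00, 60))  # ~$0.017 per audio hour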

Optimizing Orpheus TTS performance with TensorRT-LLM

How did we achieve these performance outcomes? The key strategy was applying LLM-specific tooling to a new modality.

Orpheus TTS is based on the Llama 3.2 3B model, a familiar architecture for Baseten’s model performance tooling. Using the TensorRT-LLM Engine Builder, we applied multiple optimizations, including FP8 quantization, chunked context, and Llama-specific CUDA optimizations, to improve performance.

We selected the H100 MIG GPU as the best hardware option for cost-efficient Orpheus inference. An H100 MIG is essentially half of an H100 GPU, with 40 GB of VRAM and three of the GPU’s seven compute partitions. With its Hopper architecture, the H100 MIG offers excellent performance with TensorRT-LLM and supports efficient FP8 quantization.

Because Orpheus is a small model, with only three billion parameters, the MIG’s reduced VRAM is still more than enough to hold the model weights and run large batches. While the full H100 may offer marginally better performance, for these workloads it makes more sense to split the GPU in half and run an independent replica on each half.
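
A quick back-of-envelope check (an assumption-based estimate, not a measured number) shows why 40 GB is comfortable for FP8 weights:

PARAMS = 3e9              # Orpheus is ~3B parameters
BYTES_PER_PARAM_FP8 = 1   # FP8 quantization stores roughly one byte per weight
MIG_VRAM_GB = 40

weights_gb = PARAMS * BYTES_PER_PARAM_FP8 / 1e9
print(f"~{weights_gb:.0f} GB of weights, ~{MIG_VRAM_GB - weights_gb:.0f} GB left for KV cache and activations")
# ~3 GB of weights, ~37 GB left for KV cache and activations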

Client code for real-time speech

Calling speech synthesis models, especially in real-time applications, is not as standardized as calling models in other modalities. But client code is essential for high-performance voice agents; it’s not uncommon for subtle errors in model calling to add hundreds of milliseconds of latency.

We’ve written some sample client code designed to get you started and avoid common pitfalls.

import asyncio
import aiohttp
import uuid
import time
import os
from concurrent.futures import ProcessPoolExecutor

# Configuration
MODEL = "abcd1234"  # replace with your Baseten model ID
BASETEN_HOST = f"https://model-{MODEL}.api.baseten.co/environments/production/predict"
BASETEN_API_KEY = os.environ["BASETEN_API_KEY"]
PAYLOADS_PER_PROCESS = 5000
NUM_PROCESSES = 8
MAX_REQUESTS_PER_PROCESS = 1  # concurrent in-flight requests per worker process

# Sample prompts
prompts = [
    """Hello there.
Thank you for calling our support line.
My name is Sarah and I'll be helping you today.
Could you please provide your account number and tell me what issue you're experiencing?"""
]
prompt_types = ["short", "medium", "long"]  # labels for prompts (only the first is used with one sample prompt)

base_request_payload = {
    "max_tokens": 4096,
    "voice": "tara",
    "stop_token_ids": [128258, 128009],
}


async def stream_to_buffer(
    session: aiohttp.ClientSession, label: str, payload: dict
) -> bytes:
    """Send one streaming request, accumulate into bytes, and log timings."""
    req_id = str(uuid.uuid4())
    payload = {**payload, "request_id": req_id}

    t0 = time.perf_counter()

    try:
        async with session.post(
            BASETEN_HOST,
            json=payload,
            headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
        ) as resp:
            if resp.status != 200:
                print(f"[{label}] ← HTTP {resp.status}")
                return b""

            buf = bytearray()
            idx = 0
            # Stream audio chunks as they arrive; the first chunk marks TTFB
            async for chunk in resp.content.iter_chunked(4_096):
                elapsed_ms = (time.perf_counter() - t0) * 1_000
                if idx == 0:
                    print(
                        f"[{label}] ← chunk#{idx} ({len(chunk)} B) @ {elapsed_ms:.1f} ms"
                    )
                buf.extend(chunk)
                idx += 1

            total_s = time.perf_counter() - t0
            print(f"[{label}] ← done {len(buf)} B in {total_s:.2f}s")
            return bytes(buf)

    except Exception as e:
        print(f"[{label}] ⚠️ exception: {e!r}")
        return b""


async def run_session(
    session: aiohttp.ClientSession,
    prompt: str,
    ptype: str,
    run_id: int,
    semaphore: asyncio.Semaphore,
) -> None:
    """Wrap a single prompt run in its own error-safe block."""
    label = f"{ptype}_run{run_id}"
    async with semaphore:
        try:
            payload = {**base_request_payload, "prompt": f"Chapter {run_id}: {prompt}"}
            buf = await stream_to_buffer(session, label, payload)
            if run_id < 3 and buf:
                fn = f"output_{ptype}_run{run_id}.wav"
                with open(fn, "wb") as f:
                    f.write(buf)
                print(f"[{label}] ➔ saved {fn}")

        except Exception as e:
            print(f"[{label}] 🛑 failed: {e!r}")


async def run_with_offset(offset: int) -> None:
    semph = asyncio.Semaphore(MAX_REQUESTS_PER_PROCESS)
    connector = aiohttp.TCPConnector(limit_per_host=128, limit=128)
    async with aiohttp.ClientSession(connector=connector) as session:
        # warmup once per worker
        await run_session(session, "warmup", "warmup", 90 + offset, semph)

        tasks = []
        for i, prompt in enumerate(prompts):
            ptype = prompt_types[i]
            print(f"\nWorker@offset {offset}: {ptype} prompt starts…")
            for run_id in range(offset, offset + PAYLOADS_PER_PROCESS):
                tasks.append(run_session(session, prompt, ptype, run_id, semph))

        await asyncio.gather(*tasks)
        print(f"Worker@offset {offset} ✅ all done.")


def run_with_offset_sync(offset: int) -> None:
    try:
        # create and run a fresh event loop in each process
        asyncio.run(run_with_offset(offset))
    except Exception as e:
        print(f"Worker@offset {offset} ❌ error: {e}")


def main():
    offsets = [i * PAYLOADS_PER_PROCESS for i in range(NUM_PROCESSES)]
    with ProcessPoolExecutor() as exe:
        # map each offset to its own process
        exe.map(run_with_offset_sync, offsets)

    print("🎉 All processes completed.")


if __name__ == "__main__":
    main()

One key aspect of this client code is session re-use. Establishing a new session introduces overhead, sometimes as much as a hundred milliseconds, which substantially impacts observed TTFB. With session re-use, the vast majority of requests skip this bottleneck and run at the expected latency.
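
As a minimal sketch of that pattern (separate from the benchmark script above, with a placeholder URL and payloads), compare creating a session per request with sharing one session across requests:

import aiohttp

async def without_reuse(url: str, payloads: list[dict]) -> None:
    # Anti-pattern: a new ClientSession (and TCP/TLS handshake) for every request.
    for payload in payloads:
        async with aiohttp.ClientSession() as session:
            async with session.post(url, json=payload) as resp:
                await resp.read()

async def with_reuse(url: str, payloads: list[dict]) -> None:
    # One session and connection pool; subsequent requests skip connection setup.
    async with aiohttp.ClientSession() as session:
        for payload in payloads:
            async with session.post(url, json=payload) as resp:
                await resp.read()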

For more details on calling Orpheus, check out our sample code in the Orpheus TTS repo.

Run Orpheus TTS in production

Canopy Labs chose Baseten as its preferred inference platform for Orpheus TTS thanks to Baseten’s TensorRT-LLM model performance optimizations, scalable infrastructure, and feature-rich model management tooling. 

On a single H100 MIG, you can run Orpheus with:

  • 16 concurrent live connections with variable traffic

  • 24 concurrent live connections with stable traffic

  • Up to 60x real-time factor for bulk jobs

Orpheus TTS is a fast, configurable, cost-efficient option for speech synthesis in large-scale voice agents. To start building with Orpheus, deploy the pre-optimized model from the model library and adapt the client code example above.