
Nearly every AI-native company is building with voice. But trying to ship a voice application by cobbling together closed-source APIs comes with skyrocketing per-minute costs and latency walls as the application gains traction and scale.
Thankfully, developers now have stellar alternatives. Today, open-source models for each stage of the pipeline (speech-to-text, LLMs, and text-to-speech) have reached frontier quality. Any developer can now scale real-time voice AI applications in production with SOTA quality, latency, and economics, while owning their entire technology stack.
Whether you’re building accurate and customized customer support agents, automated phone-based ordering for restaurants, or even an AI SDR for your industry, this tutorial covers how to build a robust voice agent with access to custom context.

We use 3 core libraries in building this project:
LiveKit Agents, an orchestration framework for multimodal and voice AI, with thoughtful features like turn detection, noise cancellation, and interruption handling.
Baseten, an AI inference platform that enables you to run highly performant, production-grade open-source models.
LlamaIndex, an orchestration framework for building robust RAG over complex documents and agentic workflows in the enterprise.
Prerequisites
Before getting started, you'll need:
1. Baseten Account: Sign up at baseten.co.
2. LiveKit Account: Sign up at cloud.livekit.io.
3. Python 3.10+: Ensure you have Python installed on your system. We suggest using pyenv to manage versions and starting in a fresh virtual environment.
Step 1: Deploy the models
To build a voice agent, you need three models working together:
STT (Speech-to-Text): This model listens to the user and transcribes their speech. We’ll use a custom implementation of Whisper with support for streaming input and output, reducing latency.
LLM (Large Language Model): This model handles the “thinking and reasoning” stage of the voice agent, responding to user queries. We’ll use DeepSeek V3, a frontier open source language model.
TTS (Text-to-Speech): This model speaks back to the user. We’ll use Orpheus, a leading open-source speech synthesis model that sounds natural.
To get started, create dedicated deployments of the STT and TTS models from our model library. We suggest using the default deployment hardware for these models (H100 MIGs), where the combination of the TensorRT-LLM framework and the Hopper architecture ensures high performance:
Then add a Model API for DeepSeek V3, which lets you pay per token for access to one of the largest and highest-performing open-source LLMs.
After these have been successfully deployed, your Baseten models dashboard should look like this:

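Optionally, you can sanity-check the DeepSeek V3 Model API from Python before wiring up the agent. This is not part of the repo, just a quick sketch using the standard OpenAI client against Baseten's OpenAI-compatible endpoint:
import os
from openai import OpenAI

# Point the OpenAI client at Baseten's OpenAI-compatible Model APIs endpoint.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://inference.baseten.co/v1",
)

# One short chat completion confirms the DeepSeek V3 Model API is reachable.
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)
If this prints a greeting, your API key and the Model API are working.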
Step 2: Set up the voice agent repository
Run the following commands to clone the repo and install dependencies.
git clone https://github.com/basetenlabs/voice-agent-baseten.git
cd voice-agent-baseten
pip install -r requirements.txt
pnpm install
Grab API keys from your LiveKit and Baseten account settings; see .env.example and create your own .env.local:
# Environment variables needed to connect to the LiveKit server.
LIVEKIT_API_KEY=<LIVEKIT_API_KEY>
LIVEKIT_API_SECRET=<LIVEKIT_API_SECRET>
LIVEKIT_URL=<LIVEKIT_URL>
# Baseten required environment variables
BASETEN_API_KEY=<BASETEN_API_KEY>
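The agent reads these variables at runtime. If you want to access them from your own Python script, a minimal sketch (assuming the python-dotenv package is installed) looks like this:
import os
from dotenv import load_dotenv

# Load LiveKit and Baseten credentials from .env.local into the process environment.
load_dotenv(".env.local")
baseten_api_key = os.environ["BASETEN_API_KEY"]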
Replace the two endpoints in baseten_rag_agent.py with the endpoints from your Baseten account: change line 114 to your actual Whisper endpoint (WebSocket format) and line 123 to your Orpheus endpoint (HTTPS format). You can find these under each dedicated deployment.

The DeepSeek LLM is already set up to point to the correct Model APIs endpoint, so no changes are required.
Step 3: Run the agent
[Optional] Open scrape_doc.py and replace `BASE_URL` with the docs site you want to perform RAG on, then run `python scrape_docs.py` to scrape a set of live docs based on its sitemap. Alternatively, manually fill the `data` folder with .txt files. We have prefilled the data folder with the Baseten API docs as an example.
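If you're curious what such a scraper might look like, here's a rough sketch (not the repo's scrape script): it fetches a sitemap, downloads each page, and saves the text as .txt files into the data folder. BASE_URL and the dependencies (requests, beautifulsoup4) are assumptions for illustration.
import pathlib
import re

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://docs.example.com"  # replace with your own docs site
DATA_DIR = pathlib.Path("data")
DATA_DIR.mkdir(exist_ok=True)

# Pull every page URL out of the sitemap.
sitemap = requests.get(f"{BASE_URL}/sitemap.xml", timeout=30).text
urls = re.findall(r"<loc>(.*?)</loc>", sitemap)

for url in urls:
    # Download the page and strip it down to plain text.
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
    name = url.rstrip("/").split("/")[-1] or "index"
    (DATA_DIR / f"{name}.txt").write_text(text)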
Open two terminal sessions in the agent-start-react-baseten folder. In the first, run:
pnpm dev
to launch the Next.js web app. In the second, run:
python baseten_rag.py dev
to connect the voice agent to the LiveKit Cloud room.
Visit http://localhost:3000 and click ‘smart demo’ or ‘fast demo’ to enter the same LiveKit Cloud room as the voice agent.
Now you can start chatting in real-time!
How it works
Building the core voice loop
The core of our voice agent is the `entrypoint` function that orchestrates the entire pipeline. Let's break down how it connects to LiveKit and strings together the three AI models.
The `entrypoint` function serves as the main orchestrator that:
Establishes connection: First connects to LiveKit using `ctx.connect()` with custom SSL configuration for secure communication
Creates the agent: Instantiates an `Agent` with all the necessary components:
Instructions: Defines the agent's personality and behavior, emphasizing plain text responses suitable for voice interaction
Tools: Includes the `query_info` function for RAG capabilities
VAD (Voice Activity Detection): Uses Silero VAD to detect when users are speaking
STT (Speech-to-Text): Baseten's WhisperV3 model converts speech to text via WebSocket
LLM (Language Model): DeepSeekV3 handles reasoning and response generation
TTS (Text-to-Speech): Orpheus3B converts the AI's text responses back to natural speech
Manages session: Creates an `AgentSession` and starts it with the agent and room context
Initiates interaction: Sends an initial greeting to begin the conversation
This creates a complete voice AI pipeline where audio flows from the user through STT → LLM → TTS back to the user, with RAG capabilities available through the `query_info` tool, which we will now get into!

The entrypoint function:
agent = Agent(
    instructions="...",
    tools=[query_info],
    vad=silero.VAD.load(),
    stt=baseten.STT(
        api_key=baseten_api_key,
        model_endpoint="wss://model-xxxxxxx.api.baseten.co/v1/websocket"
    ),
    llm=openai.LLM(
        api_key=baseten_api_key,
        base_url="https://inference.baseten.co/v1",
        model="deepseek-ai/DeepSeek-V3-0324",
    ),
    tts=baseten.TTS(
        api_key=baseten_api_key,
        model_endpoint="https://model-xxxxxxx.api.baseten.co/environments/production/predict",
    ),
)
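The snippet above covers the agent's components. The rest of the entrypoint described earlier (connecting, starting the session, and sending the initial greeting) looks roughly like the following sketch, based on LiveKit Agents' AgentSession API; exact details may differ slightly from the repo:
from livekit.agents import Agent, AgentSession, JobContext

async def entrypoint(ctx: JobContext):
    # Connect this worker to the LiveKit room.
    await ctx.connect()

    agent = Agent(...)  # the Agent configured with STT, LLM, TTS, and tools as shown above

    # Start a session that runs the STT -> LLM -> TTS loop against the room's audio.
    session = AgentSession()
    await session.start(agent=agent, room=ctx.room)

    # Kick off the conversation with an initial greeting.
    await session.generate_reply(instructions="Greet the user and offer your assistance.")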
Adding context
Indexing
To reduce hallucination, we’ll want to implement RAG. RAG grounds the LLM’s responses by supplying it with relevant information from a source of truth, such as a set of documents, before it answers the user’s question.
We use HuggingFace’s `BAAI/bge-small-en-v1.5`, a lightweight embedding model that runs locally, so it doesn’t introduce additional latency.
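Loading it with LlamaIndex is a one-liner (a minimal sketch, assuming the llama-index-embeddings-huggingface package is installed):
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Runs locally, so embedding queries adds no extra network round trip.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")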
Now create a vector index from the documents in the data folder if one doesn't exist, or load the existing index from storage for efficient semantic search and retrieval. (This assumes you've added .txt documents to the data folder, as in Step 3 above.)
if not PERSIST_DIR.exists():
    # load the documents and create the index
    documents = SimpleDirectoryReader(THIS_DIR / "data").load_data()
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
    # store it for later
    index.storage_context.persist(persist_dir=str(PERSIST_DIR))
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=str(PERSIST_DIR))
    index = load_index_from_storage(storage_context, embed_model=embed_model)
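To confirm the index works before wiring it into the agent, you can retrieve the top chunks for a test query. This is an optional sketch, not part of the repo:
# Retrieve the three most relevant chunks for a sample question about the indexed docs.
retriever = index.as_retriever(similarity_top_k=3)
for node_with_score in retriever.retrieve("How do I authenticate with the Baseten API?"):
    print(node_with_score.score, node_with_score.node.get_content()[:80])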
Retrieval
To retrieve documents, use a decorator from LiveKit that transforms a regular Python function into a tool the voice agent can call when needed. For example, if the user asks a question about Baseten, the agent can call query_info to retrieve the relevant context from our previously built vector index, which our DeepSeek endpoint then synthesizes into a response.
@llm.function_tool
async def query_info(query: str) -> str:
    # Set up the LLM with Baseten endpoint
    baseten_deepseek = OpenAILike(
        api_key=baseten_api_key,
        api_base="https://inference.baseten.co/v1",
        model="deepseek-ai/DeepSeek-V3-0324",
        is_chat_model=True,
    )

    # Use the simple query_engine pattern with custom system prompt
    query_engine = index.as_query_engine(
        use_async=True,
        llm=baseten_deepseek,
        system_prompt="..."
    )
    res = await query_engine.aquery(query)
    print("Query result:", res)
    return str(res)
Scaling voice agents in production
Congrats! You’ve now created a fully functional voice agent that can handle customer support tickets or process pizza delivery orders. The best part: with state-of-the-art open-source models, you can not only perform RAG on your domain’s data, but also fine-tune the base model on your own data for higher-quality responses.
In this tutorial, we built a real-time voice loop with custom context via RAG, using frontier open-source models while maintaining low latency.
Whatever you’re building, you can use our voice agent starter kit to accelerate your time to production. We look forward to seeing what you build!
Special thank you to our friends at LiveKit for supporting our integration. You can find a number of LiveKit voice agent examples on their recipes page, which can be adapted to use open-source models deployed on Baseten, including the agent starter kit we adapted for this demo.