Large language models make it easy to generate high-quality conversational text. The challenge usually appears when you try to turn that text into speech.
Traditional text-to-speech pipelines often require generating the entire audio file before playback begins. That introduces buffering, additional infrastructure, and latency that can easily break the flow of a real-time conversation. For voice agents, even small delays make the interaction feel slow and unnatural.
As a result, developers often build complex streaming systems simply to deliver audio fast enough for conversational use cases.
Streaming TTS changes that architecture. Instead of waiting for a full audio response, speech is generated incrementally and streamed to the client in small chunks. The agent can start speaking almost immediately while the rest of the response is still being produced.
In this tutorial, we’ll build a real-time multilingual voice agent in Python using Async’s streaming TTS API, which supports more than 500 voices across 15 languages and delivers speech with around 300 ms latency.
What is a multilingual voice agent?
A multilingual voice agent is an AI system that can understand and respond to users using speech across multiple languages. It typically combines speech recognition, a language model, and text-to-speech. For these systems to feel natural, responses must begin quickly, which makes low-latency streaming TTS essential.
Voice interfaces are becoming common across AI assistants, support automation, and conversational apps. Users expect responses to start almost immediately. Traditional TTS pipelines often wait for the full text response before generating audio, which introduces noticeable delays in voice interactions.
The latency problem in voice AI
Voice conversations depend on tight timing. In natural dialogue, responses typically start within a few hundred milliseconds. When a voice assistant pauses too long before speaking, the interaction quickly feels slow or robotic.
Traditional TTS systems add latency because they generate the full audio output before playback begins. When responses come from LLMs, longer answers mean more text to synthesize before anything can play, which compounds the delay.
Why streaming TTS solves the problem
Streaming TTS changes how speech is generated. Instead of waiting for the full text response, the system starts synthesizing speech as soon as the first tokens arrive from the LLM. Those tokens are converted into low-latency audio chunks and streamed to the client in real time.
The result is simple: your voice agent can start speaking almost immediately, which keeps the conversational flow intact.
What we’re building in this tutorial
In this guide, we’ll build a multilingual voice agent using Python and Async’s streaming TTS API. The goal is simple: turn LLM responses into speech instantly so your application behaves like a real conversational system.
Instead of generating full audio files, the system will use real-time text-to-speech to stream audio as soon as the language model produces output. This approach allows a voice AI agent to begin speaking almost immediately, which keeps conversations responsive.
By the end of this tutorial, you’ll have a working voice pipeline that can power an AI voice assistant capable of responding naturally and switching between languages.
Voice agent capabilities
The voice AI agent we build will:
• receive responses from an LLM
• convert responses into speech using streaming TTS
• deliver real-time text-to-speech audio to the user
• support multiple languages and voices
This setup reflects how modern conversational systems connect LLM outputs directly to real-time speech generation.
Example use cases
Once this pipeline is in place, the same architecture can power many types of applications, including:
• AI voice assistants that respond conversationally
• customer support voice agents for automation
• voice-enabled apps for mobile or web platforms
• gaming NPC dialogue generated dynamically by an LLM
• education platforms with interactive voice tutors
Because the speech pipeline is built on streaming TTS, these systems can respond naturally while maintaining low latency.
Architecture of a real-time voice AI agent
A typical voice AI agent connects several components that process speech, generate responses, and deliver audio back to the user. At a high level, the system converts spoken input into text, uses a language model to generate a response, and then turns that response into speech using streaming TTS.
Voice pipeline overview
A common voice pipeline looks like this:
User → STT → LLM → Async Streaming TTS → Audio Output
• User: The interaction begins with spoken input.
• Speech-to-Text (STT): Transcribes the user’s speech into text.
• LLM: Generates a response based on the input and conversation context.
• Async Streaming TTS: Converts the generated text into speech.
• Audio Output: Streams the generated audio back to the user.
This pipeline forms the foundation of many modern AI voice assistants and conversational applications.
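To make the flow concrete, the stages above can be sketched as a chain of async functions. The STT, LLM, and TTS stages below are placeholders for illustration, not real service calls:

```python
import asyncio

async def transcribe(audio: bytes) -> str:
    # Placeholder STT stage: a real system would call a speech-to-text service.
    return "Hello, what's the weather like?"

async def generate_reply(text: str) -> str:
    # Placeholder LLM stage: a real system would stream tokens from a model.
    return f"You said: {text}"

async def synthesize(text: str) -> bytes:
    # Placeholder TTS stage: a real system would stream audio chunks from Async.
    return text.encode("utf-8")

async def handle_turn(audio_in: bytes) -> bytes:
    # One conversational turn: User -> STT -> LLM -> TTS -> Audio Output.
    text = await transcribe(audio_in)
    reply = await generate_reply(text)
    return await synthesize(reply)

audio_out = asyncio.run(handle_turn(b"raw microphone audio"))
```

In a real agent, each of these placeholders becomes a streaming call, but the shape of the pipeline stays the same.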
How streaming speech generation works
In a streaming setup, speech generation begins as soon as the language model starts producing text.
Instead of waiting for the entire response, the LLM outputs tokens progressively. These tokens are sent to the TTS system, which converts them into small audio segments and streams them to the client.
Because audio is delivered incrementally, the application can start playback immediately while the rest of the response continues to generate.
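One common way to implement this is to buffer the LLM's token stream and flush it to the TTS engine at sentence boundaries. The sketch below uses a simulated token stream, and the punctuation-based boundary rule is a simplification:

```python
def iter_sentences(tokens):
    # Buffer incoming tokens and flush at sentence boundaries so the
    # TTS engine can start speaking before the full response exists.
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence

# Simulated LLM token stream for illustration.
tokens = ["Hel", "lo", " there", ".", " How", " can", " I", " help", "?"]
chunks = list(iter_sentences(tokens))
```

Each yielded chunk would be sent to the TTS engine as soon as it is complete, rather than waiting for the whole response.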
Quick setup: getting started with Async
To build a multilingual voice agent, you first need access to the Async Voice API, which provides real-time text-to-speech through a WebSocket streaming interface. The setup is straightforward and only takes a few minutes.
Create an Async account
Start by creating an account on the Async platform. This gives you access to the developer dashboard, where you can manage API keys, explore available voices, and test the real-time text-to-speech capabilities.
After signing up, you’ll be able to access the developer console and begin integrating the voice AI agent pipeline into your application.
Generate an API key
Once your account is ready, generate an API key from the developer dashboard. The API key is used to authenticate requests when connecting to the Async streaming endpoint.
You’ll include this key in your application when establishing the WebSocket connection for streaming TTS.
Install dependencies
For this tutorial, we’ll use Python to connect to the Async streaming API. Install the required dependencies using pip:
pip install websockets numpy sounddevice
The websockets library handles the connection to the Async streaming endpoint, while numpy and sounddevice decode and play the incoming audio chunks in real time. In the next section, we’ll use them to start building the voice agent.
Hands-on: Building the voice agent (Python Tutorial)
Now let’s connect everything and build the core of the voice pipeline.
The full example can run in roughly 100 lines of Python. It uses a WebSocket connection to stream audio in real time and play it immediately on the client.
Connecting to the Async streaming endpoint
First, establish a WebSocket connection to the Async streaming TTS endpoint. During initialization, you provide your API key, select a voice, and define the output audio format.
import asyncio
import websockets
import json
import base64
import numpy as np
import sounddevice as sd

API_KEY = "your_api_key"
WS_URL = "wss://api.async.com/text_to_speech/websocket/ws"

async def connect_tts():
    async with websockets.connect(
        WS_URL,
        extra_headers={"x-api-key": API_KEY, "version": "v1"}
    ) as ws:
        init_message = {
            "model_id": "async_flash_v1.0",
            "voice": {"mode": "id", "id": "default_voice_id"},
            "output_format": {
                "container": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 24000
            }
        }
        await ws.send(json.dumps(init_message))
        # Connection is now ready to send text and receive audio
Once the connection is initialized, the application can start sending text to the streaming TTS engine and receiving audio output in real time.
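The exact shape of the text message is defined by the Async API reference; the helper below assumes a hypothetical textInput message type for illustration, so check the docs for the real field names:

```python
import json

def make_text_message(text: str, is_final: bool = False) -> str:
    # Hypothetical message shape -- the field names here are assumptions,
    # not confirmed field names from the Async API.
    return json.dumps({"type": "textInput", "text": text, "final": is_final})

# Inside the open connection from connect_tts():
# await ws.send(make_text_message("Hello!", is_final=True))
```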
Streaming audio playback
The Async API returns audio chunks encoded in base64. Each chunk represents a small segment of speech generated by the TTS model.
To play the audio immediately, you decode the chunk, convert it into a NumPy array, and send it to the audio device.
For simplicity, the example below uses sd.play() to demonstrate real-time playback. In production systems, developers typically use a buffered audio stream or audio queue to avoid restarting playback for every chunk.
async for message in ws:
    data = json.loads(message)
    if data["type"] == "audioOutput":
        audio_chunk = base64.b64decode(data["audio"])
        audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
        sd.play(audio_array, samplerate=24000)
Because the audio arrives incrementally, playback can begin right away instead of waiting for a full audio file.
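As noted above, a production setup would push decoded chunks onto a queue and write them to one persistent output stream rather than calling sd.play() per chunk. A minimal sketch:

```python
import queue
import numpy as np

audio_queue: queue.Queue = queue.Queue()

def enqueue_chunk(raw: bytes) -> None:
    # Decoded PCM bytes go onto the queue instead of restarting playback.
    audio_queue.put(np.frombuffer(raw, dtype=np.int16))

def playback_worker() -> None:
    # Run in a background thread: write every chunk to one persistent
    # stream so playback never restarts between chunks.
    import sounddevice as sd  # imported here so the sketch loads without audio hardware
    with sd.OutputStream(samplerate=24000, channels=1, dtype="int16") as stream:
        while True:
            chunk = audio_queue.get()
            if chunk is None:  # sentinel: end of response
                break
            stream.write(chunk)
```

The receive loop then calls enqueue_chunk() for each decoded message while the worker drains the queue into the audio device.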
Adding multilingual support
One advantage of building a multilingual voice agent is that the same speech pipeline can support multiple languages without changing the overall architecture. The application can select different voices or language configurations depending on the user’s request or the context of the conversation.
In some systems, the text-to-speech engine can also apply automatic language detection when the language is not explicitly specified, allowing the voice agent to generate speech in the appropriate language based on the input text.
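One way to wire this up on the client side is to pick a voice based on the detected language of the outgoing text. The detector below is a toy keyword heuristic and the voice IDs are placeholders; a real system would use a proper language-identification library or the engine's built-in detection:

```python
# Placeholder voice IDs mapped to language codes (assumptions for illustration).
VOICE_BY_LANGUAGE = {
    "en": "default_voice_id",
    "es": "spanish_voice_id",
}

def detect_language(text: str) -> str:
    # Toy stand-in for real language identification.
    spanish_markers = ("¿", "¡", " el ", " la ", " está ")
    return "es" if any(marker in text for marker in spanish_markers) else "en"

def pick_voice(text: str) -> str:
    # Fall back to the default voice for unmapped languages.
    return VOICE_BY_LANGUAGE.get(detect_language(text), "default_voice_id")
```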
Switching voices and languages
Language switching usually happens at the voice configuration level. When initializing the TTS connection, you can specify a different voice or language depending on the context of the conversation.
For example, your application might detect the user’s language automatically or allow users to choose their preferred voice.
init_message = {
    "model_id": "async_flash_v1.0",
    "voice": {
        "mode": "id",
        "id": "spanish_voice_id"
    }
}
By updating the voice or language parameters, the same streaming TTS pipeline can generate speech in different languages without modifying the rest of the system.
Use cases for multilingual voice agents
Supporting multiple languages allows the same voice AI agent architecture to serve a global audience.
Common applications include:
• Global AI assistants that interact with users in their native language
• Multilingual support bots handling customer conversations across regions
• Real-time translation tools for spoken communication
• International education platforms with voice-based learning assistants
With a flexible speech pipeline in place, adding new languages often becomes a configuration change rather than a full system redesign.
Performance and latency considerations
When building a voice AI agent, responsiveness becomes one of the most important factors in user experience.
Streaming TTS improves this by starting audio generation immediately and delivering speech progressively. Instead of waiting for a full audio file, the system streams audio as it’s produced, allowing the voice agent to begin speaking almost right away. The trade-off between latency and audio quality is explored in the TTS latency vs quality benchmark comparing modern speech synthesis systems.
Time-to-first-byte
Time-to-first-byte (TTFB) refers to how long it takes for the first audio data to arrive after a request is sent to the TTS system.
In traditional pipelines, TTFB can be high because the entire audio response must be synthesized before anything is returned. With real-time text-to-speech, the first audio chunk can be generated as soon as the initial text tokens are available.
Lower TTFB allows voice responses to start much faster.
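TTFB is straightforward to measure yourself: record a timestamp when the request is sent and again when the first audio chunk arrives. The sketch below times a simulated stream; in practice you would pass the WebSocket message iterator instead:

```python
import asyncio
import time

async def time_to_first_chunk(chunks):
    # Time from request to the arrival of the first audio chunk.
    start = time.monotonic()
    async for chunk in chunks:
        return time.monotonic() - start, chunk
    return None, None

async def fake_stream():
    # Simulated TTS stream that takes ~50 ms to produce its first chunk.
    await asyncio.sleep(0.05)
    yield b"first-chunk"

ttfb, first = asyncio.run(time_to_first_chunk(fake_stream()))
print(f"TTFB: {ttfb * 1000:.0f} ms")
```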
Conversational latency
Conversational systems depend on tight response timing. In human dialogue, pauses are usually short, and longer delays make interactions feel unnatural.
Streaming TTS helps reduce conversational latency because speech generation begins while the rest of the response is still being produced. The voice agent doesn’t need to wait for the entire response before starting playback.
Streaming audio delivery
Instead of delivering a single audio file, streaming TTS sends small audio chunks continuously to the client. These chunks can be played immediately as they arrive.
This progressive delivery keeps audio playback smooth and prevents large buffering delays during longer responses.
Scalability for concurrent sessions
Another advantage of streaming architectures is that they can scale more efficiently across multiple conversations.
Each voice session runs independently through the streaming pipeline, allowing multiple users to interact with the system simultaneously. This makes it easier to support production use cases such as AI voice assistants or customer support agents handling many conversations at once.
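Because each session is just an independent coroutine, serving several users concurrently is a matter of scheduling them on the same event loop. A simplified sketch with placeholder sessions:

```python
import asyncio

async def run_session(session_id: int, text: str) -> str:
    # Placeholder for one full voice session (connect, stream text, play audio).
    await asyncio.sleep(0.01)  # stands in for network and synthesis time
    return f"session {session_id} spoke: {text}"

async def main():
    # The sessions run concurrently on one event loop, not sequentially.
    return await asyncio.gather(
        run_session(1, "Hello"),
        run_session(2, "Hola"),
        run_session(3, "Bonjour"),
    )

results = asyncio.run(main())
```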
Possible extensions for production voice agents
Once the streaming TTS pipeline is in place, you can extend the system in several directions depending on the type of application you’re building.
Many teams start with a basic voice AI agent like the one in this guide and then integrate additional infrastructure for real-time communication, browser interfaces, or telephony.
Integrating with real-time voice frameworks
Frameworks such as LiveKit or Pipecat can manage real-time audio streaming, session handling, and media routing between users and AI agents.
In this setup, the framework handles microphone input and audio transport while the streaming TTS system generates speech responses from the LLM. This makes it easier to build scalable voice applications that support multiple concurrent users.
Building browser voice chat applications
The same pipeline can power voice chat experiences directly in the browser. A web client can capture microphone input, send it to the backend for transcription and LLM processing, and receive streamed audio responses from the TTS engine.
This approach is commonly used for AI voice assistants, voice chatbots, and interactive conversational tools.
Connecting to phone systems
Voice agents can also be connected to telephony platforms such as Twilio. In this case, incoming phone calls are transcribed, processed by the LLM, and then converted into speech using the TTS pipeline.
This allows companies to build automated voice support systems or AI-powered call assistants.
Adding interruption handling
In real conversations, users often interrupt the assistant while it is speaking. Production voice agents typically include interruption handling so the system can stop playback, process the new input, and respond immediately.
Handling interruptions helps maintain a natural conversational flow and improves the overall usability of the voice interface.
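At its core, interruption handling means cancelling the in-flight playback task when new user input arrives. A minimal asyncio sketch with a simulated playback stage:

```python
import asyncio

async def speak(text: str) -> str:
    # Stands in for streaming playback of a full TTS response.
    try:
        await asyncio.sleep(1.0)  # playback in progress
        return "finished"
    except asyncio.CancelledError:
        # A real agent would also flush the audio output buffer here.
        return "interrupted"

async def conversation() -> str:
    playback = asyncio.create_task(speak("a long answer"))
    await asyncio.sleep(0.05)  # the user starts talking mid-playback
    playback.cancel()
    try:
        return await playback
    except asyncio.CancelledError:
        return "interrupted"

result = asyncio.run(conversation())
```

After cancelling playback, the agent feeds the new utterance back through the STT and LLM stages and starts a fresh response.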
Build real-time multilingual voice agents without complex infrastructure
Not long ago, building a multilingual voice agent meant stitching together multiple speech systems, managing audio streaming infrastructure, and solving latency problems across the entire pipeline.
Modern streaming TTS APIs simplify this process significantly. Instead of building and maintaining custom speech infrastructure, developers can connect their language model directly to a real-time speech engine and start generating audio immediately.
In this tutorial, we built a simple voice AI agent that converts LLM responses into speech and streams audio back to the user in real time.
With Async handling real-time text-to-speech, low-latency audio delivery, and multilingual voices, developers can focus on building better conversational experiences instead of managing speech pipelines.
Try the Async Voice API and start building your own real-time voice agents.
Frequently asked questions about multilingual voice agents
What is a multilingual voice agent?
A multilingual voice agent is an AI system that can interact with users through speech in multiple languages. It typically combines speech recognition, a language model, and text-to-speech to understand spoken input and generate natural voice responses across different languages.
How does streaming text-to-speech work?
Streaming text-to-speech generates audio incrementally instead of producing a full audio file first. As text tokens are produced by the language model, the TTS system converts them into small audio chunks and streams them to the client for immediate playback.
Why is low latency important for voice AI agents?
Low latency keeps voice interactions natural. If a voice AI agent pauses too long before responding, the conversation feels slow and robotic. Starting audio playback quickly helps maintain conversational rhythm and improves the overall user experience.
Can voice AI assistants support multiple languages?
Yes. Modern AI voice assistants can support multiple languages by switching voices or language settings in the text-to-speech system. This allows the same voice agent to interact with users across different regions without changing the core architecture.
What are common use cases for voice AI agents?
Common use cases include AI assistants, customer support automation, voice-enabled applications, gaming characters, and education platforms. Many organizations use voice AI agents to provide conversational interfaces that feel more natural than traditional text-based systems.