Text-to-Speech (TTS) models, as their name suggests, convert text into speech. We have all heard that personalization is one of the biggest leverage points of AI. When it comes to Voice AI, TTS models with voice cloning capabilities let you tailor voices to desired languages, accents, and emotional tones so that interactions feel more personal and engaging. An excellent application is audiobooks, where entire books can be generated in the author's voice. We'll show you one way to approach this in the implementation section of this article.
Thanks to TTS models, information isn't just something you read; it's something you can absorb while you're cooking, driving, or waiting in line. If you haven't tried NotebookLM already, we encourage you to do so - it's incredible. Among its many features, NotebookLM generates a podcast with natural-sounding voices, turning your uploaded documents and links into digestible and engaging audio.
Our AI content team has been closely following TTS models such as Nari Labs' Dia. Interestingly, the TTS models we've been exploring don't come with research papers, which makes sense given the small teams accomplishing these feats. For example, Nari Labs, which released Dia, had only two people working on the model, and Chatterbox, which we are about to cover, is currently built by a three-person team. We're very excited about the progress made by these small but mighty teams.
Resemble AI recently launched their first open source TTS model with an MIT license. This model has been trending on Hugging Face since its release. What's unique about this model is that it introduces a feature they call emotion exaggeration control. Feel free to play around with this adjustable exaggeration parameter in their demo.
Resemble AI acknowledges CosyVoice, HiFT-GAN, and Llama 3 as inspiration. Audio files generated by Chatterbox incorporate the PerTh watermarker, allowing for detection of AI-generated content.
The voice cloning ability of Chatterbox is very impressive. In our testing, the cloned voices bore a remarkable similarity to our own. For those interested in comparisons with ElevenLabs, A/B testing is available on Podonos.
This article will cover two implementation options for using the Chatterbox TTS model:
Begin by setting up a DigitalOcean GPU Droplet: select the AI/ML image and choose the NVIDIA H100 option.
Once your GPU Droplet finishes loading, you’ll be able to open up the Web Console.
Next, install the necessary software packages. In the web console, paste and run the following commands to install pip for managing Python packages and git-lfs for handling large files:
apt update
apt install python3-pip python3.10 git-lfs -y
Now, download the application code from Hugging Face and prepare its environment.
git-lfs clone https://huggingface.co/spaces/ResembleAI/Chatterbox
cd Chatterbox
python3 -m venv venv_chatterbox
source venv_chatterbox/bin/activate
pip3 install -r requirements.txt
pip3 install spaces
To make your Gradio app accessible over the internet, you need to make a small change to its source code.
Open the main application file in the Vim text editor:
vim app.py
Press the i key to enter INSERT mode. You'll see -- INSERT -- at the bottom of the terminal. Then, locate the last line of the file, which likely looks something like demo.launch(), and modify it to include share=True:
demo.launch(share=True)
Press the ESC key to exit INSERT mode. Afterwards, type :wq and press Enter to save your changes and exit Vim.
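If you'd rather not edit the file interactively, the same change can be made with a one-line sed command. This is a convenience sketch and assumes the file ends with a plain demo.launch() call, as described above:

```shell
# Replace demo.launch() with demo.launch(share=True) in place
sed -i 's/demo\.launch()/demo.launch(share=True)/' app.py
```

Either way, verify the last line of app.py now reads demo.launch(share=True) before running the app.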
You’re all set! Run the application with the following command:
python3 app.py
After the script initializes, you will see a public URL in the terminal output. Open this URL in your web browser to interact with your live Gradio application.
To create an audiobook, Chatterbox requires a short audio sample of the author’s voice to clone it effectively.
For optimal results, the Resemble AI team recommends that reference recordings have the following characteristics:

- At least 10 seconds in duration, ideally in WAV format
- A sample rate of 24 kHz or higher
- A single speaker with no background noise, recorded on a professional microphone if possible
- Content whose emotion matches the spoken sentence, and a speaking style similar to the desired output (for example, an audiobook-style clip for audiobook generation)
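You can sanity-check a reference clip against the format-related recommendations before uploading it. The helper below uses only Python's standard-library wave module; the function name and its thresholds are our own illustration, not part of Chatterbox:

```python
import wave

def check_reference_clip(path):
    """Return a list of warnings if a WAV clip falls short of the
    recommendations above (mono, >= 24 kHz, >= 10 s). This helper
    is illustrative only; it cannot check speaking style or noise."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    warnings = []
    if channels != 1:
        warnings.append(f"{channels} channels; a mono, single-speaker clip is recommended")
    if rate < 24_000:
        warnings.append(f"sample rate {rate} Hz is below the recommended 24 kHz")
    if duration < 10:
        warnings.append(f"clip is {duration:.1f} s; at least 10 s is recommended")
    return warnings
```

An empty list means the clip passes these basic checks; anything else tells you what to re-record or resample.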
See option 1 earlier in this tutorial for instructions on setting up a GPU Droplet, cloning the Chatterbox repo, and setting up a virtual environment. Paste the command below into the terminal to install the necessary packages.
pip3 install chatterbox-tts torchaudio
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
# Load the pre-trained Chatterbox model
model = ChatterboxTTS.from_pretrained(device="cuda") # Use "cpu" if CUDA is unavailable
# Define the text to be converted into speech
text = "Your audiobook text goes here."
# Specify the path to the reference audio sample
audio_prompt_path = "author_sample.wav"
# Generate the speech waveform
wav = model.generate(text, audio_prompt_path=audio_prompt_path)
# Save the generated audio to a file
ta.save("audiobook_segment.wav", wav, model.sr)
Replace "Your audiobook text goes here." with the actual text from your audiobook, and author_sample.wav with the path to your reference audio file.
You can adjust the expressiveness and pacing of the synthesized voice using the exaggeration and cfg parameters:

- exaggeration: Controls emotional expressiveness. Higher values make the speech more dramatic.
- cfg (classifier-free guidance): Adjusts adherence to the reference voice's characteristics. Lower values can slow down the speech for clarity.
wav = model.generate(
    text,
    audio_prompt_path=audio_prompt_path,
    exaggeration=0.7,  # More expressive
    cfg=0.3            # Slower, more deliberate pacing
)
Process each chapter or section of your audiobook individually, generating the corresponding audio files. Once all segments are synthesized, use an audio editing tool like Audacity to:

- Concatenate the segments in the correct order
- Smooth transitions and normalize volume across segments
- Export the final audiobook as a single audio file
Chatterbox, developed by Resemble AI, is a recently released text-to-speech model with impressive voice cloning abilities and natural-sounding voices. The model can be run in Gradio and incorporated into a variety of use cases (e.g., audiobooks). Chatterbox represents the significant progress being made in personalized Voice AI.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
Melani is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.