Text-to-Speech (TTS) models, as their name suggests, convert text into speech. We have all heard that personalization is one of the biggest leverage points of AI. When it comes to Voice AI, TTS models with voice cloning capabilities let you tailor voices to desired languages, accents, and emotional tones so that interactions feel more personal and engaging. An excellent application is audiobooks, where entire books can be generated in the author's voice. We'll show you one way to approach this in the implementation section of this article.
Thanks to TTS models, information isn't just something you read; it's something you can absorb while you're cooking, driving, or waiting in line. If you haven't tried NotebookLM already, we encourage you to do so - it's incredible. Among its many features, NotebookLM generates a podcast with natural-sounding voices, turning your uploaded documents and links into digestible and engaging audio.
Our AI content team has been closely following TTS models such as Nari Labs' Dia. Interestingly, the TTS models we've been exploring don't come with research papers, which makes sense given the small teams accomplishing these feats. For example, Nari Labs, which released Dia, had only two people working on the model, and Chatterbox, which we are about to cover, is currently built by a three-person team. We're very excited about the progress made by these small but mighty teams.
Resemble AI recently launched their first open source TTS model with an MIT license. This model has been trending on Hugging Face since its release. What's unique about this model is that it introduces a feature they call emotion exaggeration control. Feel free to play around with this adjustable exaggeration parameter in their demo.
Resemble AI acknowledges CosyVoice, HiFT-GAN, and Llama 3 as inspiration. Audio files generated by Chatterbox incorporate the PerTh watermarker, allowing for detection of AI-generated content.
The voice cloning ability of Chatterbox is very impressive. In our testing, the cloned voices bore a remarkable similarity to our own. For those interested in comparisons with ElevenLabs, A/B testing is available on Podonos.
This article will cover two implementation options for using the Chatterbox TTS model:
Begin by setting up a DigitalOcean GPU Droplet: select the AI/ML image and choose the NVIDIA H100 option.
Once your GPU Droplet finishes loading, you’ll be able to open up the Web Console.
Next, install the necessary software packages. In the web console, paste and run the following commands to install pip for managing Python packages and git-lfs for handling large files:
apt update
apt install python3-pip python3.10 git-lfs -y
Now, download the application code from Hugging Face and prepare its environment.
git-lfs clone https://huggingface.co/spaces/ResembleAI/Chatterbox
cd Chatterbox
python3 -m venv venv_chatterbox
source venv_chatterbox/bin/activate
pip3 install -r requirements.txt
pip3 install spaces
To make your Gradio app accessible over the internet, you need to make a small change to its source code.
Open the main application file in the Vim text editor:
vim app.py
Press the i key to enter INSERT mode. You'll see -- INSERT -- at the bottom of the terminal. Then, locate the last line of the file, which likely looks something like demo.launch(), and modify it to include share=True:
demo.launch(share=True)
Press the ESC key to exit INSERT mode. Afterwards, type :wq and press Enter to save your changes and exit Vim.
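If you'd rather not edit the file interactively, the same change can be made with a one-line sed command. This is a convenience sketch and assumes the file ends with a plain demo.launch() call, as described above:

```shell
# Replace demo.launch() with demo.launch(share=True) in place
sed -i 's/demo\.launch()/demo.launch(share=True)/' app.py
```

Either way, verify the last line of app.py now reads demo.launch(share=True) before running the app.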
You’re all set! Run the application with the following command:
python3 app.py
After the script initializes, you will see a public URL in the terminal output. Open this URL in your web browser to interact with your live Gradio application.
To create an audiobook, Chatterbox requires a short audio sample of the author’s voice to clone it effectively.
For optimal results, the Resemble AI team recommends that reference recordings have the following characteristics:

- At least 10 seconds in duration, ideally in WAV format
- A sample rate of 24 kHz or higher
- A single speaker with no background noise, recorded on a professional microphone if possible
- Content whose emotion matches the spoken sentence, and a speaking style similar to the desired output (for example, an audiobook-style clip for audiobook generation)
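You can sanity-check a reference clip against the format-related recommendations before uploading it. The helper below uses only Python's standard-library wave module; the function name and its thresholds are our own illustration, not part of Chatterbox:

```python
import wave

def check_reference_clip(path):
    """Return a list of warnings if a WAV clip falls short of the
    recommendations above (mono, >= 24 kHz, >= 10 s). This helper
    is illustrative only; it cannot check speaking style or noise."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    warnings = []
    if channels != 1:
        warnings.append(f"{channels} channels; a mono, single-speaker clip is recommended")
    if rate < 24_000:
        warnings.append(f"sample rate {rate} Hz is below the recommended 24 kHz")
    if duration < 10:
        warnings.append(f"clip is {duration:.1f} s; at least 10 s is recommended")
    return warnings
```

An empty list means the clip passes these basic checks; anything else tells you what to re-record or resample.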
See option 1 earlier in this tutorial for instructions on setting up a GPU Droplet, cloning the Chatterbox repo, and setting up a virtual environment. Paste the command below into the terminal to install the necessary packages.
pip3 install chatterbox-tts torchaudio
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
# Load the pre-trained Chatterbox model
model = ChatterboxTTS.from_pretrained(device="cuda") # Use "cpu" if CUDA is unavailable
# Define the text to be converted into speech
text = "Your audiobook text goes here."
# Specify the path to the reference audio sample
audio_prompt_path = "author_sample.wav"
# Generate the speech waveform
wav = model.generate(text, audio_prompt_path=audio_prompt_path)
# Save the generated audio to a file
ta.save("audiobook_segment.wav", wav, model.sr)
Replace "Your audiobook text goes here." with the actual text from your audiobook, and author_sample.wav with the path to your reference audio file.
You can adjust the expressiveness and pacing of the synthesized voice using the exaggeration and cfg parameters:

- exaggeration: Controls emotional expressiveness. Higher values make the speech more dramatic.
- cfg (classifier-free guidance): Adjusts adherence to the reference voice's characteristics. Lower values can slow down the speech for clarity.
wav = model.generate(
    text,
    audio_prompt_path=audio_prompt_path,
    exaggeration=0.7,  # More expressive
    cfg=0.3            # Slower, more deliberate pacing
)
Process each chapter or section of your audiobook individually, generating the corresponding audio files. Once all segments are synthesized, use an audio editing tool like Audacity to:

- Concatenate the segments in the correct order
- Smooth transitions and normalize volume across segments
- Export the final audiobook as a single audio file
Chatterbox, developed by Resemble AI, is a recently released text-to-speech model with impressive voice cloning abilities and natural-sounding voices. The model can be run in Gradio and incorporated into a variety of use cases (e.g., audiobooks). Chatterbox represents the significant progress being made in personalized Voice AI.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
Melani is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.