Choosing the Right TTS Model for Your Needs

An in-depth guide to our text-to-speech models. Compare Toucan, Kokoro, Orpheus, and Chatterbox to find the perfect voice for your project.

Find the Perfect Voice: An Overview of Our TTS Models

Choosing the right text-to-speech (TTS) model is crucial for getting the perfect result. Whether you need a voice for a global audience, a hyper-realistic narration, an expressive character, or a clone of a specific voice, we have a model for you.

This guide will walk you through our powerful suite of TTS models, helping you compare their features and choose the best one for your project.

Quick Comparison

Model      | Best For                            | Language Support | Voice Cloning | Emotion Control
Toucan     | Global reach & prosody control      | 7,000+           | Yes           | No
Kokoro     | Highest audio quality & naturalness | 9 languages      | No            | No
Orpheus    | Expressive & emotional speech       | 7 languages      | Yes           | Yes (tags)
Chatterbox | High-fidelity voice cloning         | English          | Yes           | Yes (exaggeration)

In-Depth Model Guides

Here's a closer look at each model and its unique capabilities.

Toucan: The Universal Language Powerhouse

Toucan is the backbone of our TTS capabilities. It is a multilingual speech synthesis system that can generate speech in over 7,000 languages, far more than any other model we offer, which lets the application serve a truly global audience.

We use Toucan not only for direct text-to-speech synthesis but also for prosody extraction and editing. This gives you fine-grained control over the rhythm, intonation, and stress of the generated speech. Every inference in this app, regardless of the selected model, is also processed by Toucan, so you can modify the prosody of any output, including audio generated with Kokoro, Orpheus, or Chatterbox.

  • Best for: Projects requiring a vast range of languages or deep control over speech delivery.
  • Key Features:
    • Massive Language Support: Over 7,000 languages.
    • Advanced Prosody Control: Fine-tune pitch, duration, and energy for expressive speech.
    • Voice Cloning: Supports reference audio to clone voices.
  • Learn More: Toucan is based on IMS Toucan, an open-source toolkit from the Institute for Natural Language Processing (IMS) at the University of Stuttgart.
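
To make the pitch, duration, and energy controls more concrete, here is a minimal Python sketch. It does not use Toucan's actual API: the `Prosody` class and the `scale_prosody` function are hypothetical names invented for illustration, and the sketch assumes prosody can be represented as one pitch, duration, and energy value per phoneme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prosody:
    """Hypothetical per-phoneme prosody values (not Toucan's real data model)."""
    pitch: List[float]     # relative pitch per phoneme
    duration: List[float]  # duration per phoneme, in seconds
    energy: List[float]    # relative loudness per phoneme


def scale_prosody(p: Prosody, pitch: float = 1.0, duration: float = 1.0,
                  energy: float = 1.0) -> Prosody:
    """Return a new Prosody with each contour multiplied by its factor."""
    if min(pitch, duration, energy) <= 0:
        raise ValueError("scale factors must be positive")
    return Prosody(
        pitch=[v * pitch for v in p.pitch],
        duration=[v * duration for v in p.duration],
        energy=[v * energy for v in p.energy],
    )


original = Prosody(
    pitch=[1.0, 1.2, 0.9],
    duration=[0.08, 0.12, 0.10],
    energy=[1.0, 1.1, 0.8],
)

# Slow the utterance down by 50% and raise the pitch slightly;
# energy is left unchanged.
slower = scale_prosody(original, pitch=1.1, duration=1.5)
```

Uniform scaling is the simplest possible edit. Fine-grained control in practice would mean changing individual values, for example lengthening a single stressed phoneme.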

Kokoro: The Quality Champion

Kokoro stands out for audio quality and naturalness. Choose it when your primary goal is the most realistic, pleasant-sounding voice possible.

  • Best for: High-quality narrations, audiobooks, and applications where realism is paramount.
  • Key Features:
    • Exceptional Audio Quality: Produces very natural and human-like speech.
    • Multiple Languages: Supports 9 languages with a variety of professional voices.
    • Streaming Support: Can generate audio with low latency.
  • Learn More: Kokoro is an Apache-2.0 project by hexgrad.

Orpheus: The Expressive Storyteller

Orpheus is designed for expressiveness. It is a multilingual TTS model with a variety of voices and support for emotion tags. Adding tags such as [laugh], [sigh], or [gasp] to your text produces the corresponding expression in the speech, which makes Orpheus well suited to creative applications.

  • Best for: Storytelling, gaming, character voice-overs, and dynamic content.
  • Key Features:
    • Emotional Range: Use tags to add expressions and emotions to the speech.
    • Multiple Languages: Supports 7 major languages with expressive voices.
    • Speech Control: Adjust duration and stability of the voice.
    • Voice Cloning: Zero-shot cloning from reference audio.
  • Learn More: Orpheus is open source on GitHub, with a full model card on Hugging Face.
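
Before sending a script to Orpheus, it can help to check which tags it contains. The following Python sketch is illustrative only: it assumes tags are written inline in square brackets, as in the examples above, and it knows only the three tags named in this guide. The `find_tags` helper and the `KNOWN_TAGS` set are hypothetical and not part of any Orpheus API.

```python
import re
from typing import List, Tuple

# Only the tags named in this guide; the full supported set is not listed here.
KNOWN_TAGS = {"laugh", "sigh", "gasp"}

# Matches a lowercase word in square brackets, e.g. "[laugh]".
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def find_tags(script: str) -> Tuple[List[str], List[str]]:
    """Return the tags found in a script as (known, unknown) lists."""
    known: List[str] = []
    unknown: List[str] = []
    for name in TAG_PATTERN.findall(script):
        if name in KNOWN_TAGS:
            known.append(name)
        else:
            unknown.append(name)
    return known, unknown


script = "I can't believe it worked [laugh]. Well [sigh], back to work [yawn]."
known, unknown = find_tags(script)
# known holds the recognized tags in order of appearance;
# unknown holds anything else in brackets, which may need review.
```

Flagging unknown tags rather than silently dropping them makes it easier to catch typos before generating audio.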

Chatterbox: The Cloning Specialist

Chatterbox is a high-quality voice synthesis model with a focus on voice cloning. It excels at creating a digital replica of a voice from a reference audio sample. This is the model to use when you need to clone a specific voice with high fidelity.

  • Best for: Personalized voice applications, voice branding, and digital voice replication.
  • Key Features:
    • High-Quality Voice Cloning: Excellent for creating voice replicas from audio.
    • Streaming Support: Low-latency voice generation.
    • Emotion Control: Single-parameter exaggeration knob for expressive speech.
  • Learn More: Chatterbox is an MIT-licensed project from Resemble AI, with a live demo available.

How to Choose the Right Model

Still not sure which model to use? Here's a quick guide:

  • If you need to support many languages... use Toucan.
  • If you need the most natural, human-sounding voice... use Kokoro.
  • If you want to add emotions and expressions... use Orpheus.
  • If you want to clone your own voice or another voice... use Chatterbox for the highest fidelity, or Toucan if you also need broad language coverage.
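
The same decision process can be sketched in Python by encoding the Quick Comparison table as data and filtering on hard requirements. This is an illustration, not part of the product: the `Model` class and the `candidates` function are hypothetical names, and Chatterbox's English-only support is represented as a language count of 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    name: str
    languages: int          # supported languages (a lower bound for Toucan)
    cloning: bool           # supports voice cloning
    emotion: Optional[str]  # emotion control mechanism, or None


# Values taken from the Quick Comparison table.
MODELS = [
    Model("Toucan", 7000, True, None),
    Model("Kokoro", 9, False, None),
    Model("Orpheus", 7, True, "tags"),
    Model("Chatterbox", 1, True, "exaggeration"),
]


def candidates(need_cloning: bool = False, need_emotion: bool = False,
               min_languages: int = 1) -> List[str]:
    """Return the names of models that meet every stated requirement."""
    return [
        m.name for m in MODELS
        if (m.cloning or not need_cloning)
        and (m.emotion is not None or not need_emotion)
        and m.languages >= min_languages
    ]


# Models that support both voice cloning and emotion control.
expressive_clones = candidates(need_cloning=True, need_emotion=True)
```

A filter like this captures only the yes/no columns of the table. Qualities such as naturalness or cloning fidelity are matters of degree, so the final choice among the remaining candidates still follows the guidance above.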

Ready to Try Our Models?

Experience the power of advanced TTS technology with our professional-grade models.