Prosody is the "music" of speech—the rhythm, pitch, stress, and intonation that transforms flat text into engaging, expressive audio. It's what makes the difference between a robotic voice and a captivating performance that holds your audience's attention.
Every single inference in this app, regardless of the selected model, is processed by our Toucan model. The Toucan model extracts prosody from the audio, which means you can modify the prosody of any inference using the tools described here.
Our platform puts you in the director's chair with two powerful approaches to prosody control: precision phoneme-level editing and global parameter adjustments. Whether you're crafting the perfect audiobook narration, creating compelling marketing content, or developing character voices, these tools give you unprecedented control over every aspect of speech expression.
Why Prosody Control Matters
Imagine the difference between a monotone GPS voice and a skilled audiobook narrator. The words might be identical, but the prosody—how they're spoken—creates entirely different experiences. Good prosody control means:
- Better engagement: Expressive speech keeps listeners interested
- Clearer communication: Proper emphasis ensures your message is understood
- Brand consistency: Maintain the same vocal style across all your content
- Emotional impact: Convey the right mood and feeling through vocal expression
- Professional quality: Create content that rivals human narration
Two Levels of Prosody Control
Quick Global Adjustments vs. Precision Editing
Approach | Best For | Speed | Control Level | Learning Curve |
---|---|---|---|---|
Parameter Sliders | Quick adjustments, consistent style | Fast | Broad changes | Beginner-friendly |
Phoneme-Level Editing | Precise control, specific moments | Detailed | Surgical precision | Advanced |
Global Prosody Parameters: The Quick Way to Transform Speech
Speech Expression Controls
These sliders let you quickly dial in the overall character and style of your speech generation:
Prosody Creativity (0.0 - 1.2)
Controls how much the AI deviates from typical speech patterns. Think of it as the "personality dial" for your voice:
- Low values (0.0-0.4): Conservative, predictable speech patterns
- Medium values (0.5-0.8): Natural variation with some expressiveness
- High values (0.9-1.2): Bold, highly expressive speech with dramatic variation
Perfect for: Adjusting between corporate presentations (low) and storytelling (high)
Duration Scaling (0.7x - 1.3x)
The master speed control for your entire speech output:
- Faster (0.7-0.9x): Quick delivery for news, advertisements, or energetic content
- Normal (1.0x): Natural speaking pace
- Slower (1.1-1.3x): Deliberate, thoughtful delivery for educational or dramatic content
Pro tip: Slight speed adjustments (±0.1x) often sound more natural than dramatic changes
Pitch Variance (0.6 - 1.4)
Controls the melodic range of speech—how much the voice goes up and down:
- Low variance (0.6-0.8): Monotone to subtle variation, good for serious content
- Medium variance (0.9-1.1): Natural conversational patterns
- High variance (1.2-1.4): Animated, expressive delivery perfect for entertainment
Energy Variance (0.6 - 1.4)
Adjusts the dynamic range of volume and emphasis:
- Low energy (0.6-0.8): Calm, consistent volume levels
- Medium energy (0.9-1.1): Natural emphasis patterns
- High energy (1.2-1.4): Dramatic emphasis with strong peaks and valleys
Pause Duration (0.8x - 1.2x)
Controls the length of natural pauses between words and phrases:
- Shorter pauses (0.8-0.9x): Rapid-fire delivery, urgent feeling
- Normal pauses (1.0x): Natural conversation rhythm
- Longer pauses (1.1-1.2x): Thoughtful, dramatic timing
Loudness (-30 to -18 dB)
Sets the overall volume level of your generated audio:
- Quieter (-30 to -24 dB): Intimate, conversational volume
- Standard (-24 dB): Optimal level for most content
- Louder (-18 dB): Bold, attention-grabbing volume
Voice Characteristic Sliders
Six specialized controls that modify the fundamental character of the voice itself:
- Voice Timbre: Affects the basic color and resonance of the voice
- Vocal Register: Influences the natural pitch range and register tendencies
- Speaking Style: Modifies rhythm patterns and delivery style
- Vocal Texture: Adjusts breathiness and texture qualities
- Accent Control: Affects pronunciation and accent characteristics
- Emotional Undertone: Influences the underlying emotional quality
These sliders work together to create unique vocal personalities—experiment with combinations to find your signature sound.
Precision Phoneme-Level Control: Surgical Speech Editing
For ultimate control, our visual editor lets you adjust the prosody of individual sounds (phonemes) with precision that rivals professional audio editing software.
The Visual Prosody Editor
Interactive Spectrogram Display
See the actual acoustic fingerprint of your speech with an interactive spectrogram that shows:
- Frequency patterns: Visual representation of pitch and timbre
- Intensity levels: Energy and volume across time
- Phoneme boundaries: Exact timing of each speech sound
- Word segmentation: Clear divisions between words and phrases
Three-Layer Control System
1. Pitch Control Nodes Drag circular nodes to adjust the pitch of individual phonemes:
- Surgical precision: Target specific words or syllables
- Smooth curves: Automatic interpolation between control points
- Visual feedback: See your changes in real-time on the spectrogram
- Musical control: Create intentional pitch patterns for emphasis
Use cases: Emphasize questions, create dramatic rises, fix unnatural pitch jumps
2. Energy Bars Adjust the volume and intensity of each phoneme:
- Dynamic emphasis: Make specific words punch through
- Subtle shaping: Create natural stress patterns
- Volume balancing: Ensure consistent audibility
- Dramatic effects: Create whisper-to-shout transitions
Perfect for: Highlighting key words, creating emotional dynamics, balancing audio levels
3. Duration Handles Stretch or compress the timing of individual sounds:
- Rhythm control: Speed up or slow down specific passages
- Emphasis timing: Lengthen important words for impact
- Natural pacing: Fix rushed or dragged pronunciation
- Dramatic pauses: Create suspenseful timing
Essential for: Pacing control, emphasis, natural rhythm, dramatic timing
Smart Visual Feedback System
Our editor uses intelligent color coding to help you track your modifications:
- Blue controls: Unmodified, original values
- Amber highlights: Category has been modified
- Red indicators: Individual controls that have been changed
- Transparent overlays: See the spectrogram through unmodified controls
Lock and Preserve System
Protect your work with category-specific locks:
- Lock pitch: Preserve your melodic work while adjusting timing
- Lock energy: Maintain emphasis patterns during speed changes
- Lock duration: Keep timing perfect while tweaking pitch and volume
- Smart reset: Revert individual categories without losing other work
Prosody Control Strategies by Content Type
Audiobook Narration
- Parameter approach: Medium prosody creativity (0.6-0.8), natural pace (0.95-1.05x)
- Precision editing: Adjust dramatic moments, character voice distinction
- Focus areas: Chapter transitions, dialogue emphasis, emotional scenes
Marketing and Advertisements
- Parameter approach: Higher energy variance (1.1-1.3), faster pace (0.8-0.9x)
- Precision editing: Punch up key selling points, brand name emphasis
- Focus areas: Call-to-action phrases, product benefits, urgency creation
Educational Content
- Parameter approach: Lower variance for clarity, slightly slower pace (1.05-1.1x)
- Precision editing: Emphasize key concepts, create natural pauses for comprehension
- Focus areas: Important terms, step-by-step instructions, summary points
Podcast Intros and Outros
- Parameter approach: High prosody creativity (0.8-1.0), engaging energy patterns
- Precision editing: Perfect the hook, nail the brand consistency
- Focus areas: Show name pronunciation, tagline delivery, musical timing
Character Voices and Gaming
- Parameter approach: Extreme settings for unique personalities
- Precision editing: Create signature speech patterns, emotional reactions
- Focus areas: Character catchphrases, emotional outbursts, personality quirks
Advanced Prosody Techniques
Creating Natural Conversations
When generating dialogue, vary the prosody between speakers:
- Character A: Higher pitch variance, faster pace for energetic personality
- Character B: Lower energy variance, deliberate timing for calm authority
- Narrator: Balanced settings with slight emphasis on key story moments
Emotional Prosody Patterns
Different emotions have distinct prosody signatures:
- Excitement: High energy variance, faster pace, increased pitch range
- Sadness: Slower duration scaling, reduced energy, lower pitch variance
- Anger: High energy, sharp pitch changes, deliberate pauses
- Calm: Consistent energy levels, natural pace, gentle pitch curves
Brand Voice Consistency
Develop your signature sound:
- Start with reference audio from your brand
- Analyze the prosody patterns using our visual editor
- Create parameter presets that match those patterns
- Apply consistently across all content for brand recognition
Pro Tips for Mastering Prosody Control
Start Broad, Then Refine
Begin with parameter sliders to establish the overall character, then use precision editing for specific moments that need attention.
Use Reference Audio
Listen to professional voice actors in similar content. What makes their delivery compelling? Try to recreate those patterns with our controls.
Save Your Settings
Found the perfect combination of parameters for your brand? Save those settings as your baseline for consistent results.
Layer Your Edits
Make changes incrementally. Adjust timing first, then pitch, then energy. This prevents overwhelming yourself and helps you understand each control's impact.
Preview Frequently
Generate short samples as you work. It's easier to fine-tune with frequent feedback than to make large changes blindly.
Study the Spectrogram
The visual representation teaches you about speech patterns. The more you understand what good prosody looks like visually, the better your editing becomes.
Troubleshooting Common Prosody Issues
"My speech sounds robotic"
- Solution: Increase prosody creativity and pitch variance
- Precision fix: Add subtle pitch variations to flat sections
"The pacing feels unnatural"
- Solution: Adjust pause duration and use precision timing controls
- Precision fix: Vary the duration of phonemes for more natural rhythm
"Important words don't stand out"
- Solution: Increase energy variance or use precision energy controls
- Precision fix: Boost energy and slightly lengthen key words
"The voice sounds too dramatic"
- Solution: Reduce prosody creativity and variance parameters
- Precision fix: Smooth out extreme pitch or energy peaks
"Speech is too fast/slow overall"
- Solution: Adjust duration scaling parameter
- Precision fix: Target specific sections that feel rushed or dragged
The Future of Speech Expression
Prosody control represents the cutting edge of text-to-speech technology. By mastering these tools, you're not just generating speech—you're crafting performances. Whether you're creating content for business, entertainment, education, or art, the ability to control every nuance of speech expression puts professional-quality voice production at your fingertips.
Ready to transform your text into compelling audio experiences? Start with our parameter sliders to establish your baseline, then dive into precision editing to perfect those crucial moments that make all the difference.