Input & Output Modalities
Modalities allow your experts to interact through multiple channels beyond text. Configure voice input (speech-to-text) and audio output (text-to-speech) to create rich, multimodal conversational experiences.
Overview
B-Bot Hub supports:
- Input Modalities: How users communicate with your expert (voice, text, files)
- Output Modalities: How your expert responds (text, voice, images)
Accessing Modality Settings
Configure modalities when:
- Creating an expert (Step 5: Models)
- Creating an assistant (Model configuration step)
- In chat settings (during conversations)
Input Modalities
Voice Input (Speech-to-Text)
Enable voice input to allow users to speak to your expert instead of typing.
1. Enable Voice Input
Click Configure Input Modalities in the model selection step
2. Select Provider
Choose your speech-to-text provider:
- Browser Native: Uses the browser’s built-in speech recognition (free)
- OpenAI Whisper: High-accuracy transcription
- Google Speech-to-Text: Multi-language support
- Azure Speech: Enterprise-grade recognition
3. Configure Settings
- Language: Select the primary language for recognition
- Continuous: Enable continuous listening mode
- Interim Results: Show the transcription as the user speaks
4. Test
Use the test button to verify voice input is working
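If you are curious what the Browser Native option does under the hood, it typically maps onto the Web Speech API. Below is a minimal sketch, assuming a Chromium-based browser (which exposes the API as webkitSpeechRecognition); B-Bot Hub wires this up for you, so the snippet is for illustration only.

```typescript
// Minimal sketch of browser-native speech recognition via the Web Speech API.
// Assumes a browser that exposes SpeechRecognition or webkitSpeechRecognition.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

if (!SpeechRecognitionImpl) {
  console.warn("Speech recognition is not supported in this browser.");
} else {
  const recognition = new SpeechRecognitionImpl();
  recognition.lang = "en-US";        // Language: primary language for recognition
  recognition.continuous = true;     // Continuous: keep listening after a pause
  recognition.interimResults = true; // Interim Results: show text as the user speaks

  recognition.onresult = (event: any) => {
    const result = event.results[event.results.length - 1];
    const transcript: string = result[0].transcript;
    console.log(result.isFinal ? "Final:" : "Interim:", transcript);
  };

  recognition.onerror = (event: any) => console.error("Recognition error:", event.error);

  recognition.start(); // Requires microphone permission and an HTTPS page
}
```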
Supported Input Types
- Voice: Real-time voice recording and transcription
- Files: Upload documents, images, audio files
- Text: Traditional text input (always available)
Output Modalities
Text-to-Speech (Audio Output)
Enable audio output to have your expert speak responses aloud.
1. Open Output Modalities
Click Configure Output Modalities in the model settings
2. Choose TTS Provider
Select from available providers:
- OpenAI TTS: Natural-sounding voices with emotion
- ElevenLabs: Ultra-realistic voice synthesis
- Google TTS: WaveNet voices, multi-language
- Azure Speech: Neural voices with customization
- Browser Native: Built-in browser synthesis (free, basic)
3. Select Voice
Each provider offers different voices:
- OpenAI: Alloy, Echo, Fable, Onyx, Nova, Shimmer
- ElevenLabs: 100+ premium voices
- Google: Standard and WaveNet voices
- Azure: Neural voices in 100+ languages
4. Configure Playback
- Auto-play: Automatically play audio when the response completes
- Streaming TTS: Stream audio as text generates (where supported)
- Speed: Adjust playback speed (0.5x to 2.0x)
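As a concrete example of what happens once a provider, voice, and playback options are chosen, here is a rough sketch of a text-to-speech request against OpenAI's public /v1/audio/speech endpoint followed by in-browser playback. B-Bot Hub performs this call for you; the snippet only illustrates the flow.

```typescript
// Sketch: request speech from OpenAI's /v1/audio/speech endpoint and play it.
// Assumes a valid OpenAI API key; B-Bot Hub handles this once configured.
async function speak(text: string, apiKey: string): Promise<void> {
  const response = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "tts-1",  // or "tts-1-hd" for higher quality, slower generation
      voice: "alloy",  // Alloy, Echo, Fable, Onyx, Nova, or Shimmer
      input: text,
    }),
  });
  if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);

  // Turn the returned audio bytes into a playable object URL.
  const blob = await response.blob();
  const audio = new Audio(URL.createObjectURL(blob));
  audio.playbackRate = 1.0; // Speed setting (0.5x to 2.0x in the UI)
  await audio.play();       // Auto-play; may be blocked until the user interacts
}
```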
Voice Configuration
Provider-specific settings are available for OpenAI TTS, ElevenLabs, Google TTS, Azure Speech, and Browser Native. OpenAI TTS options:
Models:
- tts-1: Standard quality, fast
- tts-1-hd: High definition, slower
Voices:
- Alloy: Neutral, balanced
- Echo: Male, clear
- Fable: British accent, expressive
- Onyx: Deep, authoritative
- Nova: Female, energetic
- Shimmer: Soft, warm
Features:
- Real-time streaming
- Multiple languages
- Emotion in voice
- Fast generation
Advanced Configuration
API Key Selection
You can use a different API key for TTS than the one used for your main model. This lets you:
- Separate billing for different services
- Use specialized accounts for voice
- Manage rate limits independently
Streaming TTS
What is Streaming TTS?
Streaming TTS generates and plays audio as the text is being generated, rather than waiting for the complete response.
Benefits:
- Faster time-to-first-audio
- More natural conversation flow
- Better user experience
- Reduced perceived latency
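One common way to approximate streaming TTS, sketched below, is to synthesize the response sentence by sentence as the text streams in and queue the clips for playback. The synthesizeChunk helper is hypothetical (it stands in for a provider call such as the one shown earlier); the exact mechanism depends on the provider B-Bot Hub is talking to.

```typescript
// Sketch of sentence-level streaming TTS. `synthesizeChunk` is a hypothetical
// stand-in for a provider call that returns audio for one piece of text.
declare function synthesizeChunk(text: string): Promise<Blob>;

function playClip(blob: Blob): Promise<void> {
  return new Promise((resolve) => {
    const audio = new Audio(URL.createObjectURL(blob));
    audio.onended = () => resolve();
    audio.play().catch(() => resolve()); // If playback is blocked, skip quietly
  });
}

async function streamSpeech(textChunks: AsyncIterable<string>): Promise<void> {
  let buffer = "";
  let queue: Promise<void> = Promise.resolve();

  for await (const chunk of textChunks) {
    buffer += chunk;
    // Flush a complete sentence so audio can start before the reply finishes.
    const match = buffer.match(/^([\s\S]*?[.!?])\s*([\s\S]*)$/);
    if (match) {
      const [, sentence, rest] = match;
      buffer = rest;
      const clip = synthesizeChunk(sentence);        // Kick off synthesis now...
      queue = queue.then(() => clip.then(playClip)); // ...but play in order
    }
  }
  if (buffer.trim()) {
    const clip = synthesizeChunk(buffer);
    queue = queue.then(() => clip.then(playClip));
  }
  await queue; // Resolves after the last clip has finished playing
}
```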
Supported Providers
Streaming TTS is currently supported by:
- ✅ OpenAI TTS
- ✅ ElevenLabs (with turbo models)
- ⚠️ Google TTS (partial support)
- ❌ Azure Speech (coming soon)
- ❌ Browser Native (not supported)
How to Enable
- Open Output Modalities configuration
- Select a supported provider
- Toggle “Stream audio as text generates”
- Save configuration
Auto-Play Settings
Auto-play can be set to Always On or User Controlled. With Always On, audio plays automatically for every response.
Best for:
- Voice-first applications
- Accessibility features
- Hands-free use cases
- Customer service bots
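Note that browsers, especially on mobile, may block auto-play until the user has interacted with the page, so an Always On setup should degrade gracefully. A minimal sketch of that fallback:

```typescript
// If auto-play is blocked by the browser, fall back to manual controls.
async function autoPlay(audio: HTMLAudioElement): Promise<void> {
  try {
    await audio.play();
  } catch {
    // Auto-play was blocked (common on mobile): show the player controls
    // so the user can start playback themselves.
    audio.controls = true;
  }
}
```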
Using Voice in Chat
Voice Input
1. Enable Voice Mode
Click the microphone icon in the chat input area to enable voice mode
2. Hold to Record
Press and hold the microphone button while speaking
3. Release to Send
Release the button when done. Your speech will be transcribed and sent automatically.
Tips for best results:
- Speak clearly and at a normal pace
- Minimize background noise
- Use a good microphone
- Wait for the transcription to complete
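Under the hood, hold-to-record flows are usually built on the MediaRecorder API. The sketch below assumes a hypothetical transcribe helper that uploads the recorded audio to the configured speech-to-text provider; it is illustrative, not B-Bot Hub's actual implementation.

```typescript
// Sketch of the press-and-hold recording pattern with MediaRecorder.
// `transcribe` is a hypothetical helper that sends audio to the STT provider.
declare function transcribe(audio: Blob): Promise<string>;

async function setupPushToTalk(button: HTMLButtonElement): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (event) => chunks.push(event.data);
  recorder.onstop = async () => {
    const blob = new Blob(chunks, { type: recorder.mimeType });
    chunks.length = 0;
    const text = await transcribe(blob); // Transcribed, then sent automatically
    console.log("Transcript:", text);
  };

  button.addEventListener("pointerdown", () => recorder.start()); // Hold to record
  button.addEventListener("pointerup", () => recorder.stop());    // Release to send
}
```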
Audio Output
When TTS is enabled:
- Expert’s text response appears as normal
- Audio player appears below the message
- If auto-play is on, audio starts automatically
- Controls available: play/pause, speed, volume
Multimodal Content
Your experts can handle multiple content types in a single message:
- Voice + Text: User speaks a question, expert responds with text and audio
- Image + Voice: User uploads an image and asks about it via voice
- File + Text: User uploads a document and types a question
- Mixed Media: Combine any input types in a single interaction
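To make the combinations concrete, a multimodal message can be thought of as a list of typed content parts. The shape below is purely hypothetical and is not B-Bot Hub's actual schema:

```typescript
// Hypothetical message shape for illustration only (not B-Bot Hub's schema).
type ContentPart =
  | { type: "text"; text: string }
  | { type: "audio"; blob: Blob }               // recorded voice input
  | { type: "image"; url: string }
  | { type: "file"; name: string; blob: Blob }; // uploaded document

interface MultimodalMessage {
  parts: ContentPart[]; // any combination of input types in one interaction
}

// Example: an uploaded image plus a spoken question about it.
declare const voiceClip: Blob;
const message: MultimodalMessage = {
  parts: [
    { type: "image", url: "https://example.com/chart.png" },
    { type: "audio", blob: voiceClip },
  ],
};
```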
Best Practices
Voice Selection
Match Expert Personality
Choose voices that align with your expert’s role:
- Professional: Clear, authoritative (Onyx, Echo)
- Friendly: Warm, approachable (Nova, Shimmer)
- Technical: Neutral, precise (Alloy)
- Customer Service: Empathetic, patient (Fable)
Consider Your Audience
- Global: Use multi-language TTS providers
- Accessibility: Enable voice input and output by default
- Professional: Use high-quality voices (ElevenLabs, Azure Neural)
- Cost-Conscious: Browser native or OpenAI standard
Test Across Devices
Voice performance varies:
- Desktop browsers have better native support
- Mobile may have data usage considerations
- Test auto-play on mobile (may be blocked)
- Consider bandwidth limitations
Performance Optimization
Speed vs Quality
Fast (tts-1, turbo):
- Lower latency
- Good for chat
- Less compute intensive
High quality (tts-1-hd):
- Better sound
- More natural
- Slightly slower
Streaming vs Buffered
Streaming:
- Faster start
- Better UX
- More complex
Buffered:
- Complete audio
- Simpler
- Small delay
Troubleshooting
Microphone Not Working
Check:
- Browser permissions granted?
- Microphone connected and working?
- Try browser native option first
- Check browser console for errors
Solutions:
- Reload the page and grant permissions
- Check system microphone settings
- Try a different browser
- Use HTTPS (required for mic access)
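If you want to verify microphone access yourself, a quick check like the one below covers the two most common failures (no HTTPS, permission denied). It is a generic browser snippet, not part of B-Bot Hub:

```typescript
// Quick microphone sanity check. getUserMedia only works in secure contexts.
async function checkMicrophone(): Promise<void> {
  if (!window.isSecureContext) {
    console.warn("Not a secure context: microphone access requires HTTPS.");
    return;
  }
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    console.log("Microphone OK:", stream.getAudioTracks()[0]?.label);
    stream.getTracks().forEach((track) => track.stop()); // Release the device
  } catch (err) {
    // NotAllowedError = permission denied; NotFoundError = no microphone found
    console.error("Microphone check failed:", (err as DOMException).name);
  }
}
```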
Audio Not Playing
Check:
- Volume not muted?
- Auto-play enabled?
- Provider API key valid?
- Browser allows audio playback?
Solutions:
- Click the play button manually
- Check provider key in settings
- Try different TTS provider
- Check browser audio settings
Poor Voice Quality
Solutions:
- Switch to HD model (tts-1-hd)
- Try ElevenLabs for premium quality
- Use neural voices (Azure, Google)
- Check internet connection speed
- Reduce playback speed if garbled
High Costs
Cost-saving tips:
- Use browser native for testing
- Choose standard models over HD
- Disable auto-play (user controlled)
- Use OpenAI over ElevenLabs for lower cost
- Monitor usage in provider dashboard
Cost Comparison
| Provider | Quality | Speed | Cost (per 1M chars) | Best For |
|---|---|---|---|---|
| Browser | ⭐⭐ | ⚡⚡⚡ | Free | Testing, demos |
| OpenAI tts-1 | ⭐⭐⭐ | ⚡⚡⚡ | $15 | General use |
| OpenAI tts-1-hd | ⭐⭐⭐⭐ | ⚡⚡ | $30 | High quality |
| ElevenLabs | ⭐⭐⭐⭐⭐ | ⚡⚡ | $30-120 | Premium |
| Google Standard | ⭐⭐⭐ | ⚡⚡ | $4 | Budget |
| Google WaveNet | ⭐⭐⭐⭐ | ⚡⚡ | $16 | Value |
| Azure Neural | ⭐⭐⭐⭐ | ⚡⚡ | $15 | Enterprise |
Prices are approximate and may vary. Check provider websites for current pricing.
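For a rough budget estimate, multiply your expected character volume by the per-million-character rates above. A small sketch (the rates are the approximate figures from the table, not live pricing):

```typescript
// Rough monthly cost estimate from the (approximate) rates in the table above.
const PRICE_PER_MILLION_CHARS: Record<string, number> = {
  "openai-tts-1": 15,
  "openai-tts-1-hd": 30,
  "google-standard": 4,
  "google-wavenet": 16,
  "azure-neural": 15,
};

function estimateMonthlyCost(provider: string, charsPerMonth: number): number {
  const rate = PRICE_PER_MILLION_CHARS[provider];
  if (rate === undefined) throw new Error(`Unknown provider: ${provider}`);
  return (charsPerMonth / 1_000_000) * rate;
}

// Example: ~2,000 responses of 500 characters each on tts-1 is about $15/month.
console.log(estimateMonthlyCost("openai-tts-1", 2_000 * 500)); // 15
```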