Input & Output Modalities

Modalities allow your experts to interact through multiple channels beyond text. Configure voice input (speech-to-text) and audio output (text-to-speech) to create rich, multimodal conversational experiences.

Overview

B-Bot Hub supports:
  • Input Modalities: How users communicate with your expert (voice, text, files)
  • Output Modalities: How your expert responds (text, voice, images)

Accessing Modality Settings

Configure modalities when:
  • Creating an expert (Step 5: Models)
  • Creating an assistant (Model configuration step)
  • In chat settings (during conversations)

Input Modalities

Voice Input (Speech-to-Text)

Enable voice input to allow users to speak to your expert instead of typing.
1. Enable Voice Input: Click Configure Input Modalities in the model selection step.
2. Select Provider: Choose your speech-to-text provider:
  • Browser Native: Uses the browser’s built-in speech recognition (free)
  • OpenAI Whisper: High-accuracy transcription
  • Google Speech-to-Text: Multi-language support
  • Azure Speech: Enterprise-grade recognition
3. Configure Settings:
  • Language: Select the primary language for recognition
  • Continuous: Enable continuous listening mode
  • Interim Results: Show the transcription as the user speaks
4. Test: Use the test button to verify that voice input is working.
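
For the Browser Native provider, the Language, Continuous, and Interim Results settings map directly onto the Web Speech API. A minimal sketch of that mapping (the Hub configures this for you; the code is illustrative only):

```typescript
// Minimal Web Speech API sketch for the Browser Native provider.
// The three settings below correspond to Language, Continuous, and Interim Results.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";          // Language: primary recognition language
recognition.continuous = true;       // Continuous: keep listening across pauses
recognition.interimResults = true;   // Interim Results: show partial transcripts

recognition.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  const transcript = result[0].transcript;
  if (result.isFinal) {
    console.log("Final:", transcript);   // ready to send to the expert
  } else {
    console.log("Interim:", transcript); // live preview while speaking
  }
};

recognition.start();
```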

Supported Input Types

Voice

Real-time voice recording and transcription

Files

Upload documents, images, audio files

Text

Traditional text input (always available)

Output Modalities

Text-to-Speech (Audio Output)

Enable audio output to have your expert speak responses aloud.
1. Open Output Modalities: Click Configure Output Modalities in the model settings.
2. Choose TTS Provider: Select from the available providers:
  • OpenAI TTS: Natural-sounding voices with emotion
  • ElevenLabs: Ultra-realistic voice synthesis
  • Google TTS: WaveNet voices, multi-language
  • Azure Speech: Neural voices with customization
  • Browser Native: Built-in browser synthesis (free, basic)
3. Select Voice: Each provider offers different voices:
  • OpenAI: Alloy, Echo, Fable, Onyx, Nova, Shimmer
  • ElevenLabs: 100+ premium voices
  • Google: Standard and WaveNet voices
  • Azure: Neural voices in 100+ languages
4. Configure Playback:
  • Auto-play: Automatically play audio when the response completes
  • Streaming TTS: Stream audio as the text generates (where supported)
  • Speed: Adjust playback speed (0.5x to 2.0x)

Voice Configuration

OpenAI TTS

Models:
  • tts-1: Standard quality, fast
  • tts-1-hd: High definition, slower
Voices:
  • Alloy: Neutral, balanced
  • Echo: Male, clear
  • Fable: British accent, expressive
  • Onyx: Deep, authoritative
  • Nova: Female, energetic
  • Shimmer: Soft, warm
Features:
  • Real-time streaming
  • Multiple languages
  • Emotion in voice
  • Fast generation
Best for: General use, conversational AI
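
As a concrete illustration, here is a minimal sketch of calling OpenAI's public TTS endpoint directly; in the Hub you only pick the model and voice in the UI, so this is for reference, not something you need to write:

```typescript
// Minimal sketch: request speech from OpenAI's /v1/audio/speech endpoint
// and play it in the browser. OPENAI_API_KEY is assumed to come from your config.
declare const OPENAI_API_KEY: string;

async function speak(text: string): Promise<void> {
  const response = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "tts-1",  // or "tts-1-hd" for higher quality
      voice: "alloy",  // alloy, echo, fable, onyx, nova, or shimmer
      input: text,
    }),
  });
  const blob = await response.blob(); // MP3 audio by default
  await new Audio(URL.createObjectURL(blob)).play();
}
```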

Advanced Configuration

API Key Selection

You can use different API keys for TTS than your main model key!
For example:
  • Main Model: GPT-4 (production OpenAI key)
  • Voice Output: ElevenLabs (personal ElevenLabs key)
This allows you to:
  • Separate billing for different services
  • Use specialized accounts for voice
  • Manage rate limits independently
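
A hypothetical configuration shape illustrating the split (field names are illustrative, not B-Bot Hub's actual schema):

```typescript
// Hypothetical config illustrating separate API keys per service.
// Field names are illustrative; they are not B-Bot Hub's actual schema.
interface ExpertModalityConfig {
  model: { provider: string; name: string; apiKey: string };
  voiceOutput: { provider: string; voiceId: string; apiKey: string };
}

const config: ExpertModalityConfig = {
  model: {
    provider: "openai",
    name: "gpt-4",
    apiKey: "sk-prod-...",        // production OpenAI key
  },
  voiceOutput: {
    provider: "elevenlabs",
    voiceId: "premium-voice-id",  // any ElevenLabs voice ID
    apiKey: "el-personal-...",    // personal ElevenLabs key, billed separately
  },
};
```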

Streaming TTS

Streaming TTS generates and plays audio as the text is being generated, rather than waiting for the complete response.

Benefits:
  • Faster time-to-first-audio
  • More natural conversation flow
  • Better user experience
  • Reduced perceived latency
Currently streaming TTS is supported by:
  • ✅ OpenAI TTS
  • ✅ ElevenLabs (with turbo models)
  • ⚠️ Google TTS (partial support)
  • ❌ Azure Speech (coming soon)
  • ❌ Browser Native (not supported)
To enable streaming TTS:
  1. Open the Output Modalities configuration
  2. Select a supported provider
  3. Toggle “Stream audio as text generates”
  4. Save configuration
The audio will now start playing before the full response is complete.
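
Under the hood, streamed playback can be done in the browser with the MediaSource API. A rough sketch, assuming the provider streams MP3 over HTTP (as OpenAI TTS does); works in Chromium-based browsers:

```typescript
// Sketch: start playback while the audio is still downloading, via MediaSource.
async function playStreamingTts(url: string, init: RequestInit): Promise<void> {
  const mediaSource = new MediaSource();
  const audio = new Audio(URL.createObjectURL(mediaSource));
  audio.play().catch(() => {
    /* autoplay may be blocked; offer a manual play button */
  });

  mediaSource.addEventListener("sourceopen", async () => {
    const buffer = mediaSource.addSourceBuffer("audio/mpeg");
    const reader = (await fetch(url, init)).body!.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer.appendBuffer(value);
      // Wait until the SourceBuffer has accepted this chunk before appending more.
      await new Promise((r) =>
        buffer.addEventListener("updateend", r, { once: true })
      );
    }
    mediaSource.endOfStream();
  });
}
```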

Auto-Play Settings

Always On: Audio plays automatically for every response. Best for:
  • Voice-first applications
  • Accessibility features
  • Hands-free use cases
  • Customer service bots

User Controlled: Audio plays only when the user presses the play button on the message.
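
Note that browsers often block programmatic playback until the user has interacted with the page, so any auto-play implementation should handle rejection. A minimal sketch (showPlayButton is a hypothetical fallback callback):

```typescript
// Sketch: attempt auto-play, but fall back to a manual play button if the
// browser's autoplay policy blocks it (common on mobile).
async function autoPlay(
  audio: HTMLAudioElement,
  showPlayButton: () => void
): Promise<void> {
  try {
    await audio.play();
  } catch {
    // NotAllowedError: the user has not interacted with the page yet.
    showPlayButton();
  }
}
```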

Using Voice in Chat

Voice Input

1. Enable Voice Mode: Click the microphone icon in the chat input area to enable voice mode.
2. Hold to Record: Press and hold the microphone button while speaking.
3. Release to Send: Release the button when done. Your speech will be transcribed and sent automatically.
Tips for better recognition:
  • Speak clearly and at a normal pace
  • Minimize background noise
  • Use a good microphone
  • Wait for the transcription to complete
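
The press-and-hold pattern maps naturally onto the browser's MediaRecorder API. A minimal sketch (sendForTranscription is a hypothetical placeholder for the upload call):

```typescript
// Sketch: press-and-hold voice recording with MediaRecorder.
declare function sendForTranscription(audio: Blob): void; // hypothetical upload helper

let recorder: MediaRecorder | undefined;
const chunks: Blob[] = [];

async function onMicPress(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    const audioBlob = new Blob(chunks, { type: recorder!.mimeType });
    chunks.length = 0;
    stream.getTracks().forEach((t) => t.stop()); // release the microphone
    sendForTranscription(audioBlob);
  };
  recorder.start();
}

function onMicRelease(): void {
  recorder?.stop(); // triggers onstop, which sends the recording
}
```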

Audio Output

When TTS is enabled:
  1. Expert’s text response appears as normal
  2. Audio player appears below the message
  3. If auto-play is on, audio starts automatically
  4. Controls available: play/pause, speed, volume
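
These controls correspond to standard HTMLAudioElement properties; for reference:

```typescript
// Sketch: the chat audio controls map onto standard HTMLAudioElement properties.
function configurePlayback(audio: HTMLAudioElement): void {
  audio.playbackRate = 1.5; // Speed: 0.5x to 2.0x
  audio.volume = 0.8;       // Volume: 0.0 to 1.0
  audio.play();             // Play; audio.pause() to pause
}
```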

Multimodal Content

Your experts can handle multiple content types in a single message:

Voice + Text

User speaks a question, expert responds with text and audio

Image + Voice

User uploads an image and asks about it via voice

File + Text

User uploads a document and types a question

Mixed Media

Combine any input types in a single interaction

Best Practices

Voice Selection

Choose voices that align with your expert’s role:
  • Professional: Clear, authoritative (Onyx, Echo)
  • Friendly: Warm, approachable (Nova, Shimmer)
  • Technical: Neutral, precise (Alloy)
  • Customer Service: Empathetic, patient (Fable)
Match the provider to your deployment:
  • Global: Use multi-language TTS providers
  • Accessibility: Enable voice input and output by default
  • Professional: Use high-quality voices (ElevenLabs, Azure Neural)
  • Cost-Conscious: Browser native or OpenAI standard

Voice performance varies by platform:
  • Desktop browsers have better native support
  • Mobile may have data usage considerations
  • Test auto-play on mobile (it may be blocked)
  • Consider bandwidth limitations

Performance Optimization

Speed vs Quality

Fast (tts-1, turbo):
  • Lower latency
  • Good for chat
  • Less compute intensive
High Quality (tts-1-hd, neural):
  • Better sound
  • More natural
  • Slightly slower

Streaming vs Buffered

Streaming:
  • Faster start
  • Better UX
  • More complex
Buffered:
  • Complete audio
  • Simpler
  • Small delay

Troubleshooting

Voice input not working?

Check:
  1. Browser permissions granted?
  2. Microphone connected and working?
  3. Does the browser native option work?
  4. Any errors in the browser console?
Common fixes:
  • Reload the page and grant permissions
  • Check system microphone settings
  • Try a different browser
  • Use HTTPS (required for microphone access)
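
A quick way to inspect the microphone permission state from the browser console (the "microphone" permission name works in Chromium-based browsers, but not all; this is a diagnostic sketch, not a Hub feature):

```typescript
// Sketch: check microphone permission state (Chromium-based browsers).
navigator.permissions
  .query({ name: "microphone" as PermissionName }) // not in all TS lib definitions
  .then((status) => console.log(status.state));    // "granted" | "denied" | "prompt"
```
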
No audio output?

Check:
  1. Volume not muted?
  2. Auto-play enabled?
  3. Provider API key valid?
  4. Browser allows audio playback?
Common fixes:
  • Click the play button manually
  • Check the provider key in settings
  • Try a different TTS provider
  • Check browser audio settings

Poor audio quality?

Solutions:
  • Switch to an HD model (tts-1-hd)
  • Try ElevenLabs for premium quality
  • Use neural voices (Azure, Google)
  • Check internet connection speed
  • Reduce playback speed if garbled

Costs too high?

Cost-saving tips:
  • Use browser native for testing
  • Choose standard models over HD
  • Disable auto-play (user controlled)
  • Use OpenAI over ElevenLabs for lower cost
  • Monitor usage in the provider dashboard

Cost Comparison

Provider        | Quality | Speed | Cost (per 1M chars) | Best For
----------------|---------|-------|---------------------|---------------
Browser         | ⭐⭐     | ⚡⚡⚡  | Free                | Testing, demos
OpenAI tts-1    | ⭐⭐⭐    | ⚡⚡⚡  | $15                 | General use
OpenAI tts-1-hd | ⭐⭐⭐⭐   | ⚡⚡   | $30                 | High quality
ElevenLabs      | ⭐⭐⭐⭐⭐  | ⚡⚡   | $30-120             | Premium
Google Standard | ⭐⭐⭐    | ⚡⚡   | $4                  | Budget
Google WaveNet  | ⭐⭐⭐⭐   | ⚡⚡   | $16                 | Value
Azure Neural    | ⭐⭐⭐⭐   | ⚡⚡   | $15                 | Enterprise
Prices are approximate and may vary. Check provider websites for current pricing.
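
To put the per-character pricing in concrete terms, a rough monthly estimate using the rates from the table above (approximate; always verify against current provider pricing):

```typescript
// Sketch: rough monthly TTS cost estimate from the table's per-1M-character rates.
// Rates are approximate; verify against current provider pricing.
const RATE_PER_MILLION_CHARS: Record<string, number> = {
  "openai-tts-1": 15,
  "openai-tts-1-hd": 30,
  "google-standard": 4,
  "azure-neural": 15,
};

function estimateMonthlyCost(
  provider: string,
  responsesPerDay: number,
  avgCharsPerResponse: number
): number {
  const charsPerMonth = responsesPerDay * avgCharsPerResponse * 30;
  return (charsPerMonth / 1_000_000) * RATE_PER_MILLION_CHARS[provider];
}

// Example: 500 responses/day at ~400 characters each on tts-1:
// 500 * 400 * 30 = 6,000,000 chars/month -> 6 x $15 = $90/month.
console.log(estimateMonthlyCost("openai-tts-1", 500, 400)); // 90
```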

Next Steps