Input & Output Modalities

Modalities allow your experts to interact through multiple channels beyond text. Configure voice input (speech-to-text) and audio output (text-to-speech) to create rich, multimodal conversational experiences.

Overview

B-Bot Hub supports:
  • Input Modalities: How users communicate with your expert (voice, text, files)
  • Output Modalities: How your expert responds (text, voice, images)

Accessing Modality Settings

Configure modalities when:
  • Creating an expert (Step 5: Models)
  • Creating an assistant (Model configuration step)
  • In chat settings (during conversations)

Input Modalities

Voice Input (Speech-to-Text)

Enable voice input to allow users to speak to your expert instead of typing.
1. Enable Voice Input: Click Configure Input Modalities in the model selection step.
2. Select Provider: Choose your speech-to-text provider:
  • Browser Native: Uses the browser’s built-in speech recognition (free)
  • OpenAI Whisper: High-accuracy transcription
  • Google Speech-to-Text: Multi-language support
  • Azure Speech: Enterprise-grade recognition
3. Configure Settings:
  • Language: Select the primary language for recognition
  • Continuous: Enable continuous listening mode
  • Interim Results: Show the transcription as the user speaks
4. Test: Use the test button to verify that voice input is working.
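
For the Browser Native provider, the Language, Continuous, and Interim Results settings map directly onto the Web Speech API. A minimal sketch of that mapping (the Hub configures this for you; the code is illustrative only):

```typescript
// Minimal Web Speech API sketch for the Browser Native provider.
// The three settings below correspond to Language, Continuous, and Interim Results.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";          // Language: primary recognition language
recognition.continuous = true;       // Continuous: keep listening across pauses
recognition.interimResults = true;   // Interim Results: show partial transcripts

recognition.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  const transcript = result[0].transcript;
  if (result.isFinal) {
    console.log("Final:", transcript);   // ready to send to the expert
  } else {
    console.log("Interim:", transcript); // live preview while speaking
  }
};

recognition.start();
```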

Supported Input Types

Voice

Real-time voice recording and transcription

Files

Upload documents, images, audio files

Text

Traditional text input (always available)

Output Modalities

Text-to-Speech (Audio Output)

Enable audio output to have your expert speak responses aloud.
1. Open Output Modalities: Click Configure Output Modalities in the model settings.
2. Choose TTS Provider: Select from the available providers:
  • OpenAI TTS: Natural-sounding voices with emotion
  • ElevenLabs: Ultra-realistic voice synthesis
  • Google TTS: WaveNet voices, multi-language
  • Azure Speech: Neural voices with customization
  • Browser Native: Built-in browser synthesis (free, basic)
3. Select Voice: Each provider offers different voices:
  • OpenAI: Alloy, Echo, Fable, Onyx, Nova, Shimmer
  • ElevenLabs: 100+ premium voices
  • Google: Standard and WaveNet voices
  • Azure: Neural voices in 100+ languages
4. Configure Playback:
  • Auto-play: Automatically play audio when the response completes
  • Streaming TTS: Stream audio as the text generates (where supported)
  • Speed: Adjust playback speed (0.5x to 2.0x)

Voice Configuration

OpenAI TTS

Models:
  • tts-1: Standard quality, fast
  • tts-1-hd: High definition, slower
Voices:
  • Alloy: Neutral, balanced
  • Echo: Male, clear
  • Fable: British accent, expressive
  • Onyx: Deep, authoritative
  • Nova: Female, energetic
  • Shimmer: Soft, warm
Features:
  • Real-time streaming
  • Multiple languages
  • Emotion in voice
  • Fast generation
Best for: General use, conversational AI
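
As a concrete illustration, here is a minimal sketch of calling OpenAI's public TTS endpoint directly; in the Hub you only pick the model and voice in the UI, so this is for reference, not something you need to write:

```typescript
// Minimal sketch: request speech from OpenAI's /v1/audio/speech endpoint
// and play it in the browser. OPENAI_API_KEY is assumed to come from your config.
declare const OPENAI_API_KEY: string;

async function speak(text: string): Promise<void> {
  const response = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "tts-1",  // or "tts-1-hd" for higher quality
      voice: "alloy",  // alloy, echo, fable, onyx, nova, or shimmer
      input: text,
    }),
  });
  const blob = await response.blob(); // MP3 audio by default
  await new Audio(URL.createObjectURL(blob)).play();
}
```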

Advanced Configuration

API Key Selection

You can use different API keys for TTS than your main model key!
For example:
  • Main Model: GPT-4 (production OpenAI key)
  • Voice Output: ElevenLabs (personal ElevenLabs key)
This allows you to:
  • Separate billing for different services
  • Use specialized accounts for voice
  • Manage rate limits independently
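
A hypothetical configuration shape illustrating the split (field names are illustrative, not B-Bot Hub's actual schema):

```typescript
// Hypothetical config illustrating separate API keys per service.
// Field names are illustrative; they are not B-Bot Hub's actual schema.
interface ExpertModalityConfig {
  model: { provider: string; name: string; apiKey: string };
  voiceOutput: { provider: string; voiceId: string; apiKey: string };
}

const config: ExpertModalityConfig = {
  model: {
    provider: "openai",
    name: "gpt-4",
    apiKey: "sk-prod-...",        // production OpenAI key
  },
  voiceOutput: {
    provider: "elevenlabs",
    voiceId: "premium-voice-id",  // any ElevenLabs voice ID
    apiKey: "el-personal-...",    // personal ElevenLabs key, billed separately
  },
};
```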

Streaming TTS

Streaming TTS generates and plays audio as the text is being generated, rather than waiting for the complete response.

Benefits:
  • Faster time-to-first-audio
  • More natural conversation flow
  • Better user experience
  • Reduced perceived latency
Currently streaming TTS is supported by:
  • ✅ OpenAI TTS
  • ✅ ElevenLabs (with turbo models)
  • ⚠️ Google TTS (partial support)
  • ❌ Azure Speech (coming soon)
  • ❌ Browser Native (not supported)
To enable streaming TTS:
  1. Open the Output Modalities configuration
  2. Select a supported provider
  3. Toggle “Stream audio as text generates”
  4. Save configuration
The audio will now start playing before the full response is complete.
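
Under the hood, streamed playback can be done in the browser with the MediaSource API. A rough sketch, assuming the provider streams MP3 over HTTP (as OpenAI TTS does); works in Chromium-based browsers:

```typescript
// Sketch: start playback while the audio is still downloading, via MediaSource.
async function playStreamingTts(url: string, init: RequestInit): Promise<void> {
  const mediaSource = new MediaSource();
  const audio = new Audio(URL.createObjectURL(mediaSource));
  audio.play().catch(() => {
    /* autoplay may be blocked; offer a manual play button */
  });

  mediaSource.addEventListener("sourceopen", async () => {
    const buffer = mediaSource.addSourceBuffer("audio/mpeg");
    const reader = (await fetch(url, init)).body!.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer.appendBuffer(value);
      // Wait until the SourceBuffer has accepted this chunk before appending more.
      await new Promise((r) =>
        buffer.addEventListener("updateend", r, { once: true })
      );
    }
    mediaSource.endOfStream();
  });
}
```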

Auto-Play Settings

Always On: Audio plays automatically for every response. Best for:
  • Voice-first applications
  • Accessibility features
  • Hands-free use cases
  • Customer service bots

User Controlled: Audio plays only when the user presses the play button on the message.
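
Note that browsers often block programmatic playback until the user has interacted with the page, so any auto-play implementation should handle rejection. A minimal sketch (showPlayButton is a hypothetical fallback callback):

```typescript
// Sketch: attempt auto-play, but fall back to a manual play button if the
// browser's autoplay policy blocks it (common on mobile).
async function autoPlay(
  audio: HTMLAudioElement,
  showPlayButton: () => void
): Promise<void> {
  try {
    await audio.play();
  } catch {
    // NotAllowedError: the user has not interacted with the page yet.
    showPlayButton();
  }
}
```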

Using Voice in Chat

Voice Input

1. Enable Voice Mode: Click the microphone icon in the chat input area to enable voice mode.
2. Hold to Record: Press and hold the microphone button while speaking.
3. Release to Send: Release the button when done. Your speech will be transcribed and sent automatically.
Tips for better recognition:
  • Speak clearly and at a normal pace
  • Minimize background noise
  • Use a good microphone
  • Wait for the transcription to complete
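
The press-and-hold pattern maps naturally onto the browser's MediaRecorder API. A minimal sketch (sendForTranscription is a hypothetical placeholder for the upload call):

```typescript
// Sketch: press-and-hold voice recording with MediaRecorder.
declare function sendForTranscription(audio: Blob): void; // hypothetical upload helper

let recorder: MediaRecorder | undefined;
const chunks: Blob[] = [];

async function onMicPress(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    const audioBlob = new Blob(chunks, { type: recorder!.mimeType });
    chunks.length = 0;
    stream.getTracks().forEach((t) => t.stop()); // release the microphone
    sendForTranscription(audioBlob);
  };
  recorder.start();
}

function onMicRelease(): void {
  recorder?.stop(); // triggers onstop, which sends the recording
}
```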

Audio Output

When TTS is enabled:
  1. Expert’s text response appears as normal
  2. Audio player appears below the message
  3. If auto-play is on, audio starts automatically
  4. Controls available: play/pause, speed, volume
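
These controls correspond to standard HTMLAudioElement properties; for reference:

```typescript
// Sketch: the chat audio controls map onto standard HTMLAudioElement properties.
function configurePlayback(audio: HTMLAudioElement): void {
  audio.playbackRate = 1.5; // Speed: 0.5x to 2.0x
  audio.volume = 0.8;       // Volume: 0.0 to 1.0
  audio.play();             // Play; audio.pause() to pause
}
```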

Multimodal Content

Your experts can handle multiple content types in a single message:

Voice + Text

User speaks a question, expert responds with text and audio

Image + Voice

User uploads an image and asks about it via voice

File + Text

User uploads a document and types a question

Mixed Media

Combine any input types in a single interaction

Best Practices

Voice Selection

Choose voices that align with your expert’s role:
  • Professional: Clear, authoritative (Onyx, Echo)
  • Friendly: Warm, approachable (Nova, Shimmer)
  • Technical: Neutral, precise (Alloy)
  • Customer Service: Empathetic, patient (Fable)
Match the provider to your deployment:
  • Global: Use multi-language TTS providers
  • Accessibility: Enable voice input and output by default
  • Professional: Use high-quality voices (ElevenLabs, Azure Neural)
  • Cost-Conscious: Browser native or OpenAI standard

Voice performance varies by platform:
  • Desktop browsers have better native support
  • Mobile may have data usage considerations
  • Test auto-play on mobile (it may be blocked)
  • Consider bandwidth limitations

Performance Optimization

Speed vs Quality

Fast (tts-1, turbo):
  • Lower latency
  • Good for chat
  • Less compute intensive
High Quality (tts-1-hd, neural):
  • Better sound
  • More natural
  • Slightly slower

Streaming vs Buffered

Streaming:
  • Faster start
  • Better UX
  • More complex
Buffered:
  • Complete audio
  • Simpler
  • Small delay

Troubleshooting

Voice input not working?

Check:
  1. Browser permissions granted?
  2. Microphone connected and working?
  3. Does the browser native option work?
  4. Any errors in the browser console?
Common fixes:
  • Reload the page and grant permissions
  • Check system microphone settings
  • Try a different browser
  • Use HTTPS (required for microphone access)
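
A quick way to inspect the microphone permission state from the browser console (the "microphone" permission name works in Chromium-based browsers, but not all; this is a diagnostic sketch, not a Hub feature):

```typescript
// Sketch: check microphone permission state (Chromium-based browsers).
navigator.permissions
  .query({ name: "microphone" as PermissionName }) // not in all TS lib definitions
  .then((status) => console.log(status.state));    // "granted" | "denied" | "prompt"
```
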
No audio output?

Check:
  1. Volume not muted?
  2. Auto-play enabled?
  3. Provider API key valid?
  4. Browser allows audio playback?
Common fixes:
  • Click the play button manually
  • Check the provider key in settings
  • Try a different TTS provider
  • Check browser audio settings

Poor audio quality?

Solutions:
  • Switch to an HD model (tts-1-hd)
  • Try ElevenLabs for premium quality
  • Use neural voices (Azure, Google)
  • Check internet connection speed
  • Reduce playback speed if garbled

Costs too high?

Cost-saving tips:
  • Use browser native for testing
  • Choose standard models over HD
  • Disable auto-play (user controlled)
  • Use OpenAI over ElevenLabs for lower cost
  • Monitor usage in the provider dashboard

Cost Comparison

Provider        | Quality | Speed | Cost (per 1M chars) | Best For
----------------|---------|-------|---------------------|---------------
Browser         | ⭐⭐     | ⚡⚡⚡  | Free                | Testing, demos
OpenAI tts-1    | ⭐⭐⭐    | ⚡⚡⚡  | $15                 | General use
OpenAI tts-1-hd | ⭐⭐⭐⭐   | ⚡⚡   | $30                 | High quality
ElevenLabs      | ⭐⭐⭐⭐⭐  | ⚡⚡   | $30-120             | Premium
Google Standard | ⭐⭐⭐    | ⚡⚡   | $4                  | Budget
Google WaveNet  | ⭐⭐⭐⭐   | ⚡⚡   | $16                 | Value
Azure Neural    | ⭐⭐⭐⭐   | ⚡⚡   | $15                 | Enterprise
Prices are approximate and may vary. Check provider websites for current pricing.
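
To put the per-character pricing in concrete terms, a rough monthly estimate using the rates from the table above (approximate; always verify against current provider pricing):

```typescript
// Sketch: rough monthly TTS cost estimate from the table's per-1M-character rates.
// Rates are approximate; verify against current provider pricing.
const RATE_PER_MILLION_CHARS: Record<string, number> = {
  "openai-tts-1": 15,
  "openai-tts-1-hd": 30,
  "google-standard": 4,
  "azure-neural": 15,
};

function estimateMonthlyCost(
  provider: string,
  responsesPerDay: number,
  avgCharsPerResponse: number
): number {
  const charsPerMonth = responsesPerDay * avgCharsPerResponse * 30;
  return (charsPerMonth / 1_000_000) * RATE_PER_MILLION_CHARS[provider];
}

// Example: 500 responses/day at ~400 characters each on tts-1:
// 500 * 400 * 30 = 6,000,000 chars/month -> 6 x $15 = $90/month.
console.log(estimateMonthlyCost("openai-tts-1", 500, 400)); // 90
```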

Next Steps