Engines
ElevenLabs (Cloud)
High-quality TTS and STT via API. Requires ELEVENLABS_API_KEY.
Local Commands
Any local STT/TTS tool (Whisper, Kokoro, Piper, Coqui, etc.), run as a subprocess via a configurable command.
Quick Setup
- ElevenLabs (default)
- Local Whisper + Kokoro
- Mixed (Whisper + ElevenLabs)
No engine config needed — just add your API key:
Subprocess Protocol
Local engines are invoked as subprocesses with a simple file-based interface. This means you can use any STT/TTS tool: just wrap it in a script that follows the protocol.
STT (Speech-to-Text)
- Octo writes the audio to a temp .ogg file and appends the path as the last argument
- Your script reads the audio file and prints transcribed text to stdout
- Exit code 0 = success
TTS (Text-to-Speech)
- Octo creates a temp .wav path and appends it as the last argument
- Text to synthesize is sent via stdin
- Your script reads stdin, generates audio, and writes it to the output file
- Exit code 0 = success
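To check a script against the two protocols above before pointing Octo at it, a small harness can imitate both calls. This is a sketch of the calling side based only on the rules listed here, not Octo's actual implementation; `run_stt` and `run_tts` are hypothetical names.

```python
import os
import shlex
import subprocess
import tempfile


def run_stt(command: str, audio_bytes: bytes) -> str:
    """Write audio to a temp .ogg, append its path, return the script's stdout."""
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as tmp:
        tmp.write(audio_bytes)
        path = tmp.name
    try:
        proc = subprocess.run(
            shlex.split(command) + [path],  # path is the last argument
            capture_output=True, text=True, check=True,  # non-zero exit raises
        )
        return proc.stdout.strip()
    finally:
        os.remove(path)


def run_tts(command: str, text: str) -> bytes:
    """Send text on stdin, append a temp .wav path, return the bytes written there."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        subprocess.run(
            shlex.split(command) + [path],  # path is the last argument
            input=text, text=True, check=True,  # non-zero exit raises
        )
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

For example, `run_stt("python transcribe.py", audio_bytes)` should return the transcript, and `run_tts("python synthesize.py", "Hello")` should return non-empty WAV data.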
The configurable command approach gives maximum flexibility — your scripts handle model selection, voice choice, language, speed, and any other engine-specific options. Octo doesn’t need to know about those details.
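Because the file path is always appended last, a script can accept its own options ahead of it. The sketch below shows one way to parse them; the `--model` and `--language` flags are hypothetical examples, not options Octo defines.

```python
import argparse


def parse_args(argv):
    """Parse engine options; the path Octo appends is the final positional."""
    parser = argparse.ArgumentParser(description="STT wrapper (illustrative)")
    parser.add_argument("--model", default="base")    # hypothetical flag
    parser.add_argument("--language", default=None)   # hypothetical flag
    parser.add_argument("audio_path")                 # appended by Octo
    return parser.parse_args(argv)
```

With this in place, a command such as `python transcribe.py --model small --language en` keeps all engine-specific choices inside the command string, and the script reads the audio path from `args.audio_path`.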
Example Scripts
Whisper STT (transcribe.py):
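A minimal sketch of such a script, assuming the openai-whisper package and ffmpeg are installed. The `base` model and the injectable `transcribe_fn` parameter are choices made for this example, not something Octo requires.

```python
import sys


def transcribe(audio_path: str, model_name: str = "base") -> str:
    # Imported lazily so the module still loads where openai-whisper is absent.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["text"].strip()


def main(argv, transcribe_fn=transcribe) -> int:
    """Transcribe the file named by the last argument and print the text."""
    try:
        text = transcribe_fn(argv[-1])
    except Exception as exc:
        # Diagnostics go to stderr so stdout carries only the transcript.
        print(f"transcription failed: {exc}", file=sys.stderr)
        return 1
    print(text)
    return 0


# Octo always appends the audio path, so a real call has at least one argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Loading the model on every call is the simplest design; it costs a few seconds per message, which fits within the STT timeout for small models.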
Kokoro TTS (synthesize.py):
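The sketch below shows only the protocol side of a TTS script: read stdin, write a WAV file to the last argument, exit 0. The `synthesize` function is a placeholder that writes a short tone using the standard library; it is an assumption for illustration, and you would replace its body with a call into Kokoro, Piper, Coqui, or another engine.

```python
import math
import struct
import sys
import wave

SAMPLE_RATE = 16000


def synthesize(text: str, out_path: str) -> None:
    """Placeholder engine: writes a 440 Hz tone whose length scales with the text."""
    n_frames = int(SAMPLE_RATE * min(0.05 * len(text), 2.0))
    frames = b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)))
        for i in range(n_frames)
    )
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)


def main(argv, stdin=sys.stdin) -> int:
    """Read text from stdin and write audio to the path given as last argument."""
    text = stdin.read().strip()
    if not text:
        print("no text on stdin", file=sys.stderr)
        return 1
    synthesize(text, argv[-1])
    return 0


# Octo always appends the output path, so a real call has at least one argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```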
Voiceover Text Preparation
Raw AI responses often contain markdown, file paths, code blocks, and URLs, which sound unnatural when spoken aloud. Octo automatically rewrites text into natural speech before synthesis.
Voiceover prep (LLM rewrite)
A cheap model rewrites the text as a natural voiceover script: removing formatting, summarizing technical details, keeping the key message.
To skip the rewrite, call synthesize(text, prep=False).
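The prep step itself is an LLM rewrite and cannot be reproduced in a few lines. As a rough illustration of the kind of cleanup involved, the sketch below strips common markdown with regular expressions. This is a swapped-in, much cruder technique, not Octo's implementation, and `strip_for_speech` is a hypothetical helper.

```python
import re


def strip_for_speech(text: str) -> str:
    """Crude, regex-only cleanup of markdown so text reads better aloud."""
    text = re.sub(r"`{3}.*?`{3}", " code omitted ", text, flags=re.DOTALL)  # fenced blocks
    text = re.sub(r"`([^`]*)`", r"\1", text)                  # inline code
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)      # [label](url) -> label
    text = re.sub(r"https?://\S+", "link", text)              # bare URLs
    text = re.sub(r"^\s{0,3}#{1,6}\s*", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"^\s*[-*+]\s+", "", text, flags=re.MULTILINE)       # list bullets
    text = re.sub(r"(\*\*|\*)(.+?)\1", r"\2", text)           # bold / italics
    return re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
```

Unlike an LLM rewrite, this cannot summarize technical details or rephrase file paths; it only removes formatting characters.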
Usage
CLI
Toggle voice mode:
Telegram
When using the Telegram transport:
- Incoming voice messages are transcribed using the configured STT engine
- Responses are sent back as voice messages (with voiceover prep)
- Text responses are always sent alongside voice for accessibility
Configuration Reference
| Variable | Default | Description |
|---|---|---|
| VOICE_STT_ENGINE | elevenlabs | STT engine: elevenlabs or whisper |
| VOICE_TTS_ENGINE | elevenlabs | TTS engine: elevenlabs or kokoro |
| WHISPER_COMMAND | — | Full command for local STT (e.g. python transcribe.py) |
| KOKORO_COMMAND | — | Full command for local TTS (e.g. python synthesize.py) |
| ELEVENLABS_API_KEY | — | ElevenLabs API key (required for cloud engines) |
| ELEVENLABS_VOICE_ID | — | ElevenLabs voice ID (optional, uses default) |
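The sketch below expresses the three Quick Setup combinations as variable sets and checks them for consistency. The rule that a local engine needs its command variable set is an assumption inferred from the table, and `check_voice_config` is a hypothetical helper, not part of Octo.

```python
def check_voice_config(env: dict) -> list:
    """Return a list of problems found in the voice settings (empty if none)."""
    problems = []
    stt = env.get("VOICE_STT_ENGINE", "elevenlabs")  # default from the table
    tts = env.get("VOICE_TTS_ENGINE", "elevenlabs")  # default from the table
    if stt not in ("elevenlabs", "whisper"):
        problems.append(f"unknown STT engine: {stt}")
    if tts not in ("elevenlabs", "kokoro"):
        problems.append(f"unknown TTS engine: {tts}")
    if "elevenlabs" in (stt, tts) and not env.get("ELEVENLABS_API_KEY"):
        problems.append("ELEVENLABS_API_KEY is required for cloud engines")
    if stt == "whisper" and not env.get("WHISPER_COMMAND"):
        problems.append("WHISPER_COMMAND is required when STT engine is whisper")
    if tts == "kokoro" and not env.get("KOKORO_COMMAND"):
        problems.append("KOKORO_COMMAND is required when TTS engine is kokoro")
    return problems


# ElevenLabs (default): only the API key is needed.
ELEVENLABS_ONLY = {"ELEVENLABS_API_KEY": "your-key"}

# Local Whisper + Kokoro: both engines run as local commands.
LOCAL_ONLY = {
    "VOICE_STT_ENGINE": "whisper",
    "VOICE_TTS_ENGINE": "kokoro",
    "WHISPER_COMMAND": "python transcribe.py",
    "KOKORO_COMMAND": "python synthesize.py",
}

# Mixed: local Whisper for STT, ElevenLabs (the default) for TTS.
MIXED = {
    "VOICE_STT_ENGINE": "whisper",
    "WHISPER_COMMAND": "python transcribe.py",
    "ELEVENLABS_API_KEY": "your-key",
}
```

Calling `check_voice_config(dict(os.environ))` after `import os` would run the same checks against the live environment.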
Timeouts
| Engine | Timeout | Notes |
|---|---|---|
| STT (local command) | 120s | Whisper on base model typically takes 5-15s |
| TTS (local command) | 180s | Depends on text length and model size |
| Voiceover prep | LLM default | Uses LOW tier model for cost efficiency |
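When testing a local command, it can help to apply the same limits yourself. The sketch below uses the timeout values from the table; what Octo does after a timeout is not described here, so returning a failure flag is an assumption, and `run_limited` is a hypothetical helper.

```python
import subprocess

# Limits taken from the table above.
STT_TIMEOUT_S = 120
TTS_TIMEOUT_S = 180


def run_limited(args, timeout_s, stdin_text=None):
    """Run a command; return (ok, stdout). ok is False on timeout or non-zero exit."""
    try:
        proc = subprocess.run(
            args, input=stdin_text, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return False, ""
    return proc.returncode == 0, proc.stdout
```

For example, `run_limited(["python", "transcribe.py", "sample.ogg"], STT_TIMEOUT_S)` reports whether the script finished within the STT limit.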

