MODULE_02 // VOICE INFRASTRUCTURE
RealtimeSTT
Production Voice Pipeline
- Role: Maintainer
- Period: 2024–2026
- Status: PRODUCTION
- Domain: Voice Infrastructure
Problem
Most speech-to-text libraries are either real-time (fast but low accuracy) or accurate (slow, batch-oriented). Voice agents need both: partial transcriptions streaming in real-time so the user sees their words as they speak, and a high-accuracy final transcription for the agent to actually respond to.
The existing Python options each solved one half of this. I needed one library that solved both, with voice activity detection, wake word support, and a clean enough API to drop into any agent.
Approach
RealtimeSTT is a multi-process Python library I maintain, forked from KoljaB's original work with custom integrations for the Hardcore agent stack. It spans 467 source files, a 2,800-line AudioToTextRecorder core, and a dual-model transcription pipeline that runs a lightweight model for partials and a heavy model for finals, concurrently.
Build Notes
Dual-model transcription. A lightweight Faster-Whisper tiny model generates partial transcriptions every ~100ms for responsive UI feedback. In parallel, a Faster-Whisper large model waits for the VAD to close the utterance, then produces the final high-accuracy transcription. Both share the same audio buffer via SafePipe.
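In recorder terms, that looks roughly like the sketch below: the tiny model streams partials through a callback while the large model produces the final result. A minimal sketch; parameter names follow recent RealtimeSTT releases, and exact defaults may differ by version.

```python
from RealtimeSTT import AudioToTextRecorder

def on_partial(text):
    # Called roughly every ~100ms with the tiny model's partial transcription.
    print(f"\r{text}", end="", flush=True)

if __name__ == "__main__":  # required: the recorder spawns worker processes
    recorder = AudioToTextRecorder(
        model="large-v2",                  # heavy Faster-Whisper model for finals
        realtime_model_type="tiny",        # lightweight model for streaming partials
        enable_realtime_transcription=True,
        on_realtime_transcription_update=on_partial,
    )
    while True:
        # Blocks until the VAD closes the utterance, then returns the
        # final high-accuracy transcription from the large model.
        print("\n->", recorder.text())
```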
Layered voice activity detection. WebRTC VAD does the cheap frame-level gate. Silero VAD runs a neural model for harder cases (background noise, non-speech audio). The two work in sequence — WebRTC rejects the easy noise, Silero catches what slips through.
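Tuning that two-stage gate comes down to two sensitivity knobs plus a silence timeout. A minimal sketch, again using parameter names from recent releases:

```python
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    webrtc_sensitivity=2,              # 0 (most sensitive) .. 3 (most aggressive filtering)
    silero_sensitivity=0.4,            # 0..1; neural VAD that catches what WebRTC misses
    post_speech_silence_duration=0.6,  # seconds of silence before the utterance closes
)
print(recorder.text())
```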
Wake word as a first-class input. Porcupine and OpenWakeWord are integrated as alternative recording triggers. The recorder can start on VAD (always-listening mode) or on wake word ("Hey Hardcore…") with a single config flag.
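Switching trigger modes is a one-line change. A sketch assuming the Porcupine backend with one of its built-in keywords (a custom phrase like "Hey Hardcore" needs a trained keyword file):

```python
from RealtimeSTT import AudioToTextRecorder

# Default trigger is VAD: AudioToTextRecorder() starts recording on detected speech.
# With a wake word configured, recording instead starts after the keyword fires.
recorder = AudioToTextRecorder(
    wakeword_backend="pvporcupine",  # or "openwakeword"
    wake_words="jarvis",             # built-in Porcupine keyword; custom phrases
                                     # require a trained keyword file
    wake_words_sensitivity=0.6,      # 0..1 trigger sensitivity
)
print(recorder.text())
```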
Multiprocess architecture with SafePipe. The recorder, the realtime model, and the main model each run in their own process. A custom SafePipe IPC layer moves audio and text between them without blocking the main thread. This is the reason the library can hit sub-300ms partial latency on consumer hardware.
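SafePipe itself is internal to the library, but the underlying pattern is plain multiprocessing: worker processes pull audio chunks off a duplex pipe so heavy inference never blocks the capture thread. An illustrative sketch of that pattern, not RealtimeSTT's actual implementation:

```python
import multiprocessing as mp

def transcription_worker(conn):
    # Runs in its own process: receives raw audio chunks, sends text back.
    while True:
        chunk = conn.recv()
        if chunk is None:  # sentinel: shut down cleanly
            break
        text = f"<{len(chunk)} bytes transcribed>"  # stand-in for model inference
        conn.send(text)

if __name__ == "__main__":
    parent, child = mp.Pipe(duplex=True)
    worker = mp.Process(target=transcription_worker, args=(child,), daemon=True)
    worker.start()

    # The capture side just forwards chunks; inference latency never stalls
    # audio capture because it lives in the other process.
    for chunk in (b"\x00" * 3200, b"\x00" * 3200):
        parent.send(chunk)
        print(parent.recv())

    parent.send(None)
    worker.join()
```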
FastAPI + WebSocket server + web UI. Shipped as a ready-to-run service. Clone the repo, run the server, open the HTML client, and you have a working voice interface. That's how the Anthropic SDK integration I added hooks into Claude for end-to-end voice-driven assistants.
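From the client side, talking to the server is ordinary WebSocket traffic. A hedged sketch of a minimal client, assuming an endpoint at ws://localhost:8011 that accepts raw PCM in and returns JSON transcriptions; the endpoint, port, and message format here are illustrative assumptions, so consult the repo's server docs for the real protocol.

```python
import asyncio
import json

import websockets  # pip install websockets

async def stream_audio(path="utterance.raw"):
    # Endpoint and message shape are assumptions for illustration only.
    async with websockets.connect("ws://localhost:8011") as ws:
        with open(path, "rb") as f:
            while chunk := f.read(3200):  # ~100ms of 16kHz 16-bit mono PCM
                await ws.send(chunk)
        async for message in ws:
            print(json.loads(message))

asyncio.run(stream_audio())
```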
PyPI published, versioned, documented. Installable via pip install RealtimeSTT, versioned at 0.3.104+, and live on PyPI with download stats and release notes.
Results
Open-source library used in my own voice agent stack and available for anyone else who needs sub-300ms streaming transcription with high-accuracy finals. The multiprocess architecture is the pattern I reach for whenever a project needs real-time audio and heavy ML inference in the same Python application.