MODULE_02 // VOICE INFRASTRUCTURE

RealtimeSTT

Production Voice Pipeline

Role
Maintainer
Period
2024–2026
Status
PRODUCTION
Domain
Voice Infrastructure

// STACK_MANIFEST

Python · Faster-Whisper · WebRTC VAD · Silero VAD · Porcupine · FastAPI · Multiprocessing · Anthropic SDK

// OUTCOME_METRICS

PyPI Version
0.3.104+
Source Files
467
Latency
<300ms

Problem

Most speech-to-text libraries are either real-time (fast but low accuracy) or accurate (slow, batch-oriented). Voice agents need both: partial transcriptions streaming in real-time so the user sees their words as they speak, and a high-accuracy final transcription for the agent to actually respond to.

The existing Python options each solved one half of this. I needed one library that solved both, with voice activity detection, wake word support, and a clean enough API to drop into any agent.

Approach

RealtimeSTT is a multi-process Python library I maintain — forked from KoljaB's original work with custom integrations for the Hardcore agent stack. 467 source files, a 2,800-line AudioToTextRecorder core, and a dual-model transcription pipeline that runs a lightweight model for partials and a heavy model for finals, concurrently.

// ARCHITECTURE: Production Voice Pipeline

MIC INPUT (PyAudio)
→ WEBRTC VAD (frame-level gate)
→ SILERO VAD (model gate)
→ WAKE WORD (Porcupine · OpenWW)
→ SAFEPIPE (IPC channel)
→ RECORDER PROC (ring buffer · chunking)
→ REALTIME MODEL (Faster-Whisper tiny) + MAIN MODEL (Faster-Whisper large)
→ CALLBACKS (partial + final)
→ WEBSOCKET SERVER (FastAPI)
→ WEB UI (browser client) · ANTHROPIC SDK (Claude integration)
→ TEXT OUT (<300ms latency)

Build Notes

Dual-model transcription. A lightweight Faster-Whisper tiny model generates partial transcriptions every ~100ms for responsive UI feedback. In parallel, a Faster-Whisper large model waits for the VAD to close the utterance, then produces the final high-accuracy transcription. Both share the same audio buffer via SafePipe.
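The concurrency pattern can be sketched in a few lines. This is a simplified stand-in, not the library's actual code: the two `transcribe_*` functions are hypothetical placeholders for the tiny and large Faster-Whisper models, and text chunks stand in for audio so the control flow stays visible.

```python
import threading

# Hypothetical stand-ins for the two Faster-Whisper models; in the real
# pipeline these wrap a tiny model (partials) and a large model (finals).
def transcribe_tiny(chunks):     # fast, lower accuracy
    return " ".join(chunks).lower()

def transcribe_large(chunks):    # slow, high accuracy
    return " ".join(chunks).capitalize() + "."

class DualModelTranscriber:
    """Partials stream from the tiny model while audio accumulates;
    the large model runs once, after VAD closes the utterance."""
    def __init__(self, on_partial, on_final):
        self.on_partial, self.on_final = on_partial, on_final
        self.chunks, self.lock = [], threading.Lock()

    def feed(self, chunk):
        # Called every ~100 ms with new audio from the shared buffer.
        with self.lock:
            self.chunks.append(chunk)
            snapshot = list(self.chunks)
        self.on_partial(transcribe_tiny(snapshot))   # responsive UI update

    def utterance_end(self):
        # VAD detected silence: produce the high-accuracy final.
        with self.lock:
            snapshot, self.chunks = list(self.chunks), []
        self.on_final(transcribe_large(snapshot))

partials, finals = [], []
t = DualModelTranscriber(partials.append, finals.append)
for word in ["hello", "voice", "agent"]:
    t.feed(word)
t.utterance_end()
```

The key design point survives the simplification: partials are cheap and frequent, the final is expensive and runs exactly once per utterance, and both read the same accumulated buffer.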

Layered voice activity detection. WebRTC VAD does the cheap frame-level gate. Silero VAD runs a neural model for harder cases (background noise, non-speech audio). The two work in sequence — WebRTC rejects the easy noise, Silero catches what slips through.
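The two-stage gate looks like this in miniature. Both gates here are illustrative stand-ins: the real ones are webrtcvad's frame classifier and the Silero model's speech probability, but the sequencing, cheap filter first, neural filter second, is the point.

```python
def webrtc_gate(frame):
    # Stand-in for WebRTC VAD: a cheap check that rejects obvious silence.
    return frame["energy"] > 0.1

def silero_gate(frame):
    # Stand-in for Silero VAD: a neural model that separates speech
    # from loud non-speech noise.
    return frame["speech_prob"] > 0.5

def is_speech(frame):
    # WebRTC runs first because it is cheap; Silero only sees the
    # frames that slip through it.
    return webrtc_gate(frame) and silero_gate(frame)

frames = [
    {"energy": 0.02, "speech_prob": 0.90},  # quiet room: rejected by WebRTC
    {"energy": 0.80, "speech_prob": 0.10},  # loud music: rejected by Silero
    {"energy": 0.70, "speech_prob": 0.95},  # speech: passes both gates
]
flags = [is_speech(f) for f in frames]
```

Short-circuiting matters here: the expensive model never runs on frames the cheap gate already rejected.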

Wake word as a first-class input. Porcupine and OpenWakeWord are integrated as alternative recording triggers. The recorder can start on VAD (always-listening mode) or on wake word ("Hey Hardcore…") with a single config flag.
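The single-flag trigger selection can be sketched as a small state machine. The names below are illustrative, not the library's actual API; they show how one config value routes two different events to the same "start recording" action.

```python
from enum import Enum, auto

class Trigger(Enum):
    VAD = auto()        # always-listening: any detected speech starts recording
    WAKE_WORD = auto()  # recording starts only after the wake phrase

class Recorder:
    """Illustrative sketch, not the library's real class."""
    def __init__(self, trigger=Trigger.VAD):
        self.trigger = trigger
        self.recording = False

    def on_speech_detected(self):   # fired by the VAD stack
        if self.trigger is Trigger.VAD:
            self.recording = True

    def on_wake_word(self):         # fired by Porcupine / OpenWakeWord
        if self.trigger is Trigger.WAKE_WORD:
            self.recording = True

r = Recorder(trigger=Trigger.WAKE_WORD)
r.on_speech_detected()   # ignored: wake word mode is active
r.on_wake_word()         # "Hey Hardcore..." starts the recording
```

Flipping the flag to `Trigger.VAD` inverts which event arms the recorder, with no other code changes.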

Multiprocess architecture with SafePipe. The recorder, the realtime model, and the main model each run in their own process, so the heavy ML never contends with audio capture for the GIL. A custom SafePipe IPC layer moves audio and text between them without blocking the main thread. This separation is why the library can hit sub-300ms partial latency on consumer hardware.
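A minimal sketch of the idea behind such a wrapper, assuming the goal is simply "a pipe whose endpoints never raise into the caller." The real SafePipe differs; this shows one way to wrap `multiprocessing.Pipe` so a dying worker process cannot take down the main thread.

```python
import multiprocessing as mp

class SafePipe:
    """One-directional pipe whose send/recv swallow broken-pipe errors
    instead of propagating them into the audio thread. Sketch only."""
    def __init__(self):
        self._recv_end, self._send_end = mp.Pipe(duplex=False)

    def send(self, obj):
        try:
            self._send_end.send(obj)
            return True
        except (BrokenPipeError, OSError):
            return False   # peer died: report failure, don't crash

    def poll_recv(self, timeout=0.0):
        try:
            if self._recv_end.poll(timeout):
                return self._recv_end.recv()
        except (EOFError, OSError):
            pass           # peer closed its end: treat as "nothing yet"
        return None

pipe = SafePipe()
pipe.send(("audio_chunk", b"\x00" * 320))   # 10 ms of 16 kHz 16-bit PCM
msg = pipe.poll_recv(timeout=0.1)
```

The non-blocking `poll_recv` is what keeps the capture loop's cadence steady even when a downstream model process stalls or exits.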

FastAPI + WebSocket server + web UI. Shipped as a ready-to-run service. Clone the repo, run the server, open the HTML client, and you have a working voice interface. The Anthropic SDK integration I added hooks into this server, wiring Claude in for end-to-end voice-driven assistants.
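From the browser client's side, the partial/final split shows up as two message types over the WebSocket. The JSON field names below are illustrative, not the server's actual wire format; the sketch shows the client-side handling that makes live text overwrite in place while finals accumulate.

```python
import json

def handle_message(raw, ui_state):
    """Apply one WebSocket message to the client's display state."""
    msg = json.loads(raw)
    if msg["type"] == "partial":
        ui_state["live_text"] = msg["text"]          # overwrite as user speaks
    elif msg["type"] == "final":
        ui_state["transcript"].append(msg["text"])   # commit the utterance
        ui_state["live_text"] = ""                   # clear the live line
    return ui_state

state = {"live_text": "", "transcript": []}
handle_message(json.dumps({"type": "partial", "text": "hel"}), state)
handle_message(json.dumps({"type": "partial", "text": "hello wor"}), state)
handle_message(json.dumps({"type": "final", "text": "Hello world."}), state)
```

Each partial replaces the last rather than appending, which is what makes the UI read as "your words appearing as you speak" instead of a stutter of duplicates.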

PyPI published, versioned, documented. Installable via pip install RealtimeSTT. Versioned at 0.3.104+. Live on PyPI with download stats and release notes.

Results

Open-source library used in my own voice agent stack and available for anyone else who needs sub-300ms streaming transcription with high-accuracy finals. The multiprocess architecture is the pattern I reach for whenever a project needs real-time audio and heavy ML inference in the same Python application.