Real-time transcription, on your PC, in ~200 ms.
Whisper large-v3-turbo with CUDA acceleration + Silero VAD. Global F8 hotkey, minimalist floating pill, hybrid paste (clipboard + SendInput) that works even in stubborn apps. Zero network — audio never leaves your machine.
Audio transcription needed to be cheaper than a cURL call.
Otter, Rev, Google Speech, any cloud API — they all demand network, account, and upload latency. Whisper.cpp is local but runs on slow CPU. We wanted: press F8, speak, release — and in ~200 ms the text appears in whatever app has focus. No cloud, no login, no upload. Your local GPU does the heavy lifting.
What this product actually does.
Global F8 hotkey
Press it from anywhere — Discord, Slack, IDE, browser. Push-to-talk (hold) or toggle (on/off), your call.
Real-time partials
Sliding window of 1.5s with 200 ms overlap. You see the text being transcribed as you speak, not just at the end — live feedback without stutter.
Automatic Silero VAD
In toggle mode, Silero VAD v5 detects natural end-of-speech and closes the session by itself. RMS-VAD fallback if Silero ever fails.
Hybrid paste
Clipboard primary + native SendInput fallback. Works even in apps that refuse Ctrl+V — some terminals, some IMEs, some RDP clients.
Floating pill
420×72 px of UI. Animated VU meter, glass blur, lives in a screen corner. Doesn’t steal focus, doesn’t block clicks below.
Hot-reload settings
Swap model, device (CPU/GPU) or mode (PTT vs toggle) without restarting. Persisted in a versioned `data/config.json`.
Why these technical choices.
Local GPU, state-of-the-art model
faster-whisper runs Whisper on top of CTranslate2 with native FP16 on the NVIDIA GPU. Sub-unit real-time factor on large-v3-turbo — faster than speaking.
Speech detection in ~2 MB
Lightweight deep learning model with 95%+ accuracy in PT-BR and EN. Decides when you stopped talking to close the session automatically.
Native Windows shell
Global hotkey via Win32. Clipboard via arboard. SendInput via enigo. Captures the focused window HWND before paste — guarantees the right characters land in the right app.
Async pipeline on localhost
Python asyncio backend serving websockets at 127.0.0.1:7979. Tauri frontend speaks a versioned, documented protocol. Zero outbound network, full stop.
The spec sheet.
- Latency (release)
- ~200–800 ms
- Default model
- large-v3-turbo · ~1.5 GB
- Audio format
- PCM float32 · 16 kHz · mono
- Platforms
- Windows 11 · NVIDIA CUDA 12+
- License
- MIT
- Current version
- v0.0.1 · pre-release
- Test suite
- 183 verdes · 31 Rust · 48 Vitest · 104 pytest
Where we are · where we’re going.
- [×] GPU STT (faster-whisper) done
- [×] Global F8 hotkey done
- [×] PTT + toggle modes done
- [×] Silero VAD + RMS fallback done
- [×] Hybrid paste done
- [×] Floating pill done
- [×] Settings UI with hot-reload done
- [×] Single-instance + autostart done
- [ ] User-customizable hotkey planned
- [ ] PT-BR polish (punctuation + capitalization) planned