Real-time transcription, on your PC, in ~200 ms.

Whisper large-v3-turbo with CUDA acceleration + Silero VAD. Global F8 hotkey, minimalist floating pill, hybrid paste (clipboard + SendInput) that works even in stubborn apps. Zero network — audio never leaves your machine.

View source ↗

fig. 1 · JarvsTranscript · settings window

§ 01

why

Audio transcription needed to be cheaper than a cURL call.

Otter, Rev, Google Speech, any cloud API — they all demand network, account, and upload latency. Whisper.cpp is local but runs on slow CPU. We wanted: press F8, speak, release — and in ~200 ms the text appears in whatever app has focus. No cloud, no login, no upload. Your local GPU does the heavy lifting.

§ 02

features

What this product actually does.

⌥ № 01

Global F8 hotkey

Press it from anywhere — Discord, Slack, IDE, browser. Push-to-talk (hold) or toggle (on/off), your call.

~ № 02

Real-time partials

Sliding window of 1.5s with 200 ms overlap. You see the text being transcribed as you speak, not just at the end — live feedback without stutter.

◌ № 03

Automatic Silero VAD

In toggle mode, Silero VAD v5 detects natural end-of-speech and closes the session by itself. RMS-VAD fallback if Silero ever fails.

⌘ № 04

Hybrid paste

Clipboard primary + native SendInput fallback. Works even in apps that refuse Ctrl+V — some terminals, some IMEs, some RDP clients.

◍ № 05

Floating pill

420×72 px of UI. Animated VU meter, glass blur, lives in a screen corner. Doesn’t steal focus, doesn’t block clicks below.

⟲ № 06

Hot-reload settings

Swap model, device (CPU/GPU) or mode (PTT vs toggle) without restarting. Persisted in a versioned `data/config.json`.

§ 03

under the hood

Why these technical choices.

01 faster-whisper · CTranslate2 · CUDA

Local GPU, state-of-the-art model

faster-whisper runs Whisper on top of CTranslate2 with native FP16 on the NVIDIA GPU. Sub-unit real-time factor on large-v3-turbo — faster than speaking.

02 Silero VAD v5

Speech detection in ~2 MB

Lightweight deep learning model with 95%+ accuracy in PT-BR and EN. Decides when you stopped talking to close the session automatically.

03 Tauri 2 · Rust · WinAPI

Native Windows shell

Global hotkey via Win32. Clipboard via arboard. SendInput via enigo. Captures the focused window HWND before paste — guarantees the right characters land in the right app.

04 Python async · websockets

Async pipeline on localhost

Python asyncio backend serving websockets at 127.0.0.1:7979. Tauri frontend speaks a versioned, documented protocol. Zero outbound network, full stop.

§ 04

specs

The spec sheet.

Latency (release): ~200–800 ms
Default model: large-v3-turbo · ~1.5 GB
Audio format: PCM float32 · 16 kHz · mono
Platforms: Windows 11 · NVIDIA CUDA 12+
License: MIT
Current version: v0.0.1 · pre-release
Test suite: 183 verdes · 31 Rust · 48 Vitest · 104 pytest

§ 05

roadmap

Where we are · where we’re going.

[×] GPU STT (faster-whisper) done
[×] Global F8 hotkey done
[×] PTT + toggle modes done
[×] Silero VAD + RMS fallback done
[×] Hybrid paste done
[×] Floating pill done
[×] Settings UI with hot-reload done
[×] Single-instance + autostart done
[ ] User-customizable hotkey planned
[ ] PT-BR polish (punctuation + capitalization) planned

§ 06 · next The next COSLU product. COSLU Reader →