← all apps v0.0.1 · released
JarvsTranscript // built by coslu labz

Real-time transcription, on your PC, in ~200 ms.

Whisper large-v3-turbo with CUDA acceleration + Silero VAD. Global F8 hotkey, minimalist floating pill, hybrid paste (clipboard + SendInput) that works even in stubborn apps. Zero network — audio never leaves your machine.

JarvsTranscript · settings window
fig. 1 · JarvsTranscript · settings window
§ 01
why

Audio transcription needed to be cheaper than a cURL call.

Otter, Rev, Google Speech, any cloud API — they all demand network, account, and upload latency. Whisper.cpp is local but runs on slow CPU. We wanted: press F8, speak, release — and in ~200 ms the text appears in whatever app has focus. No cloud, no login, no upload. Your local GPU does the heavy lifting.

§ 02
features

What this product actually does.

№ 01

Global F8 hotkey

Press it from anywhere — Discord, Slack, IDE, browser. Push-to-talk (hold) or toggle (on/off), your call.

~ № 02

Real-time partials

Sliding window of 1.5s with 200 ms overlap. You see the text being transcribed as you speak, not just at the end — live feedback without stutter.

№ 03

Automatic Silero VAD

In toggle mode, Silero VAD v5 detects natural end-of-speech and closes the session by itself. RMS-VAD fallback if Silero ever fails.

№ 04

Hybrid paste

Clipboard primary + native SendInput fallback. Works even in apps that refuse Ctrl+V — some terminals, some IMEs, some RDP clients.

№ 05

Floating pill

420×72 px of UI. Animated VU meter, glass blur, lives in a screen corner. Doesn’t steal focus, doesn’t block clicks below.

№ 06

Hot-reload settings

Swap model, device (CPU/GPU) or mode (PTT vs toggle) without restarting. Persisted in a versioned `data/config.json`.

§ 03
under the hood

Why these technical choices.

01 faster-whisper · CTranslate2 · CUDA

Local GPU, state-of-the-art model

faster-whisper runs Whisper on top of CTranslate2 with native FP16 on the NVIDIA GPU. Sub-unit real-time factor on large-v3-turbo — faster than speaking.

02 Silero VAD v5

Speech detection in ~2 MB

Lightweight deep learning model with 95%+ accuracy in PT-BR and EN. Decides when you stopped talking to close the session automatically.

03 Tauri 2 · Rust · WinAPI

Native Windows shell

Global hotkey via Win32. Clipboard via arboard. SendInput via enigo. Captures the focused window HWND before paste — guarantees the right characters land in the right app.

04 Python async · websockets

Async pipeline on localhost

Python asyncio backend serving websockets at 127.0.0.1:7979. Tauri frontend speaks a versioned, documented protocol. Zero outbound network, full stop.

§ 04
specs

The spec sheet.

Latency (release)
~200–800 ms
Default model
large-v3-turbo · ~1.5 GB
Audio format
PCM float32 · 16 kHz · mono
Platforms
Windows 11 · NVIDIA CUDA 12+
License
MIT
Current version
v0.0.1 · pre-release
Test suite
183 verdes · 31 Rust · 48 Vitest · 104 pytest
§ 05
roadmap

Where we are · where we’re going.

  • [×] GPU STT (faster-whisper) done
  • [×] Global F8 hotkey done
  • [×] PTT + toggle modes done
  • [×] Silero VAD + RMS fallback done
  • [×] Hybrid paste done
  • [×] Floating pill done
  • [×] Settings UI with hot-reload done
  • [×] Single-instance + autostart done
  • [ ] User-customizable hotkey planned
  • [ ] PT-BR polish (punctuation + capitalization) planned
§ 06 · next The next COSLU product. COSLU Reader