I Needed Voice-to-Text on Linux. I Built One in Under an Hour With AI.
I switched to Linux and lost my voice-to-text tools. So I asked Claude Code to build one. Working product in 50 minutes.

I talk to my computer constantly. Not in a “hey Siri” way. Full paragraphs, long prompts, entire thoughts that I don’t want to type out character by character. Voice-to-text is how I work. So when I switched from macOS to Linux, the first thing I noticed was the gap.
How was I supposed to work without being able to talk to my IDE, say what I want, and have it typed for me?
That’s not an exaggeration. It genuinely felt like going back two years.
The problem
On macOS, I had tried SuperWhisper (free tier with a local model, but the quality wasn’t great) and then switched to Wispr Flow, which was way better. I was paying about R$24/month for it. Push-to-talk voice input: hold a button, talk, release, and the transcribed text appears wherever your cursor is. Simple, fast, reliable.
I had a laptop sitting around and wanted an isolated environment to run some bots and test local AI models. I set up Linux on it (Pop!_OS, for its native NVIDIA driver support). But once I started working on it, I realized: neither SuperWhisper nor Wispr Flow exists for Linux.
I looked around. There are alternatives. OpenWhispr is actually pretty decent, and there’s Nerd Dictation and a few other projects. I could have spent time evaluating and configuring them. But I had a different question in my head:
How fast could I build exactly what I want with AI?
The experiment
So instead of evaluating alternatives, I opened Claude Code and typed one sentence:
Can we create something similar (and cheap or free) for Linux? I just want to press a button and say something in English or Portuguese and have it paste when I release the button.
That was the entire spec. I didn’t write requirements, didn’t draw diagrams, didn’t plan sprints. I figured if AI coding tools are as good as people say, let’s see what happens when you just describe what you need and hit enter.
One thing worth noting: I wasn’t sitting there watching the screen. I was running evals on another project and working on a third thing at the same time. This was a background experiment.
Working product in 50 minutes
Here’s how it actually went:

The implementation (planning, writing Go code, compiling, passing tests) took 23 minutes. That’s a complete push-to-talk voice transcription tool with audio capture, API integration, clipboard management, and global hotkey support.
But then I tried to use it. And nothing pasted.
The Wayland paste saga
This is where the story gets interesting, because this is what AI-assisted development actually looks like. Not the Twitter demos where everything works on the first try.
Here’s the thing about Pop!_OS: it runs the COSMIC desktop on Wayland. This is a cutting-edge compositor, and most Linux tools for simulating keyboard input still assume X11. Here’s what we tried, in order:
- xdotool — X11 only. Silent failure on Wayland.
- ydotool — Needs ydotoold daemon running. Wasn’t.
- wtype — Needs wlr-virtual-keyboard protocol. COSMIC doesn’t support it.
- dotool — Compiled from source. Same protocol issues.
- ydotool type — ASCII only. Garbled any UTF-8 text (which I need for Portuguese).
Five tools. All failed. Each one for a different reason.
The breakthrough came when all the queued test pastes suddenly appeared at once. Turns out COSMIC needs about 2 seconds to recognize a new virtual keyboard before accepting input from it.
The solution: pre-warm a uinput virtual keyboard at startup using Go’s bendahl/uinput library, give COSMIC 2 seconds to register it, then use wl-copy for the clipboard and simulate Ctrl+Shift+V through the pre-warmed keyboard. It’s not elegant. It works.
The whole debugging loop (“is it pasting?” “no” “let me try ydotool” “still no” “what about wtype?” “nope”) took 26 minutes. Almost as long as writing the entire application. And the human-in-the-loop was essential here. Claude couldn’t test if paste was working. I had to be the one saying “it’s not appearing on screen, try something else.”
The Groq moment
The first version used a local whisper-base model (147MB). The transcription quality was… creative:
What I said: “Can it detect if it’s a terminal window or not and paste it”
What it heard: “If the tech defeats a turn my window will not and paste it”
Not useful. I told Claude to switch to an API-based model, and it added Groq’s free tier running whisper-large-v3. The difference was night and day.
I tested it in Portuguese to see if it could handle bilingual input:
Será que ele consegue entender o que eu falo em português também? Nossa, ele é bem rápido! (Can it understand what I say in Portuguese too? Wow, it’s really fast!)
Perfect transcription. Both languages. Free.
Accessibility by default
The first version of the tray icon used color-coded dots: grey for idle, red for recording, yellow for transcribing.
I’m color blind. So grey, red, and yellow were all just… dots.
I asked for shape-based indicators instead.
Now I can tell the state at a glance without needing to distinguish colors. AI defaults to the common case, and the common case isn’t universal.
5 minutes to evolve
After the initial build, I wanted a vocabulary system (custom words that the model should recognize, like project names and technical terms). Adding that feature: 5 minutes.
Later I wanted to swap from Groq to OpenAI’s whisper model for comparison. Swapping the backend: 5 minutes.
This is the long-term argument for owning your tools. When you control the code, every change is a conversation, not a feature request in someone else’s backlog. No waiting for a product team to prioritize your edge case. No discovering that the feature you need is on the “Enterprise” tier.
The cost comparison
| | SuperWhisper | Wispr Flow | Sussurai |
|---|---|---|---|
| Price | Free (local) / $8.49/mo (pro) | ~R$24/mo | Free |
| Platform | macOS only | macOS only | Linux |
| Source | Closed | Closed | Open (MIT) |
| Backend | Local Whisper | Local + API | Groq / OpenAI / Local |
| Build time | — | — | ~50 minutes |
But the price wasn’t the problem. The platform was. Wispr Flow could be free and I still couldn’t use it on Linux. I didn’t build Sussurai to save R$24/month. I built it because the tool I was using didn’t exist for my platform, and I wanted to test whether AI-assisted coding could close that gap in an afternoon.
What AI coding actually looks like
I want to be honest about what happened here, because the “AI writes perfect code” narrative is wrong. But the “AI is useless” narrative is also wrong. What actually happens is more interesting than either.
23 minutes of AI writing code. Planning architecture, choosing libraries, implementing features, writing tests. This part was genuinely fast and genuinely useful.
26 minutes of debugging together. Me testing, reporting back, Claude trying another approach, me testing again. Five different paste tools attempted. Someone had to look at the screen and say “nope, still broken.” That was me.
That ratio says a lot. AI is fast at generating code. The hard part is the last mile: system integration, platform quirks, edge cases that only surface when you actually run the thing. You still need someone who understands what’s happening and can guide the debugging. The AI didn’t know COSMIC needs 2 seconds to register a virtual keyboard. We figured that out together.
One important caveat: this was a personal tool, built from scratch, for my own machine. Greenfield, clear requirements, controlled environment. That’s why this approach worked so well. For production code, I’d take a very different path. Spec-driven development, more upfront design, different tradeoffs entirely. I might write about that in a future post.
But even in this context, the thing that made it work was tests. Claude Code wrote tests alongside the implementation, and that’s what gave me confidence that the pieces actually worked. When you’re building with AI at this speed, tests are your safety net. Without them, you’re just hoping the generated code does what you think it does. Something close to TDD is, I believe, the right approach when building with AI: write the tests, let the AI make them pass, verify the behavior. That loop is what keeps the speed without sacrificing correctness.
The broader point
For utility software (push-to-talk dictation, PDF converters, clipboard managers, screenshot tools), the build-vs-buy equation is shifting. Not for everything. Salesforce isn’t going anywhere. Figma isn’t going anywhere (for now…). Complex software with deep integrations and years of edge-case handling still requires large teams and sustained effort.
But for a tool that does one thing? If you can describe what you need in one sentence, you can probably build it in an afternoon. The bottleneck has moved from “can I write this code” to “do I know what I need.” Code is becoming free. Engineering judgment isn’t.
I use Sussurai every day now. It’s how I wrote parts of this post. It’s not as polished as SuperWhisper. No fancy UI, no app store listing, no onboarding flow. But it does exactly what I need: I press a button, I talk, the text appears. And when I want to change something, I just ask.
The source code is on GitHub. MIT license. If you’re on Linux and you’ve been missing good voice input, give it a try.