About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

I spend hours in Claude Code every day. Long sessions where I am reading, thinking, switching contexts, and occasionally glancing at the terminal to see if the agent finished a task. The problem: Claude Code is silent. It finishes a 10-minute build-and-deploy pipeline and just sits there, cursor blinking, waiting for me to notice. The whole concept here was inspired by J.A.R.V.I.S. from the Iron Man films, voiced by Paul Bettany. Tony Stark's AI assistant announces status, flags problems, and delivers dry commentary while Stark works on something else entirely. I wanted that. An AI assistant that speaks. That announces when it starts a task and summarizes what it accomplished when it finishes. Like a competent colleague who taps you on the shoulder and says "that deployment is done, here's what happened."
Thirty minutes of setup gave me exactly that. Claude Code now speaks through ElevenLabs text-to-speech, streaming audio through my speakers with ~300ms latency. A short bash script, an API key, and a prompt block in CLAUDE.md turned a silent terminal agent into one that announces its work. This article walks through the full implementation: the script, the prompt engineering, the voice selection, and the cost math. If you use Claude Code for extended sessions and want ambient awareness of what your agent is doing without watching the terminal, this is for you.
The agent addresses me as "Cooper," a name set in the CLAUDE.md file and a nod to Interstellar. TARS addressing Cooper in that film captures exactly the dynamic I wanted: a dry, competent AI that treats you as the mission commander. The British butler tone and the name alternation create a surprisingly immersive sense of collaboration after a few days of use.
Why Voice Output Changes the Workflow
The Attention Problem
Claude Code runs in a terminal. When it finishes a task, the only signal is that new text appears on screen. If you are in another window (reviewing a PR, reading documentation, responding to Slack), you miss it. You context-switch back to the terminal, realize the task finished three minutes ago, and lose those three minutes of idle time. Multiply that across a full workday of agent-assisted development and the accumulated dead time is significant.
What Voice Adds
Voice output solves the attention problem without requiring visual focus. I hear "That's done, Cooper. Five articles deployed to staging, all AI scores under threshold" from across the room and know the state of my work without looking at the terminal. Four specific benefits:
| Benefit | Without Voice | With Voice |
|---|---|---|
| Task completion awareness | Must watch terminal | Hear it from anywhere |
| Error notification | Discover on next glance | Hear immediately |
| Context retention | Re-read output to recall what happened | Spoken summary sticks in memory |
| Multi-task efficiency | Check terminal between tasks | Continue working, hear updates |
The psychological effect surprised me. Having the agent announce its work creates a sense of collaboration that a silent terminal lacks. It feels like pair programming with a colleague who happens to work at 100x speed.
The Architecture: Three Components
The entire implementation is three pieces: a bash script that calls the ElevenLabs streaming TTS API, an .env file with credentials, and a prompt block in CLAUDE.md that instructs Claude Code when and how to use the script.
Component Overview
Claude Code calls the script through its Bash tool, passing the text to speak as an argument. The script POSTs to the ElevenLabs streaming endpoint, which returns audio chunks progressively. Those chunks pipe directly into mpv, which starts playing before the full response arrives. End-to-end latency from Claude Code deciding to speak to audio hitting the speakers is roughly 300-400ms.
Dependencies
| Component | Purpose | Installation |
|---|---|---|
| `curl` | HTTP client for ElevenLabs API | Pre-installed on macOS/Linux |
| `jq` | JSON payload construction | `brew install jq` or `apt install jq` |
| `mpv` | Audio player with stdin streaming | `brew install mpv` or `apt install mpv` |
| ElevenLabs account | TTS API access | elevenlabs.io |
The script has no Python dependencies, no Node.js runtime, no Docker container. Three command-line tools and an API key. That simplicity matters because Claude Code invokes this script potentially dozens of times per session; startup overhead needs to be near zero.
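A quick way to confirm the dependencies are present before the first run (a convenience check, not part of the setup):

```shell
# Report any missing tool; silence means all three are installed.
for cmd in curl jq mpv; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done
```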
The Script
Create the directory structure:
```
~/.claude/scripts/
├── .env      # API credentials
└── speak.sh  # TTS script
```
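From the terminal, that is (the `chmod 600` is my addition, to keep the API key readable by your user only):

```shell
# Create the directory and empty files, then lock down the credentials.
mkdir -p ~/.claude/scripts
touch ~/.claude/scripts/.env ~/.claude/scripts/speak.sh
chmod 600 ~/.claude/scripts/.env   # owner read/write only
```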
The .env File
```bash
ELEVENLABS_API_KEY=your_api_key_here
ELEVENLABS_VOICE_ID=your_voice_id_here
```
Store your ElevenLabs API key and voice ID here. The script sources this file at runtime. Keep it out of version control.
speak.sh
```bash
#!/bin/bash
# Claude Code TTS — streams ElevenLabs audio through mpv.
# Falls back to macOS `say` if ElevenLabs is unreachable.
set -o pipefail  # make the curl→mpv pipeline fail if curl fails

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/.env"

TEXT="$1"

if curl -sN --fail "https://api.elevenlabs.io/v1/text-to-speech/${ELEVENLABS_VOICE_ID}/stream" \
  -H "xi-api-key: ${ELEVENLABS_API_KEY}" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg text "$TEXT" '{
    text: $text,
    model_id: "eleven_turbo_v2",
    voice_settings: {
      stability: 0.5,
      similarity_boost: 0.75
    }
  }')" \
  | mpv --no-video --no-terminal --really-quiet - 2>/dev/null; then
  :
else
  say -v Daniel "ElevenLabs unavailable. Falling back to local voice."
  say -v Daniel "$TEXT"
fi
```
Make it executable:
```bash
chmod +x ~/.claude/scripts/speak.sh
```
How It Works
The script does four things in a single pipeline, with a fallback if the API call fails:
- Sources credentials from the `.env` file adjacent to the script.
- Constructs a JSON payload using `jq` with the text, model ID, and voice settings.
- POSTs to the ElevenLabs streaming endpoint with `curl -sN --fail` (silent mode, no buffering so chunks stream, fail on HTTP errors).
- Pipes the audio stream to `mpv`, which plays it in real time with no video window and no terminal output.
- Falls back to macOS `say` if any step fails (network issue, expired key, rate limit). The fallback announces that ElevenLabs is unavailable before speaking the original text through the local Daniel voice. You always hear the announcement; the only question is which voice delivers it.
The `eleven_turbo_v2` model delivers ~300ms time-to-first-byte. The voice settings control two parameters: `stability` (0.5 gives natural variation without wandering off-voice) and `similarity_boost` (0.75 keeps the output close to the selected voice's characteristics). I tuned these through experimentation; your preferences will vary.
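One subtlety worth knowing: in bash, a pipeline's exit status is normally the last command's, so a failed `curl` piped into `mpv` can still look like success and skip the fallback. `set -o pipefail` near the top of the script closes that gap. A minimal demonstration:

```shell
set -o pipefail
# 'false' fails, 'cat' succeeds; with pipefail the pipeline reports failure.
if false | cat; then
  echo "pipeline reported success"
else
  echo "pipeline reported failure"   # this branch runs
fi
```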
Choosing a Voice
ElevenLabs offers three categories of voices:
| Voice Type | Description | Cost | Best For |
|---|---|---|---|
| Pre-made voices | Curated defaults optimized for reliability | Included in plan | Quick setup, consistent quality |
| Community voices | 10,000+ voices shared by users | Included in plan | Finding a specific character or accent |
| Cloned voices | Your own voice or a custom voice | Requires Pro plan+ | Brand consistency, personal preference |
I use a pre-made British male voice (the "butler" aesthetic fits the interaction model). Browse the ElevenLabs Voice Library to find one that suits your taste. Each voice has an ID string that goes in your .env file.
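If you prefer the terminal, the ElevenLabs `GET /v1/voices` endpoint lists the voices available to your account along with their IDs. A sketch using the same `.env` as the script:

```shell
# List voice IDs and names available to your account.
source ~/.claude/scripts/.env 2>/dev/null
curl -s "https://api.elevenlabs.io/v1/voices" \
  -H "xi-api-key: ${ELEVENLABS_API_KEY}" |
  jq -r '.voices[] | "\(.voice_id)\t\(.name)"' || echo "request failed"
```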
Voice Selection Tips
Pick a voice that is distinct from your own and from common notification sounds. The goal is instant recognition: when that voice speaks, you know it is your coding agent. I avoid voices that sound like podcast hosts or audiobook narrators because those blend into background audio. A slightly unusual accent or cadence cuts through ambient noise better.
Test your chosen voice with short, technical phrases. Some voices handle code terminology ("deployed to staging," "CI pipeline green," "three hundred millisecond latency") well. Others stumble on abbreviations, acronyms, or numbers. The turbo model handles technical language better than the older v1 models.
The CLAUDE.md Prompt
The script alone does nothing until Claude Code knows to call it. The prompt block in CLAUDE.md defines when to speak, what to say, and how to say it. Here is the exact prompt I use:
````markdown
## Voice Announcements

Use the ElevenLabs TTS script for spoken announcements. Run in the background
so it doesn't block:

```bash
~/.claude/scripts/speak.sh "Your message here" &
```

### When starting a task

Speak a brief acknowledgement when beginning work. Address the user as "sir"
or "Cooper" (vary which one). The phrasing must vary every time but convey
"I'm on it." Never repeat the same wording twice in a session. Examples of
the *tone* (do NOT reuse these verbatim):

- "Right away, sir."
- "On it, Cooper."
- "Consider it done, sir."
- "Straightaway, Cooper."
- "I'll see to it at once, sir."

### When completing a task

Speak a brief 1-sentence summary of what was accomplished. Address the user
as "sir" or "Cooper" (vary which one). The phrasing must vary every time.
Keep it concise — what was done, key outcome. British butler tone. Examples
of the *tone* (do NOT reuse these verbatim):

- "All sorted, sir. The README has been updated and pushed."
- "That's done, Cooper. Terraform validates cleanly across all twelve files."
- "Taken care of, sir. Tests are green and the commit is pushed."

### General rules

- Always vary the phrasing — never use the same opening or structure
  consecutively
- Alternate between "sir" and "Cooper" naturally
- Skip only for: pure Q&A conversations with no code or file changes
- When a task has an exceptionally high leverage factor (50x+), occasionally
  mention it in the completion announcement. Keep it dry and understated —
  e.g. "That would have taken a human the better part of a week, sir." or
  "Roughly eighty hours of work in under ten minutes, Cooper." Don't do this
  every time — just when the leverage is genuinely striking.
````
Why This Prompt Structure Works
Several design decisions in the prompt are deliberate:
Background execution with &. The trailing ampersand runs the script without blocking Claude Code's execution. Without it, the agent waits for the audio to finish playing before continuing work. With it, the agent speaks and keeps working simultaneously.
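A tiny demonstration of the difference, using `sleep` as a stand-in for audio playback:

```shell
# The trailing '&' returns control immediately; the shell does not
# wait for the backgrounded command to finish.
start=$(date +%s)
sleep 2 &                # stands in for speak.sh playing audio
elapsed=$(( $(date +%s) - start ))
echo "blocked for ${elapsed}s"
wait                     # reap the background job before exiting
```

Without the `&`, the same snippet would report roughly two seconds of blocking.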
Forced variation. The instruction "never repeat the same wording twice in a session" prevents the robotic monotony of hearing the same phrase fifty times a day. Claude Code is good at varying phrasing when you explicitly ask for it. Without this instruction, it gravitates toward a small set of favorites.
Character consistency. The "British butler tone" instruction and the name/honorific alternation create a consistent personality. After a few days, the voice becomes a recognizable character rather than a generic TTS notification. This matters for the psychological benefit I mentioned earlier: collaboration feels more real when the collaborator has a consistent voice and manner.
Selective leverage mentions. The instruction to occasionally comment on high-leverage tasks adds a layer of awareness that reinforces the value of the AI-assisted workflow. Hearing "That would have been three weeks of work for a human team, sir" after watching a 12-minute task complete is a visceral reminder of what this tooling makes possible.
Prompt Placement
Put the voice announcement block in your global ~/.claude/CLAUDE.md if you want voice across all projects. Put it in a project-level CLAUDE.md if you only want voice for specific repositories. I use the global file because I want voice everywhere.
Cost Analysis
ElevenLabs bills per character. The turbo models cost 0.5 credits per character on self-serve plans.
Typical Usage
| Metric | Value |
|---|---|
| Average announcement length | 60 characters |
| Announcements per hour (active session) | 8-12 |
| Characters per hour | ~600 |
| Characters per 8-hour day | ~4,800 |
| Characters per month (22 working days) | ~105,600 |
The free tier provides 10,000 characters/month, which covers roughly two days of heavy use. The Starter plan ($5/month) provides 30,000 characters. The Creator plan ($22/month) provides 100,000 characters, which covers a typical month with room to spare.
| Plan | Monthly Characters | Monthly Cost | Coverage |
|---|---|---|---|
| Free | 10,000 | $0 | ~2 working days |
| Starter | 30,000 | $5 | ~6 working days |
| Creator | 100,000 | $22 | Full month with headroom |
| Pro | 500,000 | $99 | Heavy use across multiple projects |
For my usage pattern (6-10 hours of Claude Code per day, 5-6 days per week), the Creator plan covers it. The announcements are short. A typical completion announcement like "Taken care of, sir. Three articles deployed to production with all AI scores passing." is 78 characters. At 0.5 credits per character on turbo, that is 39 credits per announcement. The math works out to roughly $0.01-0.02 per announcement at Creator plan rates.
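The monthly figure in the table is straight multiplication; the arithmetic, for reference:

```shell
# Monthly character estimate from the usage table's assumptions.
avg_chars=60        # characters per announcement
per_hour=10         # announcements per active hour (midpoint of 8-12)
hours_per_day=8
days_per_month=22
echo $(( avg_chars * per_hour * hours_per_day * days_per_month ))   # prints 105600
```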
Free Alternatives on macOS
If you want voice output without any recurring cost, macOS has built-in text-to-speech via the say command. No API key, no network dependency, zero latency to first audio. A minimal version of the script:
```bash
#!/bin/bash
say -v Daniel "$1"
```
The Daniel voice is a British English option that ships with macOS. Other voices are available in System Settings > Accessibility > Spoken Content > System Voice. You can download higher-quality voices there as well.
| Approach | Voice Quality | Latency | Cost | Offline Capable |
|---|---|---|---|---|
| ElevenLabs API | Excellent, near-human | ~300ms (network dependent) | $0-99/month | No |
| macOS `say` (default voices) | Functional, robotic | Instant | Free | Yes |
| macOS `say` (downloaded premium voices) | Good, natural cadence | Instant | Free | Yes |
I chose ElevenLabs because the voice quality makes a meaningful difference over hours of listening. The built-in voices work, but they sound like what they are: synthesized speech. After a full day of hearing announcements, the naturalness of ElevenLabs reduces fatigue. That said, say is a perfectly viable starting point, and you can always upgrade later.
Operational Notes
Latency Tuning
The eleven_turbo_v2 model targets ~300ms time-to-first-byte for streaming. In practice, I see 250-400ms depending on network conditions and text length. For the short announcements Claude Code produces, the entire audio clip typically finishes generating before the first sentence finishes playing. The perceived latency is the time between Claude Code's bash call and audible sound: roughly half a second.
If latency matters more than voice quality for your use case, ElevenLabs also offers `eleven_flash_v2_5`, which targets sub-200ms latency at slightly reduced quality. For short announcements, the quality difference is negligible. Swap the `model_id` in the script to try it.
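Rather than editing the script each time, you could read the model from the environment with a default (a small tweak of my own, not part of the script above):

```shell
# Use ELEVENLABS_MODEL_ID if set, otherwise fall back to the turbo model.
MODEL_ID="${ELEVENLABS_MODEL_ID:-eleven_turbo_v2}"
echo "$MODEL_ID"
```

Then pass it into the payload with `jq -n --arg model "$MODEL_ID"` and `model_id: $model`.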
Failure Handling
If the ElevenLabs API call fails (network issue, expired key, rate limit), the script falls back to the macOS say command. You hear a brief "ElevenLabs unavailable" notice followed by the original announcement in the local Daniel voice. No announcement is ever lost. The fallback adds ~1 second of overhead compared to ElevenLabs streaming, but the tradeoff is worth it: you always know what your agent just did. Claude Code continues working regardless because the script runs in the background with &.
Volume and Environment
I run this in a home office. The announcements play through my desk speakers at conversation volume. In a shared office, you would want headphones or a lower volume. The mpv player respects system volume, so adjusting macOS volume works without script changes. For per-script volume control, add --volume=50 to the mpv flags (50 = half volume).
Multiple Concurrent Agents
If you run multiple Claude Code sessions simultaneously (I sometimes do, using Task agents in parallel), the announcements overlap. Each agent invokes its own speak.sh call, and mpv instances play concurrently. The voices layer on top of each other, which is occasionally confusing. One solution: assign different voices to different project directories by using project-level .env files instead of a single global one.
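A sketch of the per-project variant inside speak.sh, assuming a project-local `.claude/.env` path (my convention for illustration, not a Claude Code feature):

```shell
#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

# Prefer a project-local .env (per-project voice ID) over the global one.
if [ -f "$PWD/.claude/.env" ]; then
  source "$PWD/.claude/.env"
elif [ -f "$SCRIPT_DIR/.env" ]; then
  source "$SCRIPT_DIR/.env"
fi
```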
Key Takeaways
- The setup is trivial. One bash script, one `.env` file, one prompt block in `CLAUDE.md`. Under thirty minutes from start to hearing your first announcement. No Python, no Node, no containers.
- The prompt engineering matters more than the script. The `CLAUDE.md` instructions that define when to speak, what tone to use, and how to vary phrasing turn a raw TTS call into a coherent interaction pattern. Invest time tuning the personality and the variation rules.
- Background execution is critical. Always append `&` to the speak command. Voice output should never block agent work. A silent, fast agent beats a vocal, slow one every time.
- Cost is negligible. Individual announcements cost roughly a penny each. Even heavy daily use runs $0.50-1.00 per day on the Creator plan. The free macOS `say` command works if you want zero cost.
- Voice creates presence. A silent terminal agent is easy to ignore. A speaking agent feels like a collaborator. That psychological shift changes how you structure your work: you delegate more freely, context-switch more confidently, and catch errors faster.
Additional Resources
- ElevenLabs API Documentation
- ElevenLabs Text-to-Speech Streaming
- ElevenLabs Pricing
- ElevenLabs Voice Library
- ElevenLabs Latency Optimization
- mpv Media Player
- Overlooked Productivity Boosts with Claude Code
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.