Discord Music Agent
An LLM-powered Discord music bot that understands vibes, not just song titles — ask for "some chill vibes" and Gemini curates the queue.



Most Discord music bots are glorified search boxes — you give them an exact title, they play it. This one treats requests the way a friend would: you describe a mood, a fragment of lyrics, or a half-remembered song, and a Gemini agent figures out what you actually want.
Under the hood, each Discord server gets its own MusicAgent instance that owns the voice connection, audio player, and queue. Slash commands that are unambiguous (/skip, /pause) bypass the LLM entirely to save tokens — only the fuzzy requests go through Gemini. The agent can clarify ambiguous asks, suggest tracks for a mood, curate full playlists on demand, and politely refuse non-music requests without wasting a token on them.
It's self-hosted on purpose: YouTube blocks datacenter IPs, so the bot runs on a home server or a Raspberry Pi. A deliberate constraint that kept the scope honest.
Highlights
- Per-guild MusicAgent with isolated voice connection, queue, and audio state
- Gemini-powered intent routing — play, clarify, suggest, curate, or reject
- AI playlist curation: "/playlist 90s road trip" → a 10–15 track queue
- Direct commands skip the LLM entirely to stay token-efficient
- Channel lock, auto-leave after 5 minutes idle, structured Pino logging with secret redaction
- Zod env validation — fails fast at startup if anything's missing
Why this exists
Every Discord music bot I'd used had the same problem: they were search engines with a play button. You had to know exactly what you wanted — the right title, the right artist, spelled the right way. If you asked for "that sad song from the movie," you got nothing.
I wanted a bot that worked the way a friend would. You say "play some chill vibes," it picks something. You say "that one song that goes da-da-da-daaa," it tries to figure it out. You say "what's the weather," it politely declines instead of burning a token pretending to search.
So I built one around a Gemini agent.
How it thinks
The core idea is intent routing. Every /play request goes through a GeminiAgent that classifies it into one of five outcomes:
- play — specific request, hand it straight to YouTube and queue it
- clarify — ambiguous, ask the user to pick from a shortlist
- suggest — vague mood, offer a few curated picks
- playlist — full curation request, build a 10–15 track queue
- reject — not a music request, decline without wasting tokens
The unambiguous commands — /skip, /pause, /stop, /volume — bypass the LLM entirely. There's no reason to pay Gemini to tell you that "skip" means skip. Only the fuzzy inputs get the AI treatment.
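To make the five outcomes concrete, here is a minimal sketch of how they could be modelled as a discriminated union, with a parser for the model's reply. The field names (`query`, `options`, `picks`, `tracks`, `reason`) and the `parseIntent` helper are assumptions for illustration, not the project's actual types, and it assumes Gemini is prompted to answer in JSON.

```typescript
// Hypothetical shape for the five outcomes; field names are assumptions.
type Intent =
  | { kind: "play"; query: string }
  | { kind: "clarify"; options: string[] }
  | { kind: "suggest"; picks: string[] }
  | { kind: "playlist"; tracks: string[] }
  | { kind: "reject"; reason: string };

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((x) => typeof x === "string");

// Parse the model's raw JSON reply into an Intent. Anything malformed
// falls back to "reject" so bad output never reaches the player.
function parseIntent(raw: string): Intent {
  const fallback: Intent = { kind: "reject", reason: "unparseable model reply" };
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return fallback;
  }
  if (typeof data !== "object" || data === null) return fallback;

  const { kind, query, options, picks, tracks, reason } = data as Record<string, unknown>;
  switch (kind) {
    case "play":
      return typeof query === "string" ? { kind: "play", query } : fallback;
    case "clarify":
      return isStringArray(options) ? { kind: "clarify", options } : fallback;
    case "suggest":
      return isStringArray(picks) ? { kind: "suggest", picks } : fallback;
    case "playlist":
      return isStringArray(tracks) ? { kind: "playlist", tracks } : fallback;
    case "reject":
      return { kind: "reject", reason: typeof reason === "string" ? reason : "not a music request" };
    default:
      return fallback;
  }
}
```

Treating every unexpected reply as a reject is one possible design choice: the worst case becomes a polite refusal rather than a broken queue.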
Architecture
Each Discord server gets its own MusicAgent instance — voice connection, queue, and audio player all isolated per-guild so two servers can't step on each other.
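A rough sketch of the per-guild isolation, assuming a simple in-memory registry keyed by guild ID. The `AgentRegistry` class is hypothetical, and the `MusicAgent` here is a stand-in that only models the queue; the real one also owns the voice connection and audio player.

```typescript
// Stand-in for the real MusicAgent; only the queue is modelled here.
class MusicAgent {
  readonly guildId: string;
  readonly queue: string[] = [];

  constructor(guildId: string) {
    this.guildId = guildId;
  }
}

// Hypothetical registry: one lazily created agent per guild ID.
class AgentRegistry {
  private readonly agents = new Map<string, MusicAgent>();

  get(guildId: string): MusicAgent {
    let agent = this.agents.get(guildId);
    if (!agent) {
      agent = new MusicAgent(guildId);
      this.agents.set(guildId, agent);
    }
    return agent;
  }

  // Drop the agent when the bot leaves voice so stale state does not linger.
  remove(guildId: string): boolean {
    return this.agents.delete(guildId);
  }
}

const registry = new AgentRegistry();
registry.get("guild-a").queue.push("track 1");
console.log(registry.get("guild-a").queue.length); // 1
console.log(registry.get("guild-b").queue.length); // 0
```

Because each guild ID maps to its own instance, queueing a track in one server leaves every other server's queue untouched.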
The flow for a fuzzy request:
1. Slash command lands on the per-guild MusicAgent
2. MusicAgent hands the raw text to the GeminiAgent for intent classification
3. GeminiAgent returns one of five intents:
   - play — pass to YouTubeService, queue, hand to AudioPlayer
   - clarify — ask the user to pick from a shortlist
   - suggest — offer a few curated picks
   - playlist — build a 10–15 track draft queue
   - reject — decline politely, no tokens spent
Direct commands like /skip and /pause bypass step 2 entirely — the MusicAgent handles them locally without ever calling Gemini.
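The bypass can be sketched as a set lookup that runs before any model call. Everything named here (`DIRECT_COMMANDS`, `needsLlm`, `dispatch`, the `Classifier` type) is a hypothetical illustration, and the classifier is stubbed rather than calling Gemini.

```typescript
// Commands the text above lists as unambiguous; they never reach the LLM.
const DIRECT_COMMANDS = new Set(["skip", "pause", "stop", "volume"]);

function needsLlm(command: string): boolean {
  return !DIRECT_COMMANDS.has(command);
}

// Stand-in signature for the intent classifier (Gemini in the real bot).
type Classifier = (text: string) => Promise<string>;

type Route =
  | { path: "local"; command: string }
  | { path: "llm"; intent: string };

// Hypothetical dispatcher: reports which path a command took.
async function dispatch(command: string, text: string, classify: Classifier): Promise<Route> {
  if (!needsLlm(command)) {
    // Handled locally; the classifier is never invoked.
    return { path: "local", command };
  }
  // Only fuzzy requests pay for a model call.
  const intent = await classify(text);
  return { path: "llm", intent };
}

// Stub classifier so the sketch runs without network access.
const stubClassify: Classifier = async () => "play";
dispatch("skip", "", stubClassify).then((route) => console.log(route.path)); // prints "local"
```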
The self-hosted constraint
YouTube blocks datacenter IPs aggressively. Running this on a VPS means roughly 80% of your playback requests fail. So the bot is self-hosted by design — it runs on a home server or a Raspberry Pi, on a residential IP that YouTube is happy with.
That constraint ended up being a feature. It kept the scope small: no multi-tenant infra, no scaling headaches, no abuse protection layer. One person, one box, one bot.
Things I'm proud of
- Token discipline. Direct commands never touch the LLM. Even the rejection path returns early before any expensive reasoning. The bot is cheap to run even under heavy use.
- Fail-fast startup. Environment variables go through Zod at boot. A missing DISCORD_TOKEN or bad GEMINI_API_KEY crashes the process immediately with a clear error, instead of dying mysteriously on the first /play.
- Structured logging with secret redaction. Pino's redaction rules strip tokens from logs automatically, so you can pipe output anywhere without worrying about leaking credentials.
- Channel lock via /setup. The bot carves out a dedicated music channel and refuses commands from anywhere else, which kills the "random person plays a song in #general" problem.
- Auto-leave after 5 minutes of idle. No zombie connections, no "why is the bot still in voice chat at 3am."
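The auto-leave behaviour could be built on a resettable timer along these lines. This is a sketch under assumptions: the `IdleTimer` class and its method names are hypothetical, and the callback only logs where the real bot would disconnect from voice.

```typescript
// Hypothetical idle timer: calls onIdle after timeoutMs without activity.
class IdleTimer {
  private handle: ReturnType<typeof setTimeout> | undefined;
  private readonly timeoutMs: number;
  private readonly onIdle: () => void;

  constructor(timeoutMs: number, onIdle: () => void) {
    this.timeoutMs = timeoutMs;
    this.onIdle = onIdle;
  }

  // Call when playback ends or the queue empties; restarts the countdown.
  start(): void {
    this.cancel();
    this.handle = setTimeout(() => {
      this.handle = undefined;
      this.onIdle();
    }, this.timeoutMs);
  }

  // Call when a new track starts playing.
  cancel(): void {
    if (this.handle !== undefined) {
      clearTimeout(this.handle);
      this.handle = undefined;
    }
  }

  get pending(): boolean {
    return this.handle !== undefined;
  }
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;
const timer = new IdleTimer(FIVE_MINUTES_MS, () => console.log("leaving voice channel"));
timer.start();  // queue went empty
timer.cancel(); // someone queued a new track before the timeout
```

Calling `start()` always clears any earlier countdown first, so repeated idle events cannot stack up multiple disconnects.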
Tech stack
- TypeScript + Node.js 22
- discord.js v14 + @discordjs/voice
- Google Gemini (@google/generative-ai)
- yt-dlp + ffmpeg
- Pino for logging
- Zod for env validation
- Vitest for unit tests