Discord Music Agent

An LLM-powered Discord music bot that understands vibes, not just song titles — ask for "some chill vibes" and Gemini curates the queue.

Why this exists

Every Discord music bot I'd used had the same problem: they were search engines with a play button. You had to know exactly what you wanted — the right title, the right artist, spelled the right way. If you asked for "that sad song from the movie," you got nothing.

I wanted a bot that worked the way a friend would. You say "play some chill vibes," it picks something. You say "that one song that goes da-da-da-daaa," it tries to figure it out. You say "what's the weather," it politely declines instead of burning a token pretending to search.

So I built one around a Gemini agent.

How it thinks

The core idea is intent routing. Every /play request goes through a GeminiAgent that classifies it into one of five outcomes:

play — specific request, hand it straight to YouTube and queue it
clarify — ambiguous, ask the user to pick from a shortlist
suggest — vague mood, offer a few curated picks
playlist — full curation request, build a 10–15 track queue
reject — not a music request, decline without wasting tokens
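
The five outcomes map naturally onto a closed set of string literals. Here is a minimal sketch of how a model reply could be validated before anything acts on it. It assumes the model is prompted to answer with a small JSON object. The names `INTENTS`, `Classification`, and `parseClassification`, the `tracks` field, and the choice to treat malformed replies as `reject` are all illustrative assumptions, not the bot's actual code.

```typescript
// Hypothetical names throughout; only the five intent labels come from the README.
const INTENTS = ["play", "clarify", "suggest", "playlist", "reject"] as const;
type Intent = (typeof INTENTS)[number];

interface Classification {
  intent: Intent;
  // Assumed field: a search query for "play", candidate tracks for the others.
  tracks: string[];
}

// Parse the model's raw text reply. Anything malformed or unknown falls back
// to "reject" so a confused model can never trigger playback (an assumption).
function parseClassification(raw: string): Classification {
  const fallback: Classification = { intent: "reject", tracks: [] };
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return fallback;
  }
  if (typeof data !== "object" || data === null) return fallback;

  const { intent, tracks } = data as { intent?: unknown; tracks?: unknown };
  if (typeof intent !== "string" || !(INTENTS as readonly string[]).includes(intent)) {
    return fallback;
  }
  // Keep only string entries; a missing or non-array "tracks" becomes empty.
  const list = Array.isArray(tracks)
    ? tracks.filter((t): t is string => typeof t === "string")
    : [];
  return { intent: intent as Intent, tracks: list };
}
```

Failing closed keeps the rest of the bot simple: every later stage can trust that `intent` is one of the five known values.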

The unambiguous commands — /skip, /pause, /stop, /volume — bypass the LLM entirely. There's no reason to pay Gemini to tell you that "skip" means skip. Only the fuzzy inputs get the AI treatment.
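
As a rough sketch of that split, the routing decision can be a plain set lookup made before any model call. The command names come from the list above. `DIRECT_COMMANDS`, `Route`, and `routeCommand` are hypothetical names used only for illustration.

```typescript
// Commands with exactly one meaning; these are handled locally.
const DIRECT_COMMANDS = new Set(["skip", "pause", "stop", "volume"]);

type Route = "local" | "llm";

// Decide where a slash command goes. Only commands outside the direct set
// are sent on for intent classification.
function routeCommand(commandName: string): Route {
  return DIRECT_COMMANDS.has(commandName) ? "local" : "llm";
}
```

With this shape, `routeCommand("skip")` yields `"local"` and `routeCommand("play")` yields `"llm"`, so the model is only ever paid for input that actually needs interpreting.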

Architecture

Each Discord server gets its own MusicAgent instance — voice connection, queue, and audio player all isolated per-guild so two servers can't step on each other.
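
One simple way to get that isolation is a registry keyed by guild ID. The sketch below is an assumption about the shape, not the real implementation: the `MusicAgent` here is a stripped-down stand-in holding only a queue, and `AgentRegistry` is a hypothetical name.

```typescript
// Hypothetical stand-in for the real MusicAgent. The real one also owns the
// voice connection and audio player; only the per-guild state matters here.
class MusicAgent {
  readonly guildId: string;
  readonly queue: string[] = [];

  constructor(guildId: string) {
    this.guildId = guildId;
  }
}

class AgentRegistry {
  private readonly agents = new Map<string, MusicAgent>();

  // Return the guild's agent, creating it on first use.
  get(guildId: string): MusicAgent {
    let agent = this.agents.get(guildId);
    if (!agent) {
      agent = new MusicAgent(guildId);
      this.agents.set(guildId, agent);
    }
    return agent;
  }

  // Drop a guild's state, for example after the bot leaves voice.
  remove(guildId: string): boolean {
    return this.agents.delete(guildId);
  }
}
```

Because every lookup goes through the guild ID, a track queued in one server can never appear in another server's queue.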

The flow for a fuzzy request:

1. Slash command lands on the per-guild MusicAgent
2. MusicAgent hands the raw text to the GeminiAgent for intent classification
3. GeminiAgent returns one of five intents:
   - play: pass to YouTubeService, queue, hand to AudioPlayer
   - clarify: ask the user to pick from a shortlist
   - suggest: offer a few curated picks
   - playlist: build a 10–15 track draft queue
   - reject: decline politely, no further tokens spent

Direct commands like /skip and /pause bypass step 2 entirely — the MusicAgent handles them locally without ever calling Gemini.
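
The last part of the flow, acting on the returned intent, could look roughly like the sketch below. It assumes classification has already happened and deliberately leaves out YouTube lookup and audio playback. `Classification`, `respond`, the `tracks` field, and the reply strings are all hypothetical.

```typescript
type Intent = "play" | "clarify" | "suggest" | "playlist" | "reject";

// Hypothetical result shape; the real GeminiAgent output may carry more detail.
interface Classification {
  intent: Intent;
  tracks: string[];
}

// Decide how to answer a classified request. In this sketch only "play"
// modifies the queue; "playlist" produces a draft for the user to confirm.
function respond(result: Classification, queue: string[]): string {
  const { intent, tracks } = result;
  switch (intent) {
    case "play":
      queue.push(...tracks.slice(0, 1));
      return tracks.length > 0 ? `Queued: ${tracks[0]}` : "No match found.";
    case "clarify":
      return `Did you mean: ${tracks.join(", ")}?`;
    case "suggest":
      return `A few picks: ${tracks.join(", ")}`;
    case "playlist":
      return `Draft playlist with ${tracks.length} tracks. Queue it?`;
    case "reject":
      return "Sorry, I only handle music requests.";
    default: {
      // Exhaustiveness check: adding a sixth intent without a case
      // above makes this assignment fail to type-check.
      const unhandled: never = intent;
      throw new Error(`Unhandled intent: ${String(unhandled)}`);
    }
  }
}
```

The `never` assignment in the default branch is the useful part: the compiler, rather than a runtime bug report, points out any intent that was added without a handler.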

The self-hosted constraint

YouTube blocks datacenter IPs aggressively. Running this on a VPS means roughly 80% of your playback requests fail. So the bot is self-hosted by design — it runs on a home server or a Raspberry Pi, on a residential IP that YouTube is happy with.

That constraint ended up being a feature. It kept the scope small: no multi-tenant infra, no scaling headaches, no abuse protection layer. One person, one box, one bot.

Things I'm proud of

Token discipline. Direct commands never touch the LLM. Even the rejection path returns early before any expensive reasoning. The bot is cheap to run even under heavy use.
Fail-fast startup. Environment variables go through Zod at boot. A missing DISCORD_TOKEN or a missing or malformed GEMINI_API_KEY crashes the process immediately with a clear error, instead of dying mysteriously on the first /play.
Structured logging with secret redaction. Pino's redaction rules strip tokens from logs automatically, so you can pipe output anywhere without worrying about leaking credentials.
Channel lock via /setup. The bot carves out a dedicated music channel and refuses commands from anywhere else, which kills the "random person plays a song in #general" problem.
Auto-leave after 5 minutes of idle. No zombie connections, no "why is the bot still in voice chat at 3am."
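
The auto-leave behavior comes down to a resettable timer. Here is a minimal sketch, assuming the timer is started when playback stops and cancelled when a new track begins. `IdleTimer`, `IDLE_TIMEOUT_MS`, and the `onIdle` callback are hypothetical names. In a real bot the callback is where the voice connection would be torn down.

```typescript
// Five minutes, matching the idle window described above.
const IDLE_TIMEOUT_MS = 5 * 60 * 1000;

class IdleTimer {
  private handle: ReturnType<typeof setTimeout> | undefined;
  private readonly onIdle: () => void;
  private readonly timeoutMs: number;

  constructor(onIdle: () => void, timeoutMs: number = IDLE_TIMEOUT_MS) {
    this.onIdle = onIdle;
    this.timeoutMs = timeoutMs;
  }

  // Call when playback stops or the queue empties. Restarting replaces
  // any countdown that is already running.
  start(): void {
    this.cancel();
    this.handle = setTimeout(() => {
      this.handle = undefined;
      this.onIdle();
    }, this.timeoutMs);
  }

  // Call when a new track starts, so an active session is never disconnected.
  cancel(): void {
    if (this.handle !== undefined) {
      clearTimeout(this.handle);
      this.handle = undefined;
    }
  }

  // True while a countdown is running.
  get pending(): boolean {
    return this.handle !== undefined;
  }
}
```

Keeping one timer per guild, next to that guild's queue, means a busy server never affects when the bot leaves a quiet one.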

Tech stack

TypeScript + Node.js 22
discord.js v14 + @discordjs/voice
Google Gemini (@google/generative-ai)
yt-dlp + ffmpeg
Pino for logging
Zod for env validation
Vitest for unit tests