When working with audio data, most people are familiar with waveforms, a simple graph showing the loudness of a sound over time. But for complex signals like speech or music, a waveform only tells part of the story. If you want to dig deeper into what a sound is made of, spectrograms are your next step.
A waveform visualizes changes in amplitude (loudness) over time. It’s great for basic signals like a pure tone: the A440 note in music, for example, is a clean sine wave that repeats 440 times per second.
But the moment multiple tones overlap, things get complicated. The waveform becomes the sum of those frequencies, and visually, it’s hard to untangle what's what.
Let’s say you play the A440 tone (a sine wave at 440 Hz) at the same time as a C5 note (523.25 Hz). The resulting waveform isn’t two separate lines; it’s the sum of both waveforms. Sometimes the peaks line up and reinforce each other. Other times, they cancel each other out.
For example, at around 0.001 seconds, the A and C waves might both be at +0.5 amplitude, producing a combined amplitude of 1. But just a few milliseconds later, the A might be at -0.5 while the C is at +0.5, canceling each other out and flattening the wave.
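To make the idea concrete, here is a minimal numpy sketch of that superposition. The sample rate, amplitudes, and duration are illustrative choices, not values from any particular recording:

```python
import numpy as np

sr = 44100                      # samples per second (CD-quality rate)
t = np.arange(0, 0.01, 1 / sr)  # 10 ms of time stamps

a440 = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # the A440 tone
c5 = 0.5 * np.sin(2 * np.pi * 523.25 * t)     # the C5 tone

# What a microphone would capture: just the pointwise sum.
mix = a440 + c5

# The combined signal swings higher than either tone alone where the
# peaks reinforce, and flattens out where they cancel.
print(mix.max(), mix.min())
```

Plotting `mix` against `t` reproduces the tangled waveform described above: the individual notes are still in there, but you can no longer read them off by eye.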
This complex pattern gives us the full sound we hear, but it also hides the original ingredients. And when it comes to applications like speech recognition, audio editing, or machine learning, we often need to know exactly what those ingredients were.
When we record or analyze audio, we don’t get a neatly labeled list of the frequencies that created it. Instead, we get one complex waveform, a masterpiece made of many overlapping sounds.
For some applications, that’s fine. If all we care about is when a sound starts and stops, or its loudness, the waveform might be enough.
But in fields like speech recognition or music analysis, the structure of the sound matters. We need to know: what frequencies made up that sound? Which ones were emphasized? How did they evolve over time?
That’s where we need a tool to look beneath the surface.
To separate out the components of a complex waveform, we use a mathematical tool called the Fourier transform.
In simple terms, it analyzes a waveform and reveals which frequencies are present and how strong they are. It turns a single, messy signal into a collection of building blocks, like separating a chord into individual notes.
We won’t get into the math here, but if you’re curious, we highly recommend 3Blue1Brown’s visual explanation of the Fourier transform. It’s one of the most intuitive introductions out there.
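While we’re skipping the math, it’s easy to see the Fourier transform in action. This sketch (using numpy’s FFT on the same two-note mix, with an illustrative one-second duration) shows the two hidden frequencies reappearing as the two tallest peaks in the spectrum:

```python
import numpy as np

sr = 44100
t = np.arange(0, 1.0, 1 / sr)   # one second of audio
mix = 0.5 * np.sin(2 * np.pi * 440.0 * t) + \
      0.5 * np.sin(2 * np.pi * 523.25 * t)

# The Fourier transform turns the messy time-domain signal into
# a strength-per-frequency view.
spectrum = np.abs(np.fft.rfft(mix))
freqs = np.fft.rfftfreq(len(mix), 1 / sr)

# The two strongest frequency bins are the original notes.
top_two = freqs[np.argsort(spectrum)[-2:]]
print([round(float(f)) for f in sorted(top_two)])  # → [440, 523]
```

The chord has been separated back into its individual notes: one peak at 440 Hz, one at (approximately) 523 Hz.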
A spectrogram uses the Fourier transform to show how a sound’s frequency content changes over time.
Here’s how it works: the audio is sliced into short, overlapping windows, and a Fourier transform is applied to each one to measure which frequencies are present in that slice.
The result is a two-dimensional plot: time runs along the horizontal axis, frequency along the vertical axis, and color or brightness shows how strong each frequency is at each moment.
Let’s go back to the A440 + C5 example. The raw waveform showed a complex mix. But once we generate a spectrogram, we clearly see two distinct frequency bands, one at 440 Hz and one at 523 Hz. That’s the power of this tool: it reveals structure hidden inside the sound.
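The slicing-and-transforming procedure above can be sketched in a few lines of numpy. The window length, hop size, and Hann window are common default choices, not the only ones:

```python
import numpy as np

sr = 8000
t = np.arange(0, 1.0, 1 / sr)
mix = 0.5 * np.sin(2 * np.pi * 440.0 * t) + \
      0.5 * np.sin(2 * np.pi * 523.25 * t)

win = 1024   # samples per window (~128 ms at this rate)
hop = 512    # 50% overlap between consecutive windows

# Slice the signal into windows, tapering each with a Hann window
# to reduce spectral leakage.
frames = [mix[i:i + win] * np.hanning(win)
          for i in range(0, len(mix) - win, hop)]

# One magnitude spectrum per time slice: rows are frequency bins,
# columns are moments in time.
spec = np.abs(np.fft.rfft(frames, axis=1)).T
freqs = np.fft.rfftfreq(win, 1 / sr)

print(spec.shape)  # (frequency bins, time frames)
```

Rendering `spec` as an image (e.g. with `matplotlib.pyplot.pcolormesh`) shows exactly the two horizontal bands described above: a steady stripe near 440 Hz and another near 523 Hz, running across the whole duration.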
Nowhere is this more useful than in speech analysis. Human speech is made of rich, overlapping frequencies shaped by our vocal tract. These resonant frequencies, called formants, carry vital information about what sound was produced.
Take the words “bat” and “bag.” While the waveform may look similar early on, a spectrogram reveals key differences in the final sounds. The voiced “g” in “bag” produces what linguists call a velar pinch, a visible pattern where formants F1 and F2 converge. The voiceless “t” in “bat,” on the other hand, ends in a sharp cutoff.
Phoneticists often focus on:
F0: the fundamental frequency, which corresponds to the pitch of the voice
F1 and F2: the first two formants, which indicate the actual vowel or consonant based on their distance and motion
In “bat,” the first word in this spectrogram, F0 cuts out sharply and F1 and F2 stay evenly spaced apart. In “bag,” the second word, F1 and F2 open up and then pinch back together at the end. These differences are subtle in the audio but obvious in a spectrogram, and critical for accurate transcription or language modeling.
Spectrograms aren’t just for linguists or ASR (automatic speech recognition) researchers. They show up anywhere people need to see inside a sound.
If you’re working with sound data, spectrograms help you understand how that sound was produced, not just when.
Watch the full In the Loop video here: