← Lessons

quiz vs the machine

Gold1410

Machine Learning

Automatic Speech Recognition

Turning spoken audio into text with sequence models.

5 min read · core · beat Gold to climb

What it is

Automatic speech recognition, or ASR, converts spoken audio into written text. A model maps a waveform, which has thousands of samples per second, to a much shorter sequence of words.

The pipeline

  • Feature extraction: convert the waveform into a spectrogram or log mel features that show energy across frequencies over time.
  • Acoustic modeling: a neural network reads the features and predicts sound units or characters.
  • Decoding: turn frame level predictions into the final text, often with a language model to favor plausible word sequences.

Approaches

  • CTC lets a model output a character per frame and collapse repeats and blanks into words, which handles unknown alignment between audio and text.
  • Sequence to sequence with attention, as in encoder decoder transformers, generates text directly and can fold in punctuation and casing.

Large modern systems train on huge weakly labeled audio and become robust across accents and noise, sometimes adding translation.

Key idea

ASR maps audio features to text through an acoustic model and a decoder, using CTC or attention to bridge the mismatch between audio frames and words.

Check yourself

Answer to earn rating on the learn ladder.

1. What problem does CTC solve in speech recognition?

2. Why convert the raw waveform into spectrogram or mel features first?