What it is
Automatic speech recognition, or ASR, converts spoken audio into written text. A model maps a waveform, which has thousands of samples per second, to a much shorter sequence of words.
The pipeline
- Feature extraction: convert the waveform into a spectrogram or log mel features that show energy across frequencies over time.
- Acoustic modeling: a neural network reads the features and predicts sound units or characters.
- Decoding: turn frame level predictions into the final text, often with a language model to favor plausible word sequences.
Approaches
- CTC lets a model output a character per frame and collapse repeats and blanks into words, which handles unknown alignment between audio and text.
- Sequence to sequence with attention, as in encoder decoder transformers, generates text directly and can fold in punctuation and casing.
Large modern systems train on huge weakly labeled audio and become robust across accents and noise, sometimes adding translation.
Key idea
ASR maps audio features to text through an acoustic model and a decoder, using CTC or attention to bridge the mismatch between audio frames and words.