Speech Recognition Fundamentals
Acoustic Model
A statistical model that maps audio signals to phonetic units.
Word Error Rate (WER)
A common metric for evaluating the performance of a speech recognition system, calculated as the number of word errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcript.
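As a minimal sketch of this metric, the function below counts errors as the word-level edit distance (substitutions, deletions, and insertions) between a reference and a hypothesis transcript. The function name and the whitespace tokenization are assumptions made for illustration; real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors while the denominator is the reference length, the value can exceed 1.0 when the hypothesis is much longer than the reference.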
Language Model
A probabilistic framework that predicts the likelihood of a sequence of words.
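One simple instance of such a framework is a bigram model, which scores a sentence by multiplying the probability of each word given the previous one. The sketch below is illustrative only: the function names, the toy training sentences, and the add-one smoothing are assumptions, and production ASR systems typically use far larger n-gram or neural language models.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs, with <s> and </s> as sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing keeps unseen word pairs from getting zero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))
    return total


model = train_bigram(["recognize speech", "recognize speech now"])
likely = log_prob("recognize speech", model)
unlikely = log_prob("speech recognize", model)
```

A word order seen in training scores higher than the reversed order, which is how a language model helps a recognizer prefer plausible transcripts.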
Connectionist Temporal Classification (CTC)
A loss function and output formulation for neural network models that sums over all possible alignments between input audio frames and output labels, so training does not require frame-level alignments; often used in end-to-end ASR.
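A small piece of CTC that is easy to show is the collapsing rule used to turn a per-frame label sequence into an output string: merge consecutive repeats, then drop the blank symbol. The sketch below covers only that rule, not the CTC loss itself; the function name and the choice of "_" as the blank symbol are assumptions for illustration.

```python
def ctc_collapse(frame_labels, blank="_"):
    """Merge consecutive repeated labels, then remove blanks."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return "".join(output)


text = ctc_collapse("hh_e_ll_lo__")
```

The blank between the two runs of "l" is what lets a genuinely doubled letter survive the merge step.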
Feature Extraction
The process of transforming raw audio data into a set of numerical features that can be processed by an ASR system.
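As a rough sketch of this step, the function below slices a waveform into overlapping frames and computes two simple numbers per frame: log energy and zero-crossing rate. The function name, the 25 ms frame and 10 ms hop, and the choice of these two features are assumptions for illustration; practical systems more often use spectral features such as log-mel filterbanks or MFCCs.

```python
import math


def frame_features(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Return one (log_energy, zero_crossing_rate) pair per frame."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        # The small constant avoids log(0) on silent frames.
        features.append((math.log(energy + 1e-10), crossings / (frame_len - 1)))
    return features


# One second of a 440 Hz tone as stand-in audio.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
features = frame_features(tone)
```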
End-to-End Speech Recognition
An ASR approach that maps audio (or features derived from it) directly to text using a single neural network model, rather than separately trained acoustic, pronunciation, and language models.
Hidden Markov Model (HMM)
A statistical model of a stochastic process whose states are not directly observed, often employed in ASR to model the probability of transitions between acoustic model states.
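To make the role of transition probabilities concrete, here is a minimal sketch of the forward algorithm, which computes the probability of an observation sequence under an HMM by summing over all hidden state paths. The two states, the observation symbols, and every probability below are made-up values for illustration.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Probability of the observation sequence, summed over all state paths."""
    # alpha[s] is the probability of the observations so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[p] * trans_p[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: hidden states emit a coarse "low" or "high" energy symbol.
states = ("silence", "speech")
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

likelihood = forward_likelihood(["low", "high", "high"], states, start_p, trans_p, emit_p)
```

Replacing the sum over previous states with a maximum gives the Viterbi recursion, which finds the single most likely state path instead of the total probability.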
Phoneme
The smallest unit of sound in speech that can distinguish one word from another.
Automatic Speech Recognition (ASR)
The use of computer algorithms to convert spoken language into text.
Beam Search
A heuristic search algorithm that keeps only a fixed number of the most promising partial hypotheses (the beam width) at each step, often used in ASR to decode the spoken words from an acoustic model's output.
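The pruning idea can be sketched in a few lines. In the example below, `next_log_probs` is a hypothetical stand-in for a model that returns log-probabilities of the next token given a prefix, and `toy_model` is a made-up table chosen so that the locally best first token does not lead to the best overall sequence.

```python
import math


def beam_search(next_log_probs, num_steps, beam_width=2):
    """Keep only the beam_width best-scoring prefixes after each step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(num_steps):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Prune: sort by score and discard everything outside the beam.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(prefix):
    """Hypothetical next-token probabilities, for illustration only."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return {token: math.log(p) for token, p in table[prefix].items()}


best_prefix, best_score = beam_search(toy_model, num_steps=2, beam_width=2)[0]
```

With a beam width of 1 the search is greedy and commits to "a" at the first step; a width of 2 keeps "b" alive long enough to find the higher-scoring sequence that starts with it.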
Deep Neural Network (DNN)
A neural network with multiple hidden layers between the input and output layers, used in modern ASR for modeling complex patterns in data.
Voice Activity Detection (VAD)
A technique in ASR to determine the presence or absence of human speech within an audio stream.
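One of the simplest ways to approximate this is an energy threshold: frames whose average energy exceeds a fixed value are marked as speech. The sketch below assumes samples scaled to the range -1.0 to 1.0, and the function name, frame length, and threshold are illustrative choices; practical detectors add smoothing and noise adaptation, or use a trained classifier.

```python
import math


def energy_vad(samples, frame_len=160, threshold=0.01):
    """Return one boolean per non-overlapping frame: True if energy exceeds the threshold."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions


# Two frames of silence followed by two frames of a tone standing in for speech.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
decisions = energy_vad(silence + tone)
```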
Mel-Frequency Cepstral Coefficients (MFCCs)
A representation of the short-term power spectrum of a sound, based on a linear cosine transform of the log power spectrum on a nonlinear mel scale of frequency.
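Two building blocks of this definition can be sketched directly: the mel scale conversion and the cosine transform applied to log filterbank energies. The mel formula below is one widely used variant (others exist), the DCT-II is left unnormalized, and the function names and example energies are assumptions for illustration; the framing, FFT, and mel filterbank stages are omitted.

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to mels using a common logarithmic formula."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II, the cosine transform that yields cepstral coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]


# Made-up log mel filterbank energies for a single frame.
log_mel_energies = [2.1, 2.4, 3.0, 3.2, 2.8, 2.0, 1.5, 1.1]
mfcc_like = dct_ii(log_mel_energies, n_coeffs=4)
```

Keeping only the first few coefficients retains the smooth spectral envelope and discards fine detail, which is why MFCC vectors are usually short.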
Speaker Diarization
The process of partitioning an input audio stream into homogeneous segments according to speaker identity.
Continuous Speech Recognition
Recognition of natural speech where words are spoken in full sentences without pausing between them.
© Hypatia.Tech. 2024 All rights reserved.