Speech Recognition Fundamentals

Flashcards

Automatic Speech Recognition (ASR)

The use of computer algorithms to convert spoken language into text.

Phoneme

The smallest unit of sound in speech that can distinguish one word from another.

Acoustic Model

A statistical model that maps acoustic features extracted from the audio signal to phonetic units, such as phonemes.

Language Model

A probabilistic model that assigns a likelihood to a sequence of words, typically by predicting each word from the words that precede it.
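
As a minimal illustration, the sketch below estimates a bigram language model by maximum likelihood and scores a sentence as the sum of log P(word | previous word). The function names, the `<s>`/`</s>` boundary tokens, and the toy corpus are assumptions for illustration. Practical language models add smoothing or use neural networks; this unsmoothed version returns negative infinity for any bigram it has never seen.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count history (preceding-token) and bigram frequencies, with <s>/</s> markers."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts


def sentence_log_prob(sentence, history_counts, bigram_counts):
    """Sum of log P(w_i | w_{i-1}) under maximum-likelihood (unsmoothed) estimates."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        count = bigram_counts[(prev, word)]
        if count == 0:
            return float("-inf")  # unseen bigram has probability 0 without smoothing
        total += math.log(count / history_counts[prev])
    return total


# Hypothetical toy corpus.
corpus = ["recognize speech", "wreck a nice beach", "recognize speech today"]
hist, bigrams = train_bigram(corpus)
```

With this corpus, "recognize speech" gets probability 2/3 × 1 × 1/2 = 1/3, while "recognize beach" scores negative infinity because that bigram never occurs.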

Word Error Rate (WER)

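A common metric for evaluating speech recognition systems: the number of word errors (substitutions S, deletions D, and insertions I) divided by the number of words N in the reference transcript, i.e. WER = (S + D + I) / N. Lower is better, and because insertions are counted, WER can exceed 100%.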
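
WER is usually computed with a word-level edit (Levenshtein) distance, which finds the minimum number of substitutions, deletions, and insertions that turn the reference into the hypothesis. A minimal sketch, assuming whitespace tokenization and no text normalization (casing, punctuation); `word_error_rate` is a hypothetical name:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, where N is the number of reference words.

    S + D + I is the minimum word-level edit (Levenshtein) distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)
```

For reference "the cat sat on the mat" and hypothesis "the cat sit on mat", there is one substitution and one deletion, so WER = 2/6 ≈ 0.33. Insertions can push WER above 1.0: "hi" scored against "oh hi there" gives 2/1 = 2.0.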

Hidden Markov Model (HMM)

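A statistical model of a sequence of hidden states, each of which emits observable outputs; in classical ASR it models the probability of transitions between acoustic model states (typically sub-phonetic units) and of the acoustic features each state produces.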
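
The forward algorithm computes the total probability of an observation sequence under an HMM by summing over all hidden-state paths with dynamic programming. The sketch below assumes a toy HMM with discrete observations; the two states ("sil", "speech") and all probabilities are made-up values, whereas real ASR systems use many sub-phonetic states with continuous emission densities (for example Gaussian mixtures or neural network outputs).

```python
def forward_probability(observations, initial, transition, emission):
    """Forward algorithm: P(observations), summed over all hidden-state paths.

    initial[s]       = P(first state is s)
    transition[s][t] = P(next state is t | current state is s)
    emission[s][o]   = P(observing o | state s)
    """
    states = list(initial)
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            t: sum(alpha[s] * transition[s][t] for s in states) * emission[t][obs]
            for t in states
        }
    return sum(alpha.values())


# Hypothetical two-state model.
initial = {"sil": 0.8, "speech": 0.2}
transition = {"sil": {"sil": 0.7, "speech": 0.3},
              "speech": {"sil": 0.2, "speech": 0.8}}
emission = {"sil": {"low": 0.9, "high": 0.1},
            "speech": {"low": 0.3, "high": 0.7}}

p = forward_probability(["low", "high"], initial, transition, emission)  # ≈ 0.2364
```

Because every row of `transition` and `emission` sums to 1, the probabilities of all possible observation sequences of a given length also sum to 1, which is a quick sanity check for an implementation like this.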

Beam Search

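A heuristic search algorithm that, at each step, keeps only a fixed number (the beam width) of the most promising partial hypotheses and discards the rest; in ASR it is commonly used to decode the most likely word sequence from the acoustic model's output, often combined with language model scores.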
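
A generic sketch of beam search over token sequences, assuming a scoring function that returns log-probabilities for the next token given a prefix. In an ASR decoder that function would combine acoustic and language model scores; here `toy_scorer` and `TOY_TABLE` are a hypothetical hand-written table chosen so that greedy search (beam width 1) misses the best sequence.

```python
import math


def beam_search(next_log_probs, beam_width, max_steps, end_token="</s>"):
    """Return the best (tokens, log_prob) found while keeping at most
    `beam_width` unfinished hypotheses per step.

    next_log_probs(prefix) must return {token: log P(token | prefix)}.
    """
    beams = [((), 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_steps):
        # Expand every surviving hypothesis by every possible next token.
        candidates = [
            (prefix + (token,), score + logp)
            for prefix, score in beams
            for token, logp in next_log_probs(prefix).items()
        ]
        # Prune: keep only the `beam_width` highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))  # complete hypothesis
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses still open when max_steps ran out
    return max(finished, key=lambda c: c[1])


# Hypothetical next-token probabilities; any prefix not listed must end.
TOY_TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"c": 0.5, "d": 0.5},
    ("b",): {"c": 0.9, "d": 0.1},
}


def toy_scorer(prefix):
    probs = TOY_TABLE.get(prefix, {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


best, score = beam_search(toy_scorer, beam_width=2, max_steps=5)
```

With `beam_width=2` the search returns ("b", "c", "</s>") with probability 0.4 × 0.9 = 0.36, while `beam_width=1` commits to "a" early and ends with probability 0.6 × 0.5 = 0.30. Wider beams trade extra computation for a lower risk of pruning the best hypothesis.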

Feature Extraction

The process of transforming raw audio data into a set of numerical features that can be processed by an ASR system.
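
Most ASR front ends begin by slicing the waveform into short overlapping frames and computing features per frame. The sketch below assumes a 25 ms window with a 10 ms hop (a common convention, not a requirement) and uses per-frame log energy as a deliberately simple feature; the function names are hypothetical, and real systems typically go on to compute spectral features such as MFCCs.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames; a trailing partial frame is dropped."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def log_energy(frame, floor=1e-10):
    """Log of the frame's total energy; the floor avoids log(0) on silent frames."""
    return math.log(max(sum(x * x for x in frame), floor))


# One second of (silent) 16 kHz audio.
frames = frame_signal([0.0] * 16000, sample_rate=16000)
features = [log_energy(f) for f in frames]
```

At 16 kHz these settings give 400-sample frames every 160 samples, so one second of audio yields 98 full frames and therefore 98 feature values.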

Mel-Frequency Cepstral Coefficients (MFCCs)

A representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
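
The last stages of the MFCC pipeline follow directly from the definition: map frequencies onto the mel scale, take the log of the mel filterbank energies, and apply a discrete cosine transform. This partial sketch assumes the filterbank energies for one frame are already available (framing, windowing, FFT, and the triangular mel filters are omitted) and uses one common mel formula, m = 2595 log10(1 + f / 700). Keeping 13 coefficients is a conventional choice; DCT normalization and liftering are skipped, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """One common mel-scale formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def cepstral_coefficients(mel_energies, num_coeffs=13, floor=1e-10):
    """Log-compress mel filterbank energies, then apply an unnormalized DCT-II."""
    log_e = [math.log(max(e, floor)) for e in mel_energies]
    n = len(log_e)
    return [
        sum(log_e[m] * math.cos(math.pi * k * (m + 0.5) / n) for m in range(n))
        for k in range(num_coeffs)
    ]


# Hypothetical energies from 26 mel filters (a flat spectrum).
coeffs = cepstral_coefficients([math.e] * 26)
```

The formula is built so that 1000 Hz maps to roughly 1000 mel. For a perfectly flat set of energies, every coefficient except the 0th is numerically zero, because the cosine basis functions for k ≥ 1 sum to zero; nonzero higher coefficients therefore encode the shape of the spectral envelope.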

End-to-End Speech Recognition

An ASR approach that maps audio (raw waveforms or spectral features such as spectrograms) directly to text using a single neural network trained as one model, without separately trained acoustic and language models.

Deep Neural Network (DNN)

A neural network with multiple hidden layers between the input and output layers, used in modern ASR for modeling complex patterns in data.

Connectionist Temporal Classification (CTC)

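An output layer and loss function for neural network models that lets them learn from unsegmented audio-transcript pairs: it adds a special blank symbol and sums over all possible alignments between input audio frames and output labels, so no frame-level alignment is needed. It is widely used in end-to-end ASR.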
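
At decoding time, CTC maps a frame-level path to a label sequence by merging consecutive repeated symbols and then deleting blanks. The sketch below implements that collapse rule plus greedy (best-path) decoding, which picks the top-scoring symbol in each frame; best-path decoding is a fast approximation and need not return the most probable labeling. The blank symbol `_`, the toy alphabet, and the scores are assumptions for illustration.

```python
from itertools import groupby


def ctc_collapse(path, blank="_"):
    """CTC collapse rule: merge consecutive repeated symbols, then delete blanks."""
    return "".join(sym for sym, _ in groupby(path) if sym != blank)


def ctc_greedy_decode(frame_scores, alphabet, blank="_"):
    """Best-path decoding: take the highest-scoring symbol per frame, then collapse.

    frame_scores: one list of scores per frame, aligned index-by-index with `alphabet`.
    """
    path = [alphabet[max(range(len(scores)), key=scores.__getitem__)]
            for scores in frame_scores]
    return ctc_collapse(path, blank)


# Hypothetical per-frame probabilities over ["_", "a", "b"].
frames = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1],
          [0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]
decoded = ctc_greedy_decode(frames, ["_", "a", "b"])  # path a a _ a b -> "aab"
```

The blank is what lets CTC emit doubled letters: "hh_e_ll_llo" collapses to "hello", whereas "hheello", with no blank between the two runs of "l", collapses to "helo".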

Voice Activity Detection (VAD)

A technique used in ASR to determine the presence or absence of human speech within an audio stream, so that silence and non-speech noise can be skipped or handled separately.
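
One of the simplest approaches is an energy threshold: frames whose mean energy exceeds a threshold are labeled as speech. This sketch assumes non-overlapping frames and a fixed, hand-tuned threshold (both hypothetical choices, as is the name `energy_vad`); practical VAD systems typically use more robust features, smoothing across frames, or trained classifiers.

```python
def energy_vad(samples, frame_len, threshold):
    """Label each non-overlapping frame True (speech) if its mean energy exceeds `threshold`."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_energy = sum(x * x for x in frame) / frame_len
        flags.append(mean_energy > threshold)
    return flags


# Hypothetical signal: silence, a loud burst, then faint background noise.
signal = [0.0] * 100 + [0.5, -0.5] * 50 + [0.01] * 100
flags = energy_vad(signal, frame_len=100, threshold=0.01)  # [False, True, False]
```

A fixed threshold like this is sensitive to recording level and background noise, which is one reason more robust detectors are used in practice.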

Speaker Diarization

The process of partitioning an input audio stream into homogeneous segments according to speaker identity, answering the question "who spoke when".

Continuous Speech Recognition

Recognition of natural speech where words are spoken in full sentences without pausing between them.
