
NLP Evaluation Metrics

10 Flashcards


BLEU Score

The BLEU Score (Bilingual Evaluation Understudy) is calculated by comparing the machine-generated text to one or more reference translations. It considers the precision of n-grams in the generated text and applies a brevity penalty for overly short translations.
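For illustration, a minimal sketch of a sentence-level BLEU score using NLTK's sentence_bleu; the tokenized sentences and the smoothing choice are only assumptions for this example, not part of the metric's definition.

```python
# Minimal BLEU sketch using NLTK (assumes nltk is installed; the sentences are made up).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]   # one reference translation, tokenized
candidate = ["the", "cat", "is", "on", "the", "mat"]    # machine-generated output, tokenized

# sentence_bleu takes a list of references; smoothing avoids zero scores when a
# higher-order n-gram has no match in a short sentence.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```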

F-Measure (F1 Score)

The F-Measure, or F1 Score, is the harmonic mean of precision and recall.

F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
It balances precision and recall in a single number and is often used in binary classification tasks.
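A minimal sketch of the formula, assuming raw true-positive, false-positive, and false-negative counts (the counts below are illustrative):

```python
# F1 from confusion-matrix counts; the guards avoid division by zero on degenerate inputs.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(f1_score(tp=40, fp=10, fn=20), 3))  # precision 0.8, recall 0.667 -> F1 0.727
```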

WER (Word Error Rate)

WER is calculated by taking the total count of errors (substitutions, insertions, deletions) made by an ASR system and dividing it by the number of words in the reference transcription.

WER = \frac{S + D + I}{N}
where S is substitutions, D is deletions, I is insertions, and N is the number of words in the reference.
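A minimal sketch that computes WER as a word-level Levenshtein distance divided by the reference length; the two sentences are illustrative.

```python
# Word Error Rate via word-level edit distance (dynamic programming).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 1 error / 6 words = 0.167
```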

Precision

Precision is the proportion of positive identifications that were actually correct.

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
It is important in cases where false positives are a significant concern.
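A minimal sketch over binary labels and predictions (the values are illustrative):

```python
# Precision: of everything predicted positive, how much was actually positive?
def precision(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    return tp / (tp + fp) if tp + fp else 0.0

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
print(round(precision(y_true, y_pred), 3))  # 2 TP / (2 TP + 1 FP) = 0.667
```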

AUC-ROC (Area Under the Receiver Operating Characteristic Curve)

AUC-ROC is a performance measurement for binary classification problems. It measures the area under the curve created by plotting the true positive rate against the false positive rate at various threshold settings. The AUC represents the likelihood that the model ranks a random positive example more highly than a random negative example.
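A minimal sketch using scikit-learn's roc_auc_score, assuming binary labels and one model score per example for the positive class (the numbers are illustrative):

```python
# AUC-ROC with scikit-learn (assumes scikit-learn is installed).
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]           # binary ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8]  # model scores / probabilities for the positive class

# roc_auc_score integrates the true positive rate against the false positive rate
# over all classification thresholds.
print(roc_auc_score(y_true, y_score))  # 0.75
```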

Perplexity

Perplexity measures how well a probability model predicts a sample. It is calculated as the exponentiated average negative log-likelihood of a model predicting a sequence.
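A minimal sketch, assuming we already have the probability the model assigned to each token in the sequence (the probabilities below are made up):

```python
import math

# Perplexity = exp(average negative log-likelihood over the sequence).
def perplexity(token_probs):
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Equivalently the inverse geometric mean of the per-token probabilities.
print(perplexity([0.25, 0.5, 0.125, 0.25]))  # 4.0
```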

ROUGE Score

The ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation) measures the quality of a summary by comparing it to reference summaries. It focuses on recall by calculating the fraction of n-grams in the reference summaries that are also found in the generated summary.
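A minimal sketch of ROUGE-1 recall with a single reference summary; the full metric family also covers ROUGE-2, ROUGE-L, and multi-reference handling, which this toy version omits.

```python
from collections import Counter

# ROUGE-1 recall: fraction of reference unigrams also found in the generated summary
# (counts are clipped so repeated words are not over-credited).
def rouge1_recall(reference: str, summary: str) -> float:
    ref_counts = Counter(reference.split())
    sum_counts = Counter(summary.split())
    overlap = sum(min(count, sum_counts[token]) for token, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(round(rouge1_recall("the cat sat on the mat", "the cat lay on the mat"), 3))  # 5/6 = 0.833
```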

Accuracy

Accuracy is the proportion of correct predictions (both true positives and true negatives) over all predictions.

\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
It's commonly used for classification tasks.
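A minimal sketch over label/prediction lists (illustrative values):

```python
# Accuracy: fraction of predictions that match the ground-truth labels.
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 correct out of 5 = 0.8
```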

METEOR Score

The METEOR Score (Metric for Evaluation of Translation with Explicit Ordering) is a metric for machine translation evaluation that considers exact, stemmed, and synonym matches between the generated text and reference translations. It also applies a penalty for fragmented word order, rewarding outputs whose matched words appear in the same order as the reference.
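A minimal sketch using NLTK's meteor_score; it assumes nltk and its WordNet data have been installed, and recent NLTK versions expect pre-tokenized inputs (the sentences are illustrative).

```python
# METEOR with NLTK (requires the WordNet corpus, e.g. via nltk.download("wordnet")).
from nltk.translate.meteor_score import meteor_score

reference  = ["the", "cat", "sat", "on", "the", "mat"]
hypothesis = ["the", "cat", "was", "sitting", "on", "the", "mat"]

# meteor_score accepts a list of references and matches exact forms, stems, and WordNet synonyms.
print(round(meteor_score([reference], hypothesis), 3))
```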

Recall

Recall is the proportion of actual positives that were identified correctly.

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
It's important for tasks where missing positives is costly.
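A minimal sketch over the same kind of binary labels and predictions used for precision (illustrative values):

```python
# Recall: of everything actually positive, how much did the model find?
def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    return tp / (tp + fn) if tp + fn else 0.0

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
print(round(recall(y_true, y_pred), 3))  # 2 TP / (2 TP + 1 FN) = 0.667
```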
