NLP Evaluation Metrics

BLEU Score

The BLEU Score (Bilingual Evaluation Understudy) evaluates machine-generated text by comparing it to one or more reference translations. It combines the modified (clipped) n-gram precisions of the generated text, typically for n = 1 to 4, and applies a brevity penalty to translations that are shorter than the reference.
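
The description above can be sketched in a few lines of Python. This is a minimal, unsmoothed sentence-level version that assumes pre-tokenized input and uniform weights over 1- to 4-grams; the function names `ngrams` and `bleu` are illustrative, not taken from any library, and production toolkits add smoothing and corpus-level aggregation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights (a sketch)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty, using the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to its reference scores 1.0, while a candidate that is a perfect but shorter prefix of the reference is reduced only by the brevity penalty.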

ROUGE Score

The ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation) measures the quality of a summary by comparing it to reference summaries. Its n-gram variant, ROUGE-N, focuses on recall: it is the fraction of n-grams in the reference summaries that are also found in the generated summary.
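
As a rough illustration of the n-gram recall described above, the sketch below computes ROUGE-N recall against a single reference. It assumes pre-tokenized input; the name `rouge_n_recall` is hypothetical, and real toolkits also report precision, F-measure, and other variants such as ROUGE-L.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate (a sketch)."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_counts = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # reference is shorter than n tokens
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total
```

For the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", this gives 5/6 for unigrams and 3/5 for bigrams.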

Perplexity

Perplexity measures how well a probability model predicts a sample. It is calculated as the exponential of the average negative log-likelihood that the model assigns to the tokens of a sequence; lower values mean the model predicts the sequence better.
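
A minimal sketch of that calculation, assuming we already have the probability the model assigned to each observed token (the `token_probs` argument is an illustrative input, not a specific library's output format):

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)
```

A model that assigns probability 0.25 to every token has perplexity 4, which matches the intuition of choosing uniformly among four options at each step.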

WER (Word Error Rate)

WER is calculated by taking the total count of errors (substitutions, deletions, insertions) made by an automatic speech recognition (ASR) system and dividing it by the number of words in the reference transcription.

WER = \frac{S + D + I}{N}

where S is substitutions, D is deletions, I is insertions, and N is the number of words in the reference.
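
The total S + D + I is the word-level edit (Levenshtein) distance between the reference and the hypothesis. The sketch below assumes whitespace tokenization and a non-empty reference; the function name `wer` is illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)  # assumes the reference is not empty
```

Because insertions count as errors while N only counts reference words, WER can exceed 1 when the hypothesis is much longer than the reference.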

METEOR Score

The METEOR Score (Metric for Evaluation of Translation with Explicit ORdering) is a metric for machine translation evaluation that aligns the generated text with reference translations using exact, stemmed, and synonym matches. It combines unigram precision and recall in a recall-weighted harmonic mean and accounts for word order through a fragmentation penalty.

F-Measure (F1 Score)

The F-Measure, or F1 Score, is the harmonic mean of precision and recall.

F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}

It summarizes precision and recall in a single number and is often used in binary classification tasks.
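
A short sketch of the formula, with a guard for the case where both precision and recall are zero (returning 0.0 there is a common convention, assumed here):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean is pulled toward the smaller value: precision 0.5 with recall 1.0 gives about 0.667, not the arithmetic mean of 0.75.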

Accuracy

Accuracy is the proportion of correct predictions (both true positives and true negatives) over all predictions.

\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}

It's commonly used for classification tasks.
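
A minimal sketch, assuming two equal-length lists of labels (the names `y_true` and `y_pred` are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

Accuracy can mislead on imbalanced data: with 95 negatives and 5 positives, a model that always predicts the negative class still scores 0.95.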

Recall

Recall is the proportion of actual positives that were identified correctly.

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

It's important for tasks where missing positives is costly.
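
A minimal sketch for binary labels, assuming the positive class is encoded as 1 (the function and parameter names are illustrative):

```python
def recall(y_true, y_pred, positive=1):
    """True positives divided by all actual positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0  # 0.0 if there are no actual positives
```

With true labels [1, 1, 1, 0, 0] and predictions [1, 0, 1, 1, 0], two of the three actual positives are found, so recall is 2/3.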

Precision

Precision is the proportion of positive identifications that were actually correct.

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

It is important in cases where false positives are a significant concern.
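
A minimal sketch for binary labels, assuming the positive class is encoded as 1 (the function and parameter names are illustrative):

```python
def precision(y_true, y_pred, positive=1):
    """True positives divided by all predicted positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if tp + fp else 0.0  # 0.0 if nothing was predicted positive
```

With true labels [1, 0, 0, 1, 0] and predictions [1, 1, 1, 1, 0], only two of the four positive predictions are correct, so precision is 0.5.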

AUC-ROC (Area Under the Receiver Operating Characteristic Curve)

AUC-ROC is a performance measurement for binary classification problems. It measures the area under the curve created by plotting the true positive rate against the false positive rate at various threshold settings. The AUC equals the probability that the model ranks a random positive example more highly than a random negative example.
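
The ranking interpretation gives a direct, if quadratic-time, way to compute AUC: compare every positive score with every negative score. The sketch below assumes labels of 0 and 1 and counts ties as half a win; `auc_roc` is an illustrative name, and library implementations usually integrate the ROC curve instead.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A tie between a positive and a negative score counts as half a correct ranking.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For labels [0, 0, 1, 1] with scores [0.1, 0.4, 0.35, 0.8], three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.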
