NLP Preprocessing Techniques

15 flashcards

Regex Filtering

Regex Filtering uses regular expressions to extract or replace patterns in text. It is often used to identify specific textual forms such as email addresses, URLs, or dates.
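
As a rough illustration of extracting and replacing patterns, here is a minimal sketch using Python's standard `re` module. The two patterns and the sample sentence are assumptions made up for this card; they are deliberately simple and would miss many valid email addresses and URLs.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not production-grade).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")

text = "Contact jane.doe@example.com or visit https://example.com/docs today."

# Extract: collect every substring that looks like an email address.
emails = EMAIL_RE.findall(text)

# Replace: mask every URL with a placeholder token.
masked = URL_RE.sub("<URL>", text)

print(emails)
print(masked)
```

Extraction keeps the matched forms for later use, while replacement with a placeholder keeps the sentence structure but hides the specific value.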

Bag of Words

Bag of Words is a simple representation of text that records which words occur in a document and how often, disregarding grammar and word order but keeping multiplicity.
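
One way to picture this is as a word-count table. The sketch below assumes whitespace splitting and lowercasing are good enough for the toy sentence; the function name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Count each word in the document; word order is discarded."""
    return Counter(document.lower().split())

bow = bag_of_words("the cat sat on the mat")
shuffled = bag_of_words("mat the on sat cat the")

print(bow["the"])       # multiplicity is kept
print(bow == shuffled)  # order does not matter
```

Two documents with the same words in a different order produce the same bag, which is exactly the information this representation throws away.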

Lowercasing

Lowercasing involves converting all characters in the text to their lowercase form to achieve case-insensitivity, which helps in standardizing the text and reducing the feature space.
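
The effect on the feature space can be seen with a tiny vocabulary. The token list below is made up for illustration.

```python
tokens = ["The", "the", "THE", "Apple", "apple"]

# Distinct surface forms before and after lowercasing.
vocab_before = set(tokens)
vocab_after = {token.lower() for token in tokens}

print(len(vocab_before), len(vocab_after))
```

The trade-off is that case distinctions that carry meaning (for example, a company name versus a fruit) are merged as well.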

Chunking

Chunking groups adjacent tokens into chunks (such as noun phrases) based on their syntactic patterns, typically their part-of-speech tags. It is often used in the context of named entity recognition or to explore phrase structure.
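
A minimal sketch of the idea, assuming the tokens already carry Penn Treebank style tags and that a noun phrase is "optional determiner, any adjectives, one or more nouns". The function `np_chunks` and the hand-tagged sentence are hypothetical; real chunkers use richer grammars or trained models.

```python
def np_chunks(tagged):
    """Group tokens matching the pattern DT? JJ* NN+ into noun-phrase chunks."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                     # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":        # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:                                    # at least one noun was found
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1                                   # no chunk starts here
    return chunks

# Hand-tagged example sentence (the tags are an assumption, not tagger output).
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

chunks = np_chunks(tagged)
print(chunks)
```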

Word Embedding

Word Embedding is the representation of words as dense vectors that capture semantic meaning and relationships between words; such vectors are commonly used as inputs to deep learning models.
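
To make "relationships between words" concrete, similarity between embeddings is commonly measured with cosine similarity. The three-dimensional vectors below are invented for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Hypothetical, hand-made vectors (not learned embeddings).
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_royal = cosine(embeddings["king"], embeddings["queen"])
sim_fruit = cosine(embeddings["king"], embeddings["apple"])
print(round(sim_royal, 3), round(sim_fruit, 3))
```

With these toy values, "king" ends up much closer to "queen" than to "apple", which is the kind of relationship a trained embedding is meant to encode.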

N-Grams

N-Grams are contiguous sequences of 'n' items from a given sample of text or speech, which help in capturing context and predicting the next item in the sequence.
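
A short sketch of extracting n-grams with a sliding window over a token list. The helper name `ngrams` and the sample phrase are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts[("to", "be")])
```

Counts like these are what a simple n-gram language model uses to estimate which item is likely to come next.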

Normalization

Normalization refers to the process of transforming text into a canonical (standard) form. This can include correcting spelling, expanding abbreviations, and unifying synonyms.
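
A minimal sketch of mapping variants to a canonical form with a lookup table. The `CANONICAL` table is a hypothetical, hand-written example covering an abbreviation, a spelling variant, and a synonym; real systems use much larger resources.

```python
# Hypothetical mapping from variant forms to one canonical form.
CANONICAL = {
    "pls": "please",      # abbreviation
    "colour": "color",    # spelling variant
    "automobile": "car",  # synonym
}

def normalize(tokens):
    """Replace each token with its canonical form when one is known."""
    return [CANONICAL.get(token, token) for token in tokens]

normalized = normalize("pls paint the automobile a nice colour".split())
print(normalized)
```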

Noise Removal

Noise Removal involves removing irrelevant items from the text, such as special characters, extra whitespace, and punctuation, which are not necessary for the analysis or model training.
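
A small sketch of this cleanup using two regular-expression substitutions. What counts as noise depends on the task; here it is assumed that everything other than letters, digits, underscores, and single spaces can be dropped.

```python
import re

def remove_noise(text: str) -> str:
    """Drop punctuation and special characters, then collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation and symbols become spaces
    return re.sub(r"\s+", " ", text).strip()  # squeeze runs of whitespace

cleaned = remove_noise("Hello,   world!!!  #nlp :)")
print(cleaned)
```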

Lemmatization

Lemmatization reduces words to their lemmas, which are their dictionary forms, using morphological analysis. Unlike the sometimes crude output of stemming, the results are proper words.
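
The dictionary-lookup side of this can be sketched with a toy table. The `LEMMAS` table below is hypothetical and hand-written; an actual lemmatizer relies on a full lexicon and morphological rules, usually together with the word's part of speech.

```python
# Hypothetical toy lemma table (an assumption for illustration only).
LEMMAS = {
    "mice": "mouse",
    "ran": "run",
    "studies": "study",
    "was": "be",
}

def lemmatize(word: str) -> str:
    """Look up the dictionary form; fall back to the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

lemmas = [lemmatize(w) for w in ["Mice", "ran", "studies", "cat"]]
print(lemmas)
```

Irregular forms such as "mice" and "ran" are the cases where a lookup of this kind differs most from suffix stripping.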

Part-of-Speech Tagging

Part-of-Speech Tagging assigns a part of speech, such as noun, verb, or adjective, to each word of the text, which is useful for understanding the syntax and role of words in sentences.

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection. It gives more weight to words that are frequent in one document but rare across the collection, and downscales words that are common everywhere.
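
A worked sketch of one common textbook variant, assuming tf is the term's relative frequency in the document and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries often use smoothed or otherwise adjusted formulas, so their numbers will differ. The corpus and the `tf_idf` helper are made up for illustration.

```python
import math
from collections import Counter

# Tiny tokenized corpus, invented for this example.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf_idf(term, doc, docs):
    """tf = relative frequency in doc; idf = log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

score_the = tf_idf("the", docs[0], docs)  # appears in every document
score_cat = tf_idf("cat", docs[0], docs)  # appears in two of three documents
score_mat = tf_idf("mat", docs[0], docs)  # appears only in this document

print(score_the, round(score_cat, 3), round(score_mat, 3))
```

Under this variant a word that occurs in every document gets an idf of log(1) = 0, so its score vanishes no matter how often it appears.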

Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies named entities within text into predefined categories such as persons, organizations, locations, etc., to extract useful information.

Tokenization

Tokenization is the process of breaking down text into individual words, phrases, or symbols, called tokens, to make the text easier to analyze or process.
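
A minimal word-and-punctuation tokenizer, sketched with a single regular expression. The pattern is an assumption chosen for simplicity; it would split contractions and abbreviations in ways a real tokenizer handles more carefully.

```python
import re

def tokenize(text: str):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Hello, world! NLP is fun.")
print(tokens)
```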

Stemming

Stemming is a heuristic process that cuts words down to a root form, usually by stripping inflectional endings, so that related words share a common base. The resulting stem is not always a real word.
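
The heuristic nature is easy to see in a naive suffix stripper. This is a deliberately crude sketch, not the Porter algorithm or any library's stemmer; the suffix list and the minimum stem length of three are arbitrary assumptions.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(w) for w in ["jumping", "jumped", "jumps", "studies"]]
print(stems)
```

The three forms of "jump" collapse to one stem, while "studies" becomes "studi", a typical example of a stem that is not a dictionary word.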

Stop Words Removal

Stop Words Removal is the process of eliminating common words (such as 'the' and 'is') that usually carry little lexical content, so that analysis can focus on the more informative words.
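
A short sketch of filtering tokens against a stop list. The `STOP_WORDS` set is a tiny hand-picked assumption; practical lists contain a hundred or more entries and vary by language and task.

```python
# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "of", "on", "and"}

def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

content_words = remove_stop_words("The cat is on the mat".split())
print(content_words)
```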


© Hypatia.Tech. 2024 All rights reserved.