NLP Preprocessing Techniques
Regex Filtering
Regex Filtering uses regular expressions to extract or replace patterns in text; it is commonly applied to identify specific textual forms such as email addresses, URLs, or dates.
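A minimal sketch using Python's built-in re module; the sample text and the simplified email and date patterns are illustrative assumptions, not patterns prescribed by the card.

```python
import re

text = "Contact us at support@example.com or sales@example.org by 2024-06-01."

# Extract email addresses (a deliberately simplified pattern;
# production-grade email regexes are far more involved).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['support@example.com', 'sales@example.org']

# Replace ISO-style dates with a placeholder token.
redacted = re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text)
print(redacted)  # 'Contact us at support@example.com or sales@example.org by <DATE>.'
```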
Bag of Words
Bag of Words is a simple representation of text that records which words occur in a document and how often, disregarding grammar and word order but keeping multiplicity.
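A sketch of the idea using only the Python standard library; the two-document corpus is a made-up example.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build a shared vocabulary, then count word occurrences per document.
vocabulary = sorted({word for doc in docs for word in doc.split()})
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

for doc in docs:
    counts = Counter(doc.split())
    print([counts[word] for word in vocabulary])
# [1, 0, 0, 1, 1, 1, 2]  <- word order is lost, but 'the' keeps its count of 2
# [0, 1, 1, 0, 1, 1, 2]
```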
Lowercasing
Lowercasing involves converting all characters in the text to their lowercase form to achieve case-insensitivity, which helps in standardizing the text and reducing the feature space.
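Lowercasing is a one-liner with Python's built-in str.lower(); the sentence is an arbitrary example.

```python
text = "Apple fans love their iPhones. APPLE pie fans love apples."

# 'Apple' and 'APPLE' now map to the same token 'apple',
# shrinking the feature space downstream.
print(text.lower())  # 'apple fans love their iphones. apple pie fans love apples.'
```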
Chunking
Chunking groups adjacent tokens into chunks based on their syntactic patterns, often used in the context of named entity recognition or to explore phrase structure.
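A sketch with NLTK's RegexpParser, assuming NLTK and its tokenizer and tagger data packages have been downloaded; the noun-phrase grammar is a common textbook pattern, chosen here only for illustration.

```python
import nltk  # assumes nltk.download(...) has fetched the tokenizer and tagger data

sentence = "The quick brown fox jumped over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk noun phrases: an optional determiner, any adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)  # subtrees labeled 'NP' group spans like 'The quick brown fox'
```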
Word Embedding
Word Embedding represents words as dense vectors that capture semantic meaning and the relationships between words; such representations are widely used in deep learning models.
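A sketch with gensim's Word2Vec, assuming gensim 4.x is installed; the toy corpus only demonstrates the API, since useful embeddings require far more training text.

```python
from gensim.models import Word2Vec

# Pre-tokenized toy corpus; real embeddings are trained on millions of sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "animals"],
]

# Train 50-dimensional vectors (gensim 4.x calls the dimension 'vector_size').
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))  # nearest words in the embedding space
```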
N-Grams
N-Grams are contiguous sequences of 'n' items from a given sample of text or speech, which helps in capturing the context and predicting the next item in the sequence.
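A sliding-window sketch in plain Python; the ngrams helper is a hypothetical name used for illustration (NLTK offers nltk.ngrams for the same job).

```python
def ngrams(tokens, n):
    """Slide a window of size n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
print(ngrams(tokens, 3))
# [('to', 'be', 'or'), ('be', 'or', 'not'), ('or', 'not', 'to'), ('not', 'to', 'be')]
```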
Normalization
Normalization refers to the process of transforming text into a canonical (standard) form. This can include correcting spelling, expanding abbreviations, and unifying synonyms.
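A dictionary-lookup sketch; the CANONICAL table and the normalize helper are hypothetical stand-ins for the much larger resources (spelling dictionaries, abbreviation lists) that real pipelines use.

```python
# Hypothetical lookup table mapping variants and abbreviations to canonical forms.
CANONICAL = {
    "u": "you",
    "gr8": "great",
    "colour": "color",
    "thx": "thanks",
}

def normalize(text):
    tokens = text.lower().split()
    return " ".join(CANONICAL.get(tok, tok) for tok in tokens)

print(normalize("U did a gr8 job on the colour scheme"))
# 'you did a great job on the color scheme'
```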
Noise Removal
Noise Removal strips irrelevant items such as special characters, extra white space, and punctuation from the text, since they are not needed for the analysis or for model training.
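A sketch with Python's re module; which characters count as noise is task-dependent, so the patterns below are illustrative assumptions.

```python
import re

raw = "Hello!!!   Visit <b>our</b> site ~ 50% off :)\n\n"

text = re.sub(r"<[^>]+>", " ", raw)            # drop HTML tags
text = re.sub(r"[^A-Za-z0-9 ]+", " ", text)    # drop special characters/punctuation
text = re.sub(r"\s+", " ", text).strip()       # collapse runs of white space
print(text)  # 'Hello Visit our site 50 off'
```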
Lemmatization
Lemmatization reduces each word to its lemma, i.e. its dictionary form, using morphological analysis; unlike the sometimes crude output of stemming, the result is always a proper word.
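A sketch with NLTK's WordNetLemmatizer, assuming NLTK and its WordNet data are installed; note that supplying the part of speech changes the result.

```python
from nltk.stem import WordNetLemmatizer  # assumes the WordNet data has been downloaded

lemmatizer = WordNetLemmatizer()

# The pos hint matters: 'v' = verb, 'a' = adjective; the default is noun.
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("mice"))              # 'mouse' (a proper word, unlike a stem)
```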
Part-of-Speech Tagging
Part-of-Speech Tagging assigns a part of speech, such as noun, verb, or adjective, to each word in the text, which is useful for understanding the syntax and the role each word plays in a sentence.
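A sketch with NLTK's default tagger, assuming its tokenizer and tagger data packages are installed; the Penn Treebank tags in the comment are the expected output for this sentence.

```python
import nltk  # assumes the tokenizer and tagger data packages are downloaded

tokens = nltk.word_tokenize("She sells seashells by the seashore")
print(nltk.pos_tag(tokens))
# e.g. [('She', 'PRP'), ('sells', 'VBZ'), ('seashells', 'NNS'),
#       ('by', 'IN'), ('the', 'DT'), ('seashore', 'NN')]
```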
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection, emphasizing unique words and downscaling common ones.
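A sketch using scikit-learn's TfidfVectorizer, assuming scikit-learn is installed; the three-document corpus is made up to show common words being down-weighted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats chase mice",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)  # sparse (n_docs, n_terms) matrix

# Terms shared across documents ('the', 'sat', 'on') receive lower weights
# than terms unique to one document ('mice', 'chase').
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))
```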
Named Entity Recognition
Named Entity Recognition (NER) identifies and classifies named entities within text into predefined categories such as persons, organizations, locations, etc., to extract useful information.
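A sketch with spaCy, assuming the library and its small English model are installed; the sentence and the labels shown in the comment are illustrative.

```python
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced Apple's new campus in Austin in 2019.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. 'Tim Cook' PERSON, 'Apple' ORG, 'Austin' GPE, '2019' DATE
```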
Tokenization
Tokenization is the process of breaking down text into individual words, phrases, or symbols, called tokens, to make the text easier to analyze or process.
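A sketch contrasting naive whitespace splitting with a small regex tokenizer; both use only the Python standard library, and the regex is one simple choice among many.

```python
import re

text = "Don't panic: NLP isn't magic!"

# Whitespace splitting leaves punctuation glued to words...
print(text.split())
# ["Don't", 'panic:', 'NLP', "isn't", 'magic!']

# ...while a simple regex separates word-like units from punctuation tokens.
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))
# ["Don't", 'panic', ':', 'NLP', "isn't", 'magic', '!']
```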
Stemming
Stemming is a heuristic process that cuts words down to their root form, typically by removing inflectional endings, so that related words are reduced to a common base.
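A sketch with NLTK's PorterStemmer (assuming NLTK is installed; no extra data download is needed); the sample words show both the useful cuts and the crude ones.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["running", "ran", "studies", "relational", "ponies"]:
    print(word, "->", stemmer.stem(word))
# running -> run, ran -> ran, studies -> studi, relational -> relat, ponies -> poni
# The crude outputs ('studi', 'poni') are exactly what lemmatization avoids.
```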
Stop Words Removal
Stop Words Removal eliminates common words with little lexical content (such as 'the', 'is', etc.) from the text, so that analysis can focus on the more informative words.
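A sketch with a small hand-picked stop list; real pipelines usually draw on fuller lists such as NLTK's stopwords corpus, and the remove_stop_words helper is a hypothetical name.

```python
# A tiny illustrative stop list; NLTK ships a fuller one via
# nltk.corpus.stopwords.words('english') once its data is downloaded.
STOP_WORDS = {"the", "is", "a", "an", "on", "of", "and", "to", "in"}

def remove_stop_words(text):
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The cat is sitting on the mat"))
# ['cat', 'sitting', 'mat']
```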