NLP Preprocessing Techniques
Regex Filtering
Regex Filtering uses regular expressions to extract or replace patterns in text; it is commonly applied to identify specific textual forms such as email addresses, URLs, or dates.
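A minimal sketch using Python's built-in re module; the sample text and the simplified email and date patterns are illustrative assumptions, not patterns prescribed by the card.

```python
import re

text = "Contact us at support@example.com or sales@example.org by 2024-06-01."

# Extract email addresses (a deliberately simplified pattern;
# production-grade email regexes are far more involved).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['support@example.com', 'sales@example.org']

# Replace ISO-style dates with a placeholder token.
redacted = re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text)
print(redacted)  # 'Contact us at support@example.com or sales@example.org by <DATE>.'
```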
Bag of Words
Bag of Words is a simple representation of text that records which words occur in a document and how often, disregarding grammar and word order but keeping multiplicity.
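A sketch of the idea using only the Python standard library; the two-document corpus is a made-up example.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build a shared vocabulary, then count word occurrences per document.
vocabulary = sorted({word for doc in docs for word in doc.split()})
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

for doc in docs:
    counts = Counter(doc.split())
    print([counts[word] for word in vocabulary])
# [1, 0, 0, 1, 1, 1, 2]  <- word order is lost, but 'the' keeps its count of 2
# [0, 1, 1, 0, 1, 1, 2]
```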
Lowercasing
Lowercasing involves converting all characters in the text to their lowercase form to achieve case-insensitivity, which helps in standardizing the text and reducing the feature space.
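Lowercasing is a one-liner with Python's built-in str.lower(); the sentence is an arbitrary example.

```python
text = "Apple fans love their iPhones. APPLE pie fans love apples."

# 'Apple' and 'APPLE' now map to the same token 'apple',
# shrinking the feature space downstream.
print(text.lower())  # 'apple fans love their iphones. apple pie fans love apples.'
```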
Chunking
Chunking groups adjacent tokens into chunks based on their syntactic patterns, often used in the context of named entity recognition or to explore phrase structure.
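A sketch with NLTK's RegexpParser, assuming NLTK and its tokenizer and tagger data packages have been downloaded; the noun-phrase grammar is a common textbook pattern, chosen here only for illustration.

```python
import nltk  # assumes nltk.download(...) has fetched the tokenizer and tagger data

sentence = "The quick brown fox jumped over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk noun phrases: an optional determiner, any adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)  # subtrees labeled 'NP' group spans like 'The quick brown fox'
```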
Word Embedding
Word Embedding represents words as dense vectors that capture semantic meaning and the relationships between words; such representations are widely used in deep learning models.
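A sketch with gensim's Word2Vec, assuming gensim 4.x is installed; the toy corpus only demonstrates the API, since useful embeddings require far more training text.

```python
from gensim.models import Word2Vec

# Pre-tokenized toy corpus; real embeddings are trained on millions of sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "animals"],
]

# Train 50-dimensional vectors (gensim 4.x calls the dimension 'vector_size').
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))  # nearest words in the embedding space
```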
N-Grams
N-Grams are contiguous sequences of 'n' items from a given sample of text or speech, which helps in capturing the context and predicting the next item in the sequence.
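A sliding-window sketch in plain Python; the ngrams helper is a hypothetical name used for illustration (NLTK offers nltk.ngrams for the same job).

```python
def ngrams(tokens, n):
    """Slide a window of size n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
print(ngrams(tokens, 3))
# [('to', 'be', 'or'), ('be', 'or', 'not'), ('or', 'not', 'to'), ('not', 'to', 'be')]
```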
Normalization
Normalization refers to the process of transforming text into a canonical (standard) form. This can include correcting spelling, expanding abbreviations, and unifying synonyms.
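A dictionary-lookup sketch; the CANONICAL table and the normalize helper are hypothetical stand-ins for the much larger resources (spelling dictionaries, abbreviation lists) that real pipelines use.

```python
# Hypothetical lookup table mapping variants and abbreviations to canonical forms.
CANONICAL = {
    "u": "you",
    "gr8": "great",
    "colour": "color",
    "thx": "thanks",
}

def normalize(text):
    tokens = text.lower().split()
    return " ".join(CANONICAL.get(tok, tok) for tok in tokens)

print(normalize("U did a gr8 job on the colour scheme"))
# 'you did a great job on the color scheme'
```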
Noise Removal
Noise Removal strips irrelevant items such as special characters, extra white space, and punctuation from the text, since they are not needed for the analysis or for model training.
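A sketch with Python's re module; which characters count as noise is task-dependent, so the patterns below are illustrative assumptions.

```python
import re

raw = "Hello!!!   Visit <b>our</b> site ~ 50% off :)\n\n"

text = re.sub(r"<[^>]+>", " ", raw)            # drop HTML tags
text = re.sub(r"[^A-Za-z0-9 ]+", " ", text)    # drop special characters/punctuation
text = re.sub(r"\s+", " ", text).strip()       # collapse runs of white space
print(text)  # 'Hello Visit our site 50 off'
```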
Lemmatization
Lemmatization reduces each word to its lemma, i.e. its dictionary form, using morphological analysis; unlike the sometimes crude output of stemming, the result is always a proper word.
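A sketch with NLTK's WordNetLemmatizer, assuming NLTK and its WordNet data are installed; note that supplying the part of speech changes the result.

```python
from nltk.stem import WordNetLemmatizer  # assumes the WordNet data has been downloaded

lemmatizer = WordNetLemmatizer()

# The pos hint matters: 'v' = verb, 'a' = adjective; the default is noun.
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("mice"))              # 'mouse' (a proper word, unlike a stem)
```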
Part-of-Speech Tagging
Part-of-Speech Tagging assigns a part of speech, such as noun, verb, or adjective, to each word in the text, which is useful for understanding the syntax and the role each word plays in a sentence.
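A sketch with NLTK's default tagger, assuming its tokenizer and tagger data packages are installed; the Penn Treebank tags in the comment are the expected output for this sentence.

```python
import nltk  # assumes the tokenizer and tagger data packages are downloaded

tokens = nltk.word_tokenize("She sells seashells by the seashore")
print(nltk.pos_tag(tokens))
# e.g. [('She', 'PRP'), ('sells', 'VBZ'), ('seashells', 'NNS'),
#       ('by', 'IN'), ('the', 'DT'), ('seashore', 'NN')]
```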
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection, emphasizing unique words and downscaling common ones.
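A sketch using scikit-learn's TfidfVectorizer, assuming scikit-learn is installed; the three-document corpus is made up to show common words being down-weighted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats chase mice",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)  # sparse (n_docs, n_terms) matrix

# Terms shared across documents ('the', 'sat', 'on') receive lower weights
# than terms unique to one document ('mice', 'chase').
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))
```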
Named Entity Recognition
Named Entity Recognition (NER) identifies and classifies named entities within text into predefined categories such as persons, organizations, locations, etc., to extract useful information.
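A sketch with spaCy, assuming the library and its small English model are installed; the sentence and the labels shown in the comment are illustrative.

```python
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced Apple's new campus in Austin in 2019.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. 'Tim Cook' PERSON, 'Apple' ORG, 'Austin' GPE, '2019' DATE
```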
Tokenization
Tokenization is the process of breaking down text into individual words, phrases, or symbols, called tokens, to make the text easier to analyze or process.
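A sketch contrasting naive whitespace splitting with a small regex tokenizer; both use only the Python standard library, and the regex is one simple choice among many.

```python
import re

text = "Don't panic: NLP isn't magic!"

# Whitespace splitting leaves punctuation glued to words...
print(text.split())
# ["Don't", 'panic:', 'NLP', "isn't", 'magic!']

# ...while a simple regex separates word-like units from punctuation tokens.
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))
# ["Don't", 'panic', ':', 'NLP', "isn't", 'magic', '!']
```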
Stemming
Stemming is a heuristic process that cuts words down to their root form, typically by removing inflectional endings, so that related words are reduced to a common base.
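A sketch with NLTK's PorterStemmer (assuming NLTK is installed; no extra data download is needed); the sample words show both the useful cuts and the crude ones.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["running", "ran", "studies", "relational", "ponies"]:
    print(word, "->", stemmer.stem(word))
# running -> run, ran -> ran, studies -> studi, relational -> relat, ponies -> poni
# The crude outputs ('studi', 'poni') are exactly what lemmatization avoids.
```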
Stop Words Removal
Stop Words Removal eliminates common words with little lexical content (such as 'the', 'is', etc.) from the text, so that analysis can focus on the more informative words.
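A sketch with a small hand-picked stop list; real pipelines usually draw on fuller lists such as NLTK's stopwords corpus, and the remove_stop_words helper is a hypothetical name.

```python
# A tiny illustrative stop list; NLTK ships a fuller one via
# nltk.corpus.stopwords.words('english') once its data is downloaded.
STOP_WORDS = {"the", "is", "a", "an", "on", "of", "and", "to", "in"}

def remove_stop_words(text):
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The cat is sitting on the mat"))
# ['cat', 'sitting', 'mat']
```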