Corpus Linguistics Basics

Corpus

A corpus is a large, structured collection of texts used for statistical analysis and hypothesis testing in NLP. It is significant for training and evaluating language models.

Concordance

A concordance is an alphabetical list of the principal words used in a corpus, with their immediate contexts. It's significant in NLP for understanding word usage and collocations.
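
A minimal sketch of a keyword-in-context (KWIC) view, one common way to display concordance lines. The function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for illustration, and the regex tokenizer is deliberately crude.

```python
import re

def kwic(text, keyword, width=3):
    """Return each hit of `keyword` with up to `width` tokens of context per side."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text
sample = "The cat sat on the mat. The dog chased the cat."
concordance_lines = kwic(sample, "cat", width=2)
```

Each returned line shows the keyword in brackets with its neighbours, which makes recurring contexts easy to scan.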

Annotation

Annotation is the process of adding interpretative information to a corpus, such as grammatical tags or semantic labels. It's significant in NLP for building training datasets.

Tokenization

Tokenization is the process of breaking a text into words, phrases, symbols, or other meaningful elements called tokens. It's significant for input preparation in NLP tasks.
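
A minimal sketch of a regex tokenizer, assuming English-like text where words may contain one internal apostrophe and every other non-space symbol is its own token. The function name `tokenize` is hypothetical; production tokenizers handle many more cases (URLs, hyphens, numbers, other scripts).

```python
import re

def tokenize(text):
    """Split text into word tokens (keeping contractions whole) and punctuation tokens."""
    # \w+(?:'\w+)?  matches a word with an optional apostrophe suffix, e.g. "it's"
    # [^\w\s]       matches any single character that is neither word nor whitespace
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = tokenize("Don't panic: it's only NLP!")
```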

POS Tagging

Part-of-speech tagging is the process of assigning a word class, such as noun, verb, or adjective, to each token in a corpus. It's important for syntactic analysis in NLP.
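
A toy sketch of tagging by dictionary lookup, only to show what tagged output looks like. The lexicon `TOY_LEXICON` and the function `tag` are invented for illustration; real taggers are statistical or neural models that use context to resolve ambiguous words. The tag names follow the Universal Dependencies style (DET, NOUN, VERB, ADP, X for unknown).

```python
# Hypothetical hand-written lexicon; a real tagger learns this from annotated data.
TOY_LEXICON = {
    "the": "DET",
    "cat": "NOUN",
    "sat": "VERB",
    "on": "ADP",
    "mat": "NOUN",
}

def tag(tokens, lexicon=TOY_LEXICON, default="X"):
    """Pair each token with its tag, falling back to `default` for unknown words."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]

tagged = tag(["The", "cat", "sat", "on", "the", "sofa"])
```

The list of `(token, tag)` pairs is also a simple example of an annotated corpus.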

Collocation

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. In NLP, it helps in understanding language patterns and phrase building.
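
One common way to quantify "more often than expected by chance" is pointwise mutual information (PMI), which compares how often two words occur together with how often they would if they were independent. The sketch below assumes adjacent word pairs and simple maximum-likelihood estimates; the function name `pmi` and the sample tokens are illustrative, and other association measures (for example log-likelihood) are also widely used.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """PMI (in bits) of the adjacent pair (w1, w2); -inf if the pair never occurs."""
    if len(tokens) < 2:
        return float("-inf")
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams[(w1, w2)] / (len(tokens) - 1)
    if p_pair == 0:
        return float("-inf")
    p_w1 = unigrams[w1] / len(tokens)
    p_w2 = unigrams[w2] / len(tokens)
    return math.log2(p_pair / (p_w1 * p_w2))

# Hypothetical sample
sample_tokens = "new york is big and new york is old".split()
score = pmi(sample_tokens, "new", "york")
```

A positive score means the pair co-occurs more often than independence would predict. On tiny samples PMI is unreliable, so real studies usually apply a minimum frequency threshold.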

Corpus Representativeness

Corpus representativeness refers to the extent to which a corpus contains a sample of language that reflects its use in real-world contexts. It is key for the validity of NLP models.

n-gram

An n-gram is a contiguous sequence of n items from a given sample of text or speech. It is significant in NLP for estimating the probability of words and phrases.
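
A minimal sketch of extracting n-grams from a token list and using bigram counts to estimate the probability of a word given the previous word. The names `ngrams` and `bigram_prob` are hypothetical, and the estimate is an unsmoothed maximum-likelihood one, so unseen pairs get probability zero.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_prob(tokens, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev as a first element)."""
    bigrams = Counter(ngrams(tokens, 2))
    prev_count = sum(1 for t in tokens[:-1] if t == prev)
    return bigrams[(prev, word)] / prev_count if prev_count else 0.0

sample_tokens = "the cat sat on the mat".split()
trigrams = ngrams(sample_tokens, 3)
p_cat_given_the = bigram_prob(sample_tokens, "the", "cat")
```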

Corpus Frequency

Corpus frequency is the count of how often a particular word or set of words appears in a corpus. In NLP, it's used for understanding word usage and informing probability models.
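
A minimal sketch of counting word frequencies across a small corpus with `collections.Counter`. The three sample texts and the function name `word_frequencies` are assumptions for illustration; lowercasing and the `\w+` tokenizer are simplifying choices.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count lowercase word tokens across all texts in the corpus."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

# Hypothetical three-document corpus
corpus = ["The cat sat.", "The dog sat.", "A cat ran."]
freqs = word_frequencies(corpus)
total = sum(freqs.values())
relative_the = freqs["the"] / total  # relative frequency of "the"
```

Dividing a raw count by the corpus size gives a relative frequency, which makes counts comparable across corpora of different sizes.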

Stop Words

Stop words are common words such as 'the', 'is', and 'at', which are often filtered out of texts before processing in NLP, as they usually carry little meaningful information.
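
A minimal sketch of stop-word filtering. The set `STOP_WORDS` below is a tiny hand-picked list for illustration, not a standard one; real lists are longer and vary by library and task.

```python
# Tiny illustrative stop list (an assumption, not a standard resource)
STOP_WORDS = {"the", "is", "at", "a", "on"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

filtered = remove_stop_words(["The", "cat", "is", "at", "home"])
```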

Lemma

A lemma is the canonical form, dictionary form, or citation form of a set of word forms. In NLP, lemmatization is used to reduce inflected or variant forms to their base form.
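
A toy sketch of lemmatization by table lookup, just to show the mapping from inflected forms to a lemma. The table `TOY_LEMMAS` and the function `lemmatize` are invented for illustration; real lemmatizers rely on large lexicons, morphological rules, and often the word's part of speech.

```python
# Hypothetical lookup table mapping inflected forms to their lemma
TOY_LEMMAS = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "mice": "mouse",
    "better": "good",
}

def lemmatize(tokens, table=TOY_LEMMAS):
    """Map each token to its lemma; unknown tokens are returned lowercased."""
    return [table.get(tok.lower(), tok.lower()) for tok in tokens]

lemmas = lemmatize(["Mice", "ran", "quickly"])
```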

Corpus Homogeneity

Corpus homogeneity refers to the degree to which texts in a corpus are similar in style, genre, or topic. It's important for specialized NLP tasks that require domain-specific training data.
