Corpus Linguistics Basics
Corpus Representativeness
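Corpus representativeness is the extent to which a corpus samples the language variety it is meant to represent, so that findings drawn from it reflect real-world use. It matters in NLP because a model trained or evaluated on an unrepresentative corpus may not generalize to the language it will actually encounter.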
Annotation
Annotation is the process of adding interpretative information to a corpus, such as grammatical tags or semantic labels. It is essential for building labeled training datasets in NLP.
Corpus Frequency
Corpus frequency is the count of how often a particular word or set of words appears in a corpus. In NLP, it's used for understanding word usage and informing probability models.
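As a minimal sketch of the definition above, the snippet below counts absolute and relative frequencies with the standard library. The function name `word_frequencies` and the sample sentence are illustrative assumptions, and the whitespace split stands in for a real tokenizer.

```python
from collections import Counter

def word_frequencies(tokens):
    """Return absolute counts and relative frequencies for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Relative frequency = count / total number of tokens in the corpus.
    relative = {word: count / total for word, count in counts.items()}
    return counts, relative

# Toy corpus (hypothetical); a real corpus would be far larger.
tokens = "the cat sat on the mat and the dog sat too".split()
counts, relative = word_frequencies(tokens)
print(counts.most_common(2))
```

Relative frequencies are usually the more useful figure when comparing corpora of different sizes.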
n-gram
An n-gram is a contiguous sequence of n items (such as words or characters) from a given sample of text or speech. In NLP, n-gram counts are used to estimate the probability of a word given the words that precede it.
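The following sketch extracts word n-grams and uses bigram counts for an unsmoothed maximum-likelihood estimate of a next-word probability. The helper names `ngrams` and `bigram_probability` are hypothetical, and practical language models would add smoothing.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_probability(tokens, prev_word, word):
    """Unsmoothed estimate of P(word | prev_word) from bigram counts."""
    bigram_counts = Counter(ngrams(tokens, 2))
    # Number of bigrams that start with prev_word.
    context_count = sum(
        count for (first, _), count in bigram_counts.items() if first == prev_word
    )
    if context_count == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_count

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
print(bigram_probability(tokens, "the", "cat"))
```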
Corpus
A corpus is a large and structured set of texts that are used for statistical analysis and hypothesis testing in NLP. It is significant for training and evaluating language models.
Lemma
A lemma is the canonical form, dictionary form, or citation form of a set of related word forms (for example, 'run' for 'runs', 'ran', and 'running'). In NLP, lemmatization reduces inflected or variant forms to this base form.
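To illustrate the idea only, here is a toy lookup-table lemmatizer. The table `LEMMA_TABLE` and the function `lemmatize` are made up for this card; real lemmatizers rely on large lexicons and usually on part-of-speech information.

```python
# Hypothetical, hand-written mapping from inflected forms to lemmas.
LEMMA_TABLE = {
    "runs": "run",
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "better": "good",
}

def lemmatize(token):
    """Return the lemma for a token, or the lowercased token if it is unknown."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)

print([lemmatize(t) for t in ["Mice", "ran", "quickly"]])
```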
Tokenization
Tokenization is the process of breaking a text into words, phrases, symbols, or other meaningful elements called tokens. It's significant for input preparation in NLP tasks.
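A minimal regex-based sketch of word-level tokenization follows. The pattern and the name `tokenize` are assumptions chosen for illustration: the pattern keeps word-internal apostrophes together and emits each punctuation mark as its own token, which is only one of many possible conventions.

```python
import re

# A word (optionally with internal apostrophes), or a single
# non-space, non-word character such as punctuation.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't panic: it's only NLP!"))
```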
Corpus Homogeneity
Corpus homogeneity refers to the degree to which texts in a corpus are similar in style, genre, or topic. It's important for specialized NLP tasks that require domain-specific training data.
POS Tagging
Part-of-speech tagging is the process of assigning a word class, such as noun, verb, or adjective, to each token in a corpus. It is important for syntactic analysis in NLP.
Collocation
A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. In NLP, it helps in understanding language patterns and phrase building.
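One common way to quantify "more often than expected by chance" is pointwise mutual information (PMI), which compares the observed probability of a word pair with the product of the individual word probabilities. The sketch below is an assumption-laden illustration: the name `pmi_scores`, the `min_count` threshold, and the toy sentence are all hypothetical, and PMI is only one of several association measures in use.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by PMI = log2(P(w1, w2) / (P(w1) * P(w2)))."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        # Rare pairs get unreliable PMI values, so skip them.
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "new york is big and new york is busy but the city is new".split()
for pair, score in sorted(pmi_scores(tokens).items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than independence would predict.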
Stop Words
Stop words are common words such as 'the', 'is', and 'at', which are often filtered out of texts before processing in NLP because they usually carry little content information.
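Filtering can be sketched in a few lines. The stop list `STOP_WORDS` below is a tiny hand-picked example, not a standard list, and `remove_stop_words` is a hypothetical helper name.

```python
# Small illustrative stop list; real lists contain many more entries.
STOP_WORDS = {"the", "is", "at", "a", "an", "of", "on", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]

print(remove_stop_words("The cat is at the door".split()))
```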
Concordance
A concordance is an alphabetical list of the principal words used in a corpus, with their immediate contexts. It's significant in NLP for understanding word usage and collocations.
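Concordances are often displayed in keyword-in-context (KWIC) form, with each occurrence of a word shown alongside its neighbours. The sketch below assumes a pre-tokenized text; the function name `concordance` and the `window` parameter are illustrative choices, and it looks up one keyword rather than building the full alphabetical listing.

```python
def concordance(tokens, keyword, window=3):
    """Return (left context, match, right context) for each occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

tokens = "the cat sat on the mat while the dog slept".split()
for left, match, right in concordance(tokens, "the", window=2):
    print(f"{left:>12} [{match}] {right}")
```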
© Hypatia.Tech. 2024 All rights reserved.