Transformers in NLP
Layer Normalization
Layer normalization is a technique for stabilizing the activations of a neural network, applied inside each Transformer sub-layer. Unlike batch normalization, it normalizes across the feature dimension rather than across the batch, which is important for training deep architectures.
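A minimal NumPy sketch of the idea (the function name, shapes, and example values are illustrative, not from the card):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's features to zero mean / unit variance,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])          # one token, four features
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

Note that the statistics are computed per position over the features, so the result is independent of what other examples are in the batch.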
Self-Attention
Self-Attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. It allows the model to integrate information from different parts of the sequence.
Scaled Dot-Product Attention
This attention function computes the dot products of the query with all keys, divides each by √d_k (the square root of the key dimension), and applies a softmax function to obtain the weights on the values. It is used in the Transformer's attention mechanisms.
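The formula can be sketched directly in NumPy; the random inputs below are only to exercise the function:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # scaling keeps softmax out of saturated regions
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so the output for each query is a convex combination of the value vectors.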
Query, Key, and Value Vectors
In the attention mechanism, the input is transformed into query, key, and value vectors using learned weights. These vectors are used in computing the attention scores, reflecting how much attention is paid to other tokens when producing a representation of the current token.
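A sketch of the projections (in practice the weight matrices are learned; here they are random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X = rng.normal(size=(5, d_model))         # 5 tokens, each a d_model-dim embedding
W_q = rng.normal(size=(d_model, d_k))     # learned in practice; random here
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_q, X @ W_k, X @ W_v       # per-token query, key, value vectors
```

Each token gets its own query, key, and value; comparing a token's query against every token's key yields its attention scores.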
Feed-Forward Neural Networks in Transformers
The feed-forward neural networks in the Transformer model are fully connected layers applied to each position separately and identically. They consist of two linear transformations with a ReLU activation in between.
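The two-linear-layers-with-ReLU structure can be sketched as follows (dimensions and random weights are assumptions for illustration):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between,
    # applied identically and independently at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                       # inner layer is typically wider
x = rng.normal(size=(5, d_model))           # 5 positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)
```

Because the same weights are applied at each position with no mixing across positions, running the network on one position alone gives the same result as that position's row of the full output.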
Positional Encoding
Positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks to provide a notion of token order, which the model would otherwise lack due to its non-recurrent nature.
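A sketch of the sinusoidal variant, where even indices use sine and odd indices use cosine at geometrically spaced frequencies:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]         # frequency index
    angles = pos / (10000 ** (2 * i / d_model))  # wavelengths from 2*pi to 10000*2*pi
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(10, 8)
```

The resulting matrix is simply added to the token embeddings, giving each position a distinct, bounded signature.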
Transformer Model
The Transformer is an architecture for transforming one sequence into another using two components, an encoder and a decoder, without any recurrence. It relies on self-attention mechanisms to weigh the importance of different parts of the input data.
Attention Mechanism
Attention mechanisms in NLP allow models to focus on specific parts of the input sequence when generating a particular part of the output, improving the ability to remember long-range dependencies. They are key to transformer architectures.
Cross-Attention
Cross-Attention is an attention mechanism where the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. It helps the decoder focus on relevant parts of the input sequence.
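A sketch showing the asymmetry: queries are projected from decoder states, while keys and values are projected from encoder outputs (shapes and random weights are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, W_q, W_k, W_v):
    Q = decoder_states @ W_q        # queries come from the decoder
    K = encoder_outputs @ W_k       # keys come from the encoder output
    V = encoder_outputs @ W_v       # values come from the encoder output
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
dec = rng.normal(size=(4, 8))       # 4 decoder positions
enc = rng.normal(size=(6, 8))       # 6 encoder positions (source can differ in length)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = cross_attention(dec, enc, W_q, W_k, W_v)
```

The output has one row per decoder position, each a weighted summary of the encoder's representation of the source sequence.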
Transformer Fine-Tuning
Transformer fine-tuning involves adjusting a pre-trained Transformer model on a specific task by training it further on a smaller dataset relevant to the task. This is crucial for adapting the general capabilities of Transformer models to more specialized tasks.
Masked Self-Attention
Masked self-attention in the Transformer's decoder ensures that the prediction for a certain position does not depend on the future tokens in the sequence. A mask is applied to the input of the softmax step of the self-attention to prevent future positions from being accessed.
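The mask can be sketched as setting future positions' scores to negative infinity before the softmax, so they receive zero weight (the uniform scores here are just for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_scores(scores):
    # Entries above the diagonal (j > i) are set to -inf,
    # so softmax assigns them exactly zero weight.
    seq_len = scores.shape[-1]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(mask, -np.inf, scores)

scores = np.zeros((3, 3))                 # uniform scores, for illustration
weights = softmax(masked_scores(scores))
# Position 0 attends only to itself; position 1 to positions 0-1; and so on.
```

With uniform scores, row i spreads its weight evenly over positions 0..i, confirming that no weight leaks to future tokens.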
Multi-Head Attention
Multi-head attention consists of several attention layers running in parallel. This enables the model to jointly attend to information from different representation subspaces at different positions, capturing various aspects of information.
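A compact sketch of the split-attend-concatenate pattern (head count, dimensions, and random weights are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into heads: (num_heads, seq_len, d_k).
    split = lambda M: M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)  # attention per head
    heads = softmax(scores) @ Vh
    # Concatenate the heads and project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                               # 5 tokens, d_model = 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, num_heads=2, W_q=W_q, W_k=W_k, W_v=W_v, W_o=W_o)
```

Each head runs scaled dot-product attention in its own lower-dimensional subspace, letting different heads specialize in different relations between positions.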
Transformer-Based Models
Transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) have extended the Transformer architecture to create models that set new standards in NLP tasks like translation, summarization, and question answering.
Encoder
The encoder's role in the Transformer is to process the input sequence and map it into an abstract continuous representation that holds the semantic meaning of the input, which the decoder can then use.
Decoder
The decoder in the Transformer model takes the continuous representation from the encoder and generates the output sequence. It also features a stack of identical layers, but with an additional third sub-layer that performs multi-head attention over the encoder's output.
© Hypatia.Tech. 2024 All rights reserved.