Transformers in NLP
Layer Normalization
Layer normalization is a technique for stabilizing the activations of a neural network, applied inside each Transformer sub-layer. Unlike batch normalization, it normalizes across the feature dimension rather than across the batch, which is important for training deep architectures.
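A minimal NumPy sketch of the idea (the function name, shapes, and example values are illustrative, not from the card):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's features to zero mean / unit variance,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])          # one token, four features
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

Note that the statistics are computed per position over the features, so the result is independent of what other examples are in the batch.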
Self-Attention
Self-Attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. It allows the model to integrate information from different parts of the sequence.
Scaled Dot-Product Attention
This attention function computes the dot products of the query with all keys, divides each by √d_k (the square root of the key dimension), and applies a softmax function to obtain the weights on the values. It is used in the Transformer's attention mechanisms.
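The formula can be sketched directly in NumPy; the random inputs below are only to exercise the function:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # scaling keeps softmax out of saturated regions
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so the output for each query is a convex combination of the value vectors.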
Query, Key, and Value Vectors
In the attention mechanism, the input is transformed into query, key, and value vectors using learned weights. These vectors are used in computing the attention scores, reflecting how much attention is paid to other tokens when producing a representation of the current token.
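A sketch of the projections (in practice the weight matrices are learned; here they are random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X = rng.normal(size=(5, d_model))         # 5 tokens, each a d_model-dim embedding
W_q = rng.normal(size=(d_model, d_k))     # learned in practice; random here
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_q, X @ W_k, X @ W_v       # per-token query, key, value vectors
```

Each token gets its own query, key, and value; comparing a token's query against every token's key yields its attention scores.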
Feed-Forward Neural Networks in Transformers
The feed-forward neural networks in the Transformer model are fully connected layers applied to each position separately and identically. They consist of two linear transformations with a ReLU activation in between.
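The two-linear-layers-with-ReLU structure can be sketched as follows (dimensions and random weights are assumptions for illustration):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between,
    # applied identically and independently at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                       # inner layer is typically wider
x = rng.normal(size=(5, d_model))           # 5 positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)
```

Because the same weights are applied at each position with no mixing across positions, running the network on one position alone gives the same result as that position's row of the full output.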
Positional Encoding
Positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks to provide a notion of token order, which the model would otherwise lack due to its non-recurrent nature.
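A sketch of the sinusoidal variant, where even indices use sine and odd indices use cosine at geometrically spaced frequencies:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]         # frequency index
    angles = pos / (10000 ** (2 * i / d_model))  # wavelengths from 2*pi to 10000*2*pi
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(10, 8)
```

The resulting matrix is simply added to the token embeddings, giving each position a distinct, bounded signature.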
Transformer Model
The Transformer is an architecture for transforming one sequence into another using two components, an encoder and a decoder, without any recurrence. It relies on self-attention mechanisms to weigh the importance of different parts of the input data.
Attention Mechanism
Attention mechanisms in NLP allow models to focus on specific parts of the input sequence when generating a particular part of the output, improving the ability to remember long-range dependencies. They are key to transformer architectures.
Cross-Attention
Cross-Attention is an attention mechanism where the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. It helps the decoder focus on relevant parts of the input sequence.
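A sketch showing the asymmetry: queries are projected from decoder states, while keys and values are projected from encoder outputs (shapes and random weights are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, W_q, W_k, W_v):
    Q = decoder_states @ W_q        # queries come from the decoder
    K = encoder_outputs @ W_k       # keys come from the encoder output
    V = encoder_outputs @ W_v       # values come from the encoder output
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
dec = rng.normal(size=(4, 8))       # 4 decoder positions
enc = rng.normal(size=(6, 8))       # 6 encoder positions (source can differ in length)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = cross_attention(dec, enc, W_q, W_k, W_v)
```

The output has one row per decoder position, each a weighted summary of the encoder's representation of the source sequence.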
Transformer Fine-Tuning
Transformer fine-tuning involves adjusting a pre-trained Transformer model on a specific task by training it further on a smaller dataset relevant to the task. This is crucial for adapting the general capabilities of Transformer models to more specialized tasks.
Masked Self-Attention
Masked self-attention in the Transformer's decoder ensures that the prediction for a certain position does not depend on the future tokens in the sequence. A mask is applied to the input of the softmax step of the self-attention to prevent future positions from being accessed.
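The mask can be sketched as setting future positions' scores to negative infinity before the softmax, so they receive zero weight (the uniform scores here are just for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_scores(scores):
    # Entries above the diagonal (j > i) are set to -inf,
    # so softmax assigns them exactly zero weight.
    seq_len = scores.shape[-1]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(mask, -np.inf, scores)

scores = np.zeros((3, 3))                 # uniform scores, for illustration
weights = softmax(masked_scores(scores))
# Position 0 attends only to itself; position 1 to positions 0-1; and so on.
```

With uniform scores, row i spreads its weight evenly over positions 0..i, confirming that no weight leaks to future tokens.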
Multi-Head Attention
Multi-head attention consists of several attention layers running in parallel. This enables the model to jointly attend to information from different representation subspaces at different positions, capturing various aspects of information.
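A compact sketch of the split-attend-concatenate pattern (head count, dimensions, and random weights are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into heads: (num_heads, seq_len, d_k).
    split = lambda M: M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)  # attention per head
    heads = softmax(scores) @ Vh
    # Concatenate the heads and project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                               # 5 tokens, d_model = 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, num_heads=2, W_q=W_q, W_k=W_k, W_v=W_v, W_o=W_o)
```

Each head runs scaled dot-product attention in its own lower-dimensional subspace, letting different heads specialize in different relations between positions.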
Transformer-Based Models
Transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) have extended the Transformer architecture to create models that set new standards in NLP tasks like translation, summarization, and question answering.
Encoder
The encoder's role in the Transformer is to process the input sequence and map it into an abstract continuous representation that holds the semantic meaning of the input, which the decoder can then use.
Decoder
The decoder in the Transformer model takes the continuous representation from the encoder and generates the output sequence. It also features a stack of identical layers, but with an additional third sub-layer that performs multi-head attention over the encoder's output.
© Hypatia.Tech. 2024 All rights reserved.