Unveiling the Magic of Self-Attention: A Journey Through the Gatekeepers to the Mirror of Awareness
How the hidden state vectors pass through the first sub-layer of the transformer block, the self-attention mechanism, and the magic that happens inside that makes these vectors self-aware.
Self-attention is a pivotal mechanism in transformer models, allowing each token in a sequence to understand its context by attending to all others. It's like giving each word in a sentence the chance to look around and see how it relates to its neighbors. Self-attention begins with embedding vectors - these are the initial representations of tokens, unaware of their surroundings or the other tokens in the sequence. Let's guide these vectors on an enlightening journey through the self-attention process.
Our adventure starts with the hidden state vectors, which collectively form a matrix. This matrix, representing all tokens in the sequence, is copied into three identical matrices, each destined to meet one of the guardians of context: the three Gatekeepers.
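To keep the journey concrete, here is a minimal sketch in NumPy of that starting matrix. The sequence length, hidden size, and random values are assumptions purely for illustration; in a real model each row would come from the embedding layer (plus positional information) or from the previous transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8   # assumed toy sizes: 4 tokens, 8-dimensional hidden states

# Hidden state matrix: one row per token, each row a hidden state vector.
X = rng.standard_normal((seq_len, d_model))
print(X.shape)   # (4, 8) -- the matrix handed to the three Gatekeepers
```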
The Q Gatekeeper (Query): One of these matrices approaches the Q Gatekeeper. Here, the entire matrix undergoes a transformation through weighted multiplication, focusing on what each vector within it queries, or seeks, in the context. It's like giving each vector a new lens to view the sequence, one that highlights what it is looking for. The Q Gatekeeper imbues the matrix with a collective sense of inquiry, preparing each vector to seek connections.
The K Gatekeeper (Key): Another copy of the matrix encounters the K Gatekeeper. This gatekeeper assigns each vector within it its 'key', determining how it can be matched or compared with others. Through another weighted multiplication, the K Gatekeeper gives each vector its own identifier, the signature by which it can be recognized in relation to the whole sequence.
The V Gatekeeper (Value): The final matrix is greeted by the V Gatekeeper, who preserves and enhances the inherent values of all vectors. With its magic, a weighted multiplication that acts like a set of special lenses to focus or broaden each vector's view, this gatekeeper ensures that the essence of each word is clear and ready to contribute meaningfully to the collective understanding.
The beauty here is in the parallelism - each gatekeeper casts its spell on all vectors at once, transforming the entire group into a new form that echoes their query, key, or value.
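In code, each Gatekeeper is simply a learned weight matrix, and its spell is a single matrix multiplication applied to every token at once. The sketch below continues the toy example above, with randomly initialized weights standing in for what a trained model would have learned and an assumed projection size of 8.

```python
d_k = 8   # assumed size of the query, key, and value vectors

# One learned weight matrix per Gatekeeper (random stand-ins for trained weights).
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# One multiplication per Gatekeeper transforms the whole sequence in parallel.
Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # how each token presents itself to be matched
V = X @ W_V   # what each token has to offer
print(Q.shape, K.shape, V.shape)   # (4, 8) each
```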
Once all three matrices have been transformed by their respective Gatekeepers, they converge at the heart of this journey – the Mirror of Awareness. The Mirror will only reveal its magic once all three transformed matrices are present:
The Mirror of Awareness: Here, the matrices are not just reflected but dynamically interact. The Mirror calculates attention scores by comparing the queries (Q) with the keys (K), a dot product scaled down to keep the scores stable, which determines how much each word should attend to every other word in the sequence. It's as though the reflections in the mirror dance and interact, each vector's light adjusting based on its connections to others, revealing the true network of relationships. A softmax function then normalizes these scores into attention weights, akin to adjusting the light so these connections can be seen clearly. Finally, using these weights, the Mirror blends the values (V), creating a new representation for each word that is now aware of its context, of its relations to the other words. It's as if each word has seen itself through the eyes of every other word, gaining a profound understanding of its place within the sequence.
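Stripped of its metaphor, the Mirror of Awareness is scaled dot-product attention. Here is a minimal sketch, continuing the toy Q, K, and V from above; the function name follows the article's imagery rather than standard terminology.

```python
def mirror_of_awareness(Q, K, V):
    """Scaled dot-product attention: compare queries with keys, then blend values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row becomes a probability distribution
    return weights @ V                               # blend the values into context-aware vectors

context = mirror_of_awareness(Q, K, V)
print(context.shape)   # (4, 8): one context-aware vector per token
```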
As these vectors, now self-aware and contextually enriched, prepare to leave the Mirror of Awareness, they are met by the Normalization Escort at the exit. This escort ensures that each vector shines just right, not too dim, nor too bright, preparing them for what lies beyond. With a wish of good luck, the vectors pass through to meet the two Prisms of the Feed Forward Network, where they will further refine their understanding.
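Before the vectors move on, the Normalization Escort can also be sketched: it corresponds to layer normalization, shown here in its simplest form, without the learned gain and bias a full implementation would also apply, and operating on the context-aware vectors produced by the Mirror above.

```python
def normalization_escort(x, eps=1e-5):
    """Layer normalization: rescale each vector so it shines 'just right'."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)   # roughly zero mean and unit variance per token

normalized = normalization_escort(context)
print(normalized.mean(axis=-1).round(6))   # ~0 for every token
print(normalized.std(axis=-1).round(3))    # ~1 for every token
```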
In this journey, we've seen how self-attention transforms mere embeddings into vectors that 'know thyself' among others. The process is not just about altering data but about changing perspectives, giving each word a chance to understand its narrative role. This lays the groundwork for deeper comprehension in the transformer block, where each step from the Gatekeepers to the Mirror of Awareness, and beyond, orchestrates a symphony of understanding from what was once just a collection of isolated notes.