Understanding Transformers Part 9: Stacking Self-Attention Layers
In the previous article, we explored how the weights are shared in self-attention. Now we will see why we use these self-attention values instead of the initial positional encoding values.

Using Self-Attention Values

We now use the self-attention values in place of the original positionally encoded values. Because each self-attention value is a weighted combination of every token's encoding, it carries context from the whole sequence rather than describing a single token on its own, which is exactly what we want to feed into the next stacked layer.
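To make the stacking idea concrete, here is a minimal sketch, assuming PyTorch and a single attention head; the class name SelfAttention, the dimensions, and the variable names are illustrative and not taken from the article. It shows the first layer consuming the positionally encoded tokens and a second, stacked layer consuming the self-attention values produced by the first.

```python
# Minimal sketch (assuming PyTorch): the output of one self-attention layer
# becomes the input to the next. Names and sizes here are illustrative.
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Single-head self-attention, as built up in earlier parts of this series."""

    def __init__(self, d_model: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)  # query weights
        self.W_k = nn.Linear(d_model, d_model, bias=False)  # key weights
        self.W_v = nn.Linear(d_model, d_model, bias=False)  # value weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)  # scaled dot products
        weights = torch.softmax(scores, dim=-1)                 # attention weights
        return weights @ v                                      # self-attention values


d_model = 4
tokens_plus_position = torch.randn(3, d_model)  # 3 tokens after positional encoding

layer1 = SelfAttention(d_model)
layer2 = SelfAttention(d_model)

# The first layer takes the positionally encoded tokens as input ...
attn_values = layer1(tokens_plus_position)
# ... and the second, stacked layer takes the self-attention values,
# not the original positional encoding values.
stacked_values = layer2(attn_values)
print(stacked_values.shape)  # torch.Size([3, 4])
```

In this sketch the only thing that changes between the two layers is their input: the second layer attends over representations that already mix information from every token, which is the point of stacking.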