# Understanding the Context Vector in Attention
Attention can be understood as a three-stage pipeline that turns token relationships into a context vector. This post builds an intuition for each stage and shows minimal Python examples you can run end to end.
## The three stages of attention
| Stage | Math | Meaning |
|---|---|---|
| Query-Key | QK^T | Decide relevance |
| Softmax | normalize | Choose importance |
| Value mixing | AV | Build context |
### Intuition
- Query-Key stage: each token asks which other tokens are relevant to it.
- Softmax stage: convert similarities into probabilities (importance weights).
- Value mixing stage: combine information from relevant tokens to build the context vector.
Final formula (omitting the usual 1/sqrt(d_k) scaling for simplicity):

Context = softmax(QK^T) V
## Minimal Python example of attention
```python
import numpy as np

np.random.seed(0)  # fixed seed so the output is reproducible

# 3 tokens, embedding dimension 4
X = np.array([
    [1, 0, 1, 0],  # how
    [0, 1, 1, 0],  # are
    [1, 1, 0, 1],  # you
])

# learned projection matrices (random here for illustration)
Wq = np.random.randn(4, 2)
Wk = np.random.randn(4, 2)
Wv = np.random.randn(4, 2)

Q = X @ Wq
K = X @ Wk
V = X @ Wv

scores = Q @ K.T  # similarity of every token with every other token

def softmax(x):
    # subtract the row-wise max for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(scores)  # attention weights; each row sums to 1
context = A @ V      # context vectors

print("Context vectors")
print(context)
```
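The example above matches the formula exactly; production attention additionally divides the scores by sqrt(d_k) before the softmax so that large dot products do not saturate it. A minimal sketch of the scaled variant (random data, no training, names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_attention(Q, K, V):
    # divide scores by sqrt(d_k) so their variance stays roughly constant
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    A = softmax(scores)  # rows sum to 1
    return A @ V, A

X = rng.standard_normal((3, 4))  # 3 tokens, embedding dim 4
Wq, Wk, Wv = (rng.standard_normal((4, 2)) for _ in range(3))
context, A = scaled_attention(X @ Wq, X @ Wk, X @ Wv)
print(context.shape)   # (3, 2)
print(A.sum(axis=-1))  # each row sums to 1
```

The scaling changes nothing about the three-stage structure; it only keeps the softmax in a well-behaved regime.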
## Why context vectors are powerful
The attention mechanism behaves like three classical ML ideas combined: nearest neighbors, kernel machines, and memory retrieval. Each view highlights a different intuition for why the context vector is so effective.
### Attention as nearest neighbors
Attention looks for tokens with similar keys, which is conceptually similar to KNN.
#### Idea

- Query = search vector
- Keys = stored points
Similarity is just a dot product:

similarity = q · k
The highest score means the closest neighbor in semantic space.
#### Python example

```python
import numpy as np

query = np.array([1, 2])
keys = np.array([
    [1, 2],
    [2, 0],
    [-1, 3],
])

scores = keys @ query  # dot-product similarity with each stored key
print(scores)  # [5 2 5]
```
Difference from KNN:
- KNN picks the top k neighbors.
- Attention takes a weighted average of all neighbors.
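The contrast is easy to make concrete: hard top-1 selection versus a soft average over all neighbors. A sketch with made-up vectors and values:

```python
import numpy as np

query = np.array([1.0, 2.0])
keys = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [-1.0, 3.0],
])
values = np.array([[10.0], [20.0], [30.0]])

scores = keys @ query  # [5. 2. 5.]

# KNN with k=1: pick the single best neighbor (hard selection)
best = np.argmax(scores)
knn_result = values[best]

# Attention: softly average ALL neighbors by normalized score
weights = np.exp(scores) / np.exp(scores).sum()
attn_result = weights @ values

print(knn_result, attn_result)
```

Here KNN returns exactly one stored value, while attention blends the two equally close neighbors and largely ignores the distant one.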
### Attention as kernel machines

Kernel methods measure similarity in a feature space. The simplest example is the linear (dot-product) kernel:

K(x, y) = x^T y
Attention uses the same idea:
similarity = q^T k
But instead of a fixed kernel, the model learns projection matrices:
q = x Wq
k = x Wk
That means the model learns its own kernel function.
#### Python kernel example

```python
import numpy as np

def kernel(x, y):
    # linear (dot-product) kernel
    return np.dot(x, y)

a = np.array([1, 2])
b = np.array([2, 3])
print(kernel(a, b))  # 8
```
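To see the "learned kernel" idea in action, we can compare the raw dot product with the similarity computed after query/key projections. The matrices below are randomly initialized placeholders; training would tune Wq and Wk to shape this kernel:

```python
import numpy as np

rng = np.random.default_rng(42)

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 3.0])

# fixed linear kernel
fixed = x1 @ x2  # 8.0

# learned kernel: project into query/key space first, then dot
Wq = rng.standard_normal((2, 2))
Wk = rng.standard_normal((2, 2))
learned = (x1 @ Wq) @ (x2 @ Wk)

print(fixed, learned)
```

The fixed kernel always returns the same similarity for these inputs; the learned one depends entirely on Wq and Wk, which is what lets the model decide what "similar" means.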
### Attention as memory retrieval
Another perspective is memory lookup:
- Keys = memory addresses
- Values = stored information
- Query = what we are looking for
Steps:
- Compare query with memory keys.
- Compute weights.
- Retrieve weighted values.
#### Python memory example

```python
import numpy as np

keys = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
])
values = np.array([
    [10],
    [20],
    [30],
])
query = np.array([1, 1])

scores = keys @ query            # compare query with every key: [1 1 2]
weights = scores / scores.sum()  # normalize into importance weights
retrieved = weights @ values     # weighted mix of stored values
print(retrieved)  # [22.5]
```
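The linear normalization above mixes memories gently; swapping it for a softmax, as real attention does, sharpens retrieval toward the best-matching key. A sketch on the same data:

```python
import numpy as np

keys = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([1.0, 1.0])

scores = keys @ query  # [1. 1. 2.]

linear = (scores / scores.sum()) @ values                 # gentle mixing
soft = (np.exp(scores) / np.exp(scores).sum()) @ values   # sharper mixing

print(linear, soft)  # softmax pulls the result closer to the value 30
```

Because the softmax exponentiates scores, the key that matches best receives disproportionately more weight, so retrieval behaves more like an address lookup and less like a plain average.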
## Big picture
Attention combines three ideas into one pipeline:
| Concept | Role |
|---|---|
| Nearest neighbors | Find relevant tokens |
| Kernel machines | Compute similarity |
| Memory retrieval | Fetch useful information |
Transformers scale this idea massively by comparing every token with every other token, which is why attention has O(n^2) complexity in the number of tokens.
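The quadratic cost shows up directly in the shape of the score matrix: for n tokens it holds one entry per token pair, so n x n entries in total. A quick sketch with random embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (8, 64, 512):
    Q = rng.standard_normal((n, 16))
    K = rng.standard_normal((n, 16))
    scores = Q @ K.T
    # one similarity per token pair: memory and compute grow as n^2
    print(n, scores.shape, scores.size)
```

Doubling the sequence length quadruples the score matrix, which is why long-context attention is an active research area.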
## Final mental model
Attention works like:
Search -> Weight -> Retrieve
or
QK^T -> Softmax -> AV
Meaning:
- Decide where to look.
- Decide how much to trust each source.
- Combine the information.
The output of this pipeline is the context vector.
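The whole mental model fits in one short function. A sketch of the same unscaled pipeline, with random inputs standing in for real embeddings:

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Search -> Weight -> Retrieve in three steps."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T                                    # search: QK^T
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)               # weight: softmax
    return A @ V                                        # retrieve: AV

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))       # 3 tokens, embedding dim 4
Ws = [rng.standard_normal((4, 2)) for _ in range(3)]
context = attention(X, *Ws)
print(context.shape)  # (3, 2): one context vector per token
```

Each row of the result is one token's context vector: a mixture of every token's value, weighted by how relevant the model judged it to be.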