# Understanding the Context Vector in Attention
Attention can be understood as a three-stage pipeline that turns token relationships into a context vector. This post builds an intuition for each stage and shows minimal Python examples you can run end to end.
## The three stages of attention
| Stage | Math | Meaning |
|---|---|---|
| Query-Key | QK^T | Decide relevance |
| Softmax | normalize | Choose importance |
| Value mixing | AV | Build context |
### Intuition
- Query-Key stage: each token asks which other tokens are relevant to it.
- Softmax stage: convert similarities into probabilities (importance weights).
- Value mixing stage: combine information from relevant tokens to build the context vector.
Final formula (omitting the usual 1/sqrt(d_k) scaling for simplicity):

Context = softmax(QK^T) V
## Minimal Python example of attention
```python
import numpy as np

np.random.seed(0)  # fixed seed so the output is reproducible

# 3 tokens, embedding dimension 4
X = np.array([
    [1, 0, 1, 0],  # how
    [0, 1, 1, 0],  # are
    [1, 1, 0, 1],  # you
])

# learned projection matrices (random here for illustration)
Wq = np.random.randn(4, 2)
Wk = np.random.randn(4, 2)
Wv = np.random.randn(4, 2)

Q = X @ Wq
K = X @ Wk
V = X @ Wv

scores = Q @ K.T  # similarity of every token with every other token

def softmax(x):
    # subtract the row-wise max for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(scores)  # attention weights; each row sums to 1
context = A @ V      # context vectors

print("Context vectors")
print(context)
```
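The example above matches the formula exactly; production attention additionally divides the scores by sqrt(d_k) before the softmax so that large dot products do not saturate it. A minimal sketch of the scaled variant (random data, no training, names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_attention(Q, K, V):
    # divide scores by sqrt(d_k) so their variance stays roughly constant
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    A = softmax(scores)  # rows sum to 1
    return A @ V, A

X = rng.standard_normal((3, 4))  # 3 tokens, embedding dim 4
Wq, Wk, Wv = (rng.standard_normal((4, 2)) for _ in range(3))
context, A = scaled_attention(X @ Wq, X @ Wk, X @ Wv)
print(context.shape)   # (3, 2)
print(A.sum(axis=-1))  # each row sums to 1
```

The scaling changes nothing about the three-stage structure; it only keeps the softmax in a well-behaved regime.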
## Why context vectors are powerful
The attention mechanism behaves like three classical ML ideas combined: nearest neighbors, kernel machines, and memory retrieval. Each view highlights a different intuition for why the context vector is so effective.
### Attention as nearest neighbors
Attention looks for tokens with similar keys, which is conceptually similar to KNN.
#### Idea

- Query = search vector
- Keys = stored points
Similarity is just a dot product:

similarity = q · k
The highest score means the closest neighbor in semantic space.
#### Python example

```python
import numpy as np

query = np.array([1, 2])
keys = np.array([
    [1, 2],
    [2, 0],
    [-1, 3],
])

scores = keys @ query  # dot-product similarity with each stored key
print(scores)  # [5 2 5]
```
Difference from KNN:
- KNN picks the top k neighbors.
- Attention takes a weighted average of all neighbors.
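The contrast is easy to make concrete: hard top-1 selection versus a soft average over all neighbors. A sketch with made-up vectors and values:

```python
import numpy as np

query = np.array([1.0, 2.0])
keys = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [-1.0, 3.0],
])
values = np.array([[10.0], [20.0], [30.0]])

scores = keys @ query  # [5. 2. 5.]

# KNN with k=1: pick the single best neighbor (hard selection)
best = np.argmax(scores)
knn_result = values[best]

# Attention: softly average ALL neighbors by normalized score
weights = np.exp(scores) / np.exp(scores).sum()
attn_result = weights @ values

print(knn_result, attn_result)
```

Here KNN returns exactly one stored value, while attention blends the two equally close neighbors and largely ignores the distant one.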
### Attention as kernel machines

Kernel methods measure similarity in a feature space. The simplest example is the linear (dot-product) kernel:

K(x, y) = x^T y
Attention uses the same idea:
similarity = q^T k
But instead of a fixed kernel, the model learns projection matrices:
q = x Wq
k = x Wk
That means the model learns its own kernel function.
#### Python kernel example

```python
import numpy as np

def kernel(x, y):
    # linear (dot-product) kernel
    return np.dot(x, y)

a = np.array([1, 2])
b = np.array([2, 3])
print(kernel(a, b))  # 8
```
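To see the "learned kernel" idea in action, we can compare the raw dot product with the similarity computed after query/key projections. The matrices below are randomly initialized placeholders; training would tune Wq and Wk to shape this kernel:

```python
import numpy as np

rng = np.random.default_rng(42)

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 3.0])

# fixed linear kernel
fixed = x1 @ x2  # 8.0

# learned kernel: project into query/key space first, then dot
Wq = rng.standard_normal((2, 2))
Wk = rng.standard_normal((2, 2))
learned = (x1 @ Wq) @ (x2 @ Wk)

print(fixed, learned)
```

The fixed kernel always returns the same similarity for these inputs; the learned one depends entirely on Wq and Wk, which is what lets the model decide what "similar" means.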
### Attention as memory retrieval
Another perspective is memory lookup:
- Keys = memory addresses
- Values = stored information
- Query = what we are looking for
Steps:
- Compare query with memory keys.
- Compute weights.
- Retrieve weighted values.
#### Python memory example

```python
import numpy as np

keys = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
])
values = np.array([
    [10],
    [20],
    [30],
])
query = np.array([1, 1])

scores = keys @ query            # compare query with every key: [1 1 2]
weights = scores / scores.sum()  # normalize into importance weights
retrieved = weights @ values     # weighted mix of stored values
print(retrieved)  # [22.5]
```
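The linear normalization above mixes memories gently; swapping it for a softmax, as real attention does, sharpens retrieval toward the best-matching key. A sketch on the same data:

```python
import numpy as np

keys = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([1.0, 1.0])

scores = keys @ query  # [1. 1. 2.]

linear = (scores / scores.sum()) @ values                 # gentle mixing
soft = (np.exp(scores) / np.exp(scores).sum()) @ values   # sharper mixing

print(linear, soft)  # softmax pulls the result closer to the value 30
```

Because the softmax exponentiates scores, the key that matches best receives disproportionately more weight, so retrieval behaves more like an address lookup and less like a plain average.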
## Big picture
Attention combines three ideas into one pipeline:
| Concept | Role |
|---|---|
| Nearest neighbors | Find relevant tokens |
| Kernel machines | Compute similarity |
| Memory retrieval | Fetch useful information |
Transformers scale this idea massively by comparing every token with every other token, which is why attention has O(n^2) complexity in the number of tokens.
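The quadratic cost shows up directly in the shape of the score matrix: for n tokens it holds one entry per token pair, so n x n entries in total. A quick sketch with random embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (8, 64, 512):
    Q = rng.standard_normal((n, 16))
    K = rng.standard_normal((n, 16))
    scores = Q @ K.T
    # one similarity per token pair: memory and compute grow as n^2
    print(n, scores.shape, scores.size)
```

Doubling the sequence length quadruples the score matrix, which is why long-context attention is an active research area.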
## Final mental model
Attention works like:
Search -> Weight -> Retrieve
or
QK^T -> Softmax -> AV
Meaning:
- Decide where to look.
- Decide how much to trust each source.
- Combine the information.
The output of this pipeline is the context vector.
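The whole mental model fits in one short function. A sketch of the same unscaled pipeline, with random inputs standing in for real embeddings:

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Search -> Weight -> Retrieve in three steps."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T                                    # search: QK^T
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)               # weight: softmax
    return A @ V                                        # retrieve: AV

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))       # 3 tokens, embedding dim 4
Ws = [rng.standard_normal((4, 2)) for _ in range(3)]
context = attention(X, *Ws)
print(context.shape)  # (3, 2): one context vector per token
```

Each row of the result is one token's context vector: a mixture of every token's value, weighted by how relevant the model judged it to be.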