Mar 12, 2026

Understanding the Context Vector in Attention

Attention can be understood as a three-stage pipeline that turns token relationships into a context vector. This post builds an intuition for each stage and shows minimal Python examples you can run end to end.

The three stages of attention

Stage          Math        Meaning
Query-Key      QK^T        Decide relevance
Softmax        normalize   Choose importance
Value mixing   AV          Build context

Intuition

  1. Query-Key stage: each token asks which other tokens are relevant to it.
  2. Softmax stage: convert similarities into probabilities (importance weights).
  3. Value mixing stage: combine information from relevant tokens to build the context vector.

Final formula:

Context = softmax(QK^T) V

(In practice the scores are scaled by 1 / sqrt(d_k) before the softmax to keep the values numerically stable; the unscaled form above is easier to read.)

Minimal Python example of attention

import numpy as np

np.random.seed(0)  # reproducible random projections

# 3 tokens, embedding dimension 4
X = np.array([
    [1., 0., 1., 0.],  # how
    [0., 1., 1., 0.],  # are
    [1., 1., 0., 1.]   # you
])

# projection matrices (random stand-ins for learned weights)
Wq = np.random.randn(4, 2)
Wk = np.random.randn(4, 2)
Wv = np.random.randn(4, 2)

Q = X @ Wq  # queries
K = X @ Wk  # keys
V = X @ Wv  # values

# stage 1: query-key similarities, scaled for numerical stability
d_k = K.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)

# stage 2: row-wise softmax turns scores into importance weights
def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(scores)

# stage 3: mix the values to build one context vector per token
context = A @ V

print("Context vectors")
print(context)

Why context vectors are powerful

The attention mechanism behaves like three classical ML ideas combined: nearest neighbors, kernel machines, and memory retrieval. Each view highlights a different intuition for why the context vector is so effective.

Attention as nearest neighbors

Attention looks for tokens whose keys are similar to the query, which is conceptually similar to k-nearest neighbors (KNN).

Idea

Query = search vector
Keys = stored points

Similarity:

similarity = q · k

The highest score means the closest neighbor in semantic space.

Python example

import numpy as np

query = np.array([1, 2])

keys = np.array([
    [1, 2],
    [2, 0],
    [-1, 3]
])

# dot-product similarity of the query against every stored key
scores = keys @ query
print(scores)  # [5 2 5]

Difference from KNN:

  • KNN picks the top k neighbors.
  • Attention takes a weighted average of all neighbors.
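The contrast is easy to see in code. This toy sketch (the values attached to each key are made up for illustration) puts hard top-1 selection next to attention's soft weighted average:

```python
import numpy as np

query = np.array([1.0, 2.0])

keys = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [-1.0, 3.0]
])

# stored values associated with each key (illustrative)
values = np.array([10.0, 20.0, 30.0])

scores = keys @ query  # dot-product similarity: [5, 2, 5]

# KNN (k=1): hard selection -- keep only the single best match
knn_result = values[np.argmax(scores)]  # 10.0

# attention: soft selection -- weighted average over ALL neighbors
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attention_result = weights @ values

print(knn_result)
print(attention_result)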

Attention as kernel machines

Kernel methods measure similarity in a feature space. The classic dot product kernel is:

K(x, y) = x^T y

Attention uses the same idea:

similarity = q^T k

But instead of a fixed kernel, the model learns projection matrices:

q = x Wq
k = x Wk

That means the model learns its own kernel function.

Python kernel example

import numpy as np

def kernel(x, y):
    # linear kernel: similarity is just the dot product
    return np.dot(x, y)

a = np.array([1, 2])
b = np.array([2, 3])

print(kernel(a, b))  # 8
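To make the "learned kernel" idea concrete, here is a sketch where projection matrices (random stand-ins for trained weights; the name `learned_kernel` is mine) reshape the similarity function:

```python
import numpy as np

rng = np.random.default_rng(42)

# random stand-ins for trained projection matrices Wq, Wk
Wq = rng.normal(size=(2, 2))
Wk = rng.normal(size=(2, 2))

def learned_kernel(x, y):
    # similarity measured in a learned feature space:
    # K(x, y) = (x Wq) . (y Wk)
    return (x @ Wq) @ (y @ Wk)

a = np.array([1.0, 2.0])
b = np.array([2.0, 3.0])

print(np.dot(a, b))          # fixed kernel: always 8.0
print(learned_kernel(a, b))  # changes whenever Wq or Wk change
```

Training the projections amounts to tuning which notions of "similar" the model pays attention to.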

Attention as memory retrieval

Another perspective is memory lookup:

Keys = memory addresses
Values = stored information
Query = what we are looking for

Steps:

  1. Compare query with memory keys.
  2. Compute weights.
  3. Retrieve weighted values.

Python memory example

import numpy as np

# memory: keys are the addresses, values are the stored content
keys = np.array([
    [1, 0],
    [0, 1],
    [1, 1]
])

values = np.array([
    [10],
    [20],
    [30]
])

query = np.array([1, 1])

# step 1: compare the query against every key
scores = keys @ query

# step 2: normalize scores into weights (a simple stand-in for softmax)
weights = scores / scores.sum()

# step 3: retrieve a weighted blend of the stored values
retrieved = weights @ values

print(retrieved)  # [22.5]
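Dividing by the score sum only works because every score here is positive; with negative scores the weights stop being a valid distribution. Real attention uses a softmax instead, which keeps all weights positive. A sketch with the same memory:

```python
import numpy as np

keys = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0]
])

values = np.array([
    [10.0],
    [20.0],
    [30.0]
])

query = np.array([1.0, 1.0])

scores = keys @ query  # safe even if some scores were negative

# softmax: exponentiate (shifted for stability), then normalize
weights = np.exp(scores - scores.max())
weights /= weights.sum()

retrieved = weights @ values
print(retrieved)
```

The exponential also sharpens the distribution, so the best-matching slot (here the key [1, 1]) dominates the retrieved value more than under plain normalization.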

Big picture

Attention combines three ideas into one pipeline:

Concept             Role
Nearest neighbors   Find relevant tokens
Kernel machines     Compute similarity
Memory retrieval    Fetch useful information

Transformers scale this idea massively by comparing every token with every other token, which is why attention has O(n^2) complexity in the number of tokens.
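The quadratic cost is visible directly in the shapes: the score matrix itself has n × n entries. A quick sketch (the embedding dimension d is arbitrary):

```python
import numpy as np

d = 8  # embedding dimension (arbitrary choice for illustration)

for n in [4, 8, 16]:
    X = np.random.randn(n, d)
    scores = X @ X.T  # every token compared with every other token
    print(n, scores.shape, scores.size)  # entry count grows as n^2
```

Doubling the sequence length quadruples the number of pairwise comparisons, which is the core scaling bottleneck of vanilla attention.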

Final mental model

Attention works like:

Search -> Weight -> Retrieve

or

QK^T -> Softmax -> AV

Meaning:

  1. Decide where to look.
  2. Decide how much to trust each source.
  3. Combine the information.

The output of this pipeline is the context vector.
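The whole Search -> Weight -> Retrieve pipeline fits in a few lines. This is a minimal sketch (the function name `attention` is mine, and the 1 / sqrt(d_k) scaling follows standard practice):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # 1. decide where to look
    weights = softmax(scores)        # 2. decide how much to trust each source
    return weights @ V               # 3. combine the information

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 2))
K = rng.normal(size=(3, 2))
V = rng.normal(size=(3, 2))

context = attention(Q, K, V)
print(context.shape)  # one context vector per query token: (3, 2)
```

Every row of the output is one context vector: a blend of the value vectors, weighted by how relevant each token looked to that row's query.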


Thanks for reading! If you want to see future content, you can follow me on Twitter or connect with me on LinkedIn.


Support My Content

If you find my content helpful, consider supporting a humanitarian cause (building homes for elderly people in the rural Terai region of Nepal) that I am planning with your donation:

Ethereum (ETH)

0xB62409A5B227D2aE7D8C66fdaA5EEf4eB4E37959

Thank you for your support!