AI Math Cheat Sheet for Transformers
These are notes I wrote while trying to understand the math behind transformers and ML.
Dot Product vs Inner Product vs Outer Product
Dot Product
The dot product multiplies two vectors element-by-element and sums the results.
a dot b = a1b1 + a2b2 + ... + anbn
Requirements:
- vectors must have same length
Complexity:
O(n)
Used for:
- similarity
- attention scores
- projections
Example in transformers:
score = Q dot K
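A minimal sketch of that score for a single query-key pair. The vectors here are made-up 4-dimensional examples, not taken from a real model:

```python
import numpy as np

# Hypothetical query and key vectors for one token pair
q = np.array([1.0, 0.0, 2.0, 1.0])
k = np.array([0.5, 1.0, 1.0, 0.0])

# The attention score is just their dot product:
# sum of element-wise products, O(n) in the vector length
score = q @ k  # 1*0.5 + 0*1.0 + 2*1.0 + 1*0.0 = 2.5
```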
Inner Product
The inner product is the general mathematical concept; for standard real-valued vectors, it coincides with the dot product.
Output:
scalar (single number)
Outer Product
The outer product multiplies every element of one vector by every element of the other, producing a matrix.
If:
a length = m
b length = n
Result shape:
m x n
Example structure:
|  | b1 | b2 | b3 |
|---|---|---|---|
| a1 | a1b1 | a1b2 | a1b3 |
| a2 | a2b1 | a2b2 | a2b3 |
No dimension matching required.
Used in ML for:
- neural network gradients
- covariance matrices
- matrix factorization
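A sketch of the gradient use case. Assuming a linear layer `y = W @ x`, the gradient of the loss with respect to `W` is the outer product of the upstream gradient and the input (the names `x` and `delta` are illustrative):

```python
import numpy as np

# For y = W @ x, dL/dW = outer(dL/dy, x) -- shape (m, n), matching W
x = np.array([1.0, 2.0, 3.0])    # input, length n = 3
delta = np.array([0.1, -0.2])    # upstream gradient dL/dy, length m = 2

grad_W = np.outer(delta, x)      # shape (2, 3); no dimension matching needed
```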
Python Walkthrough (All Concepts)
```python
import numpy as np

# Dot product (same as inner product for standard vectors)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = a @ b

# Outer product
outer = np.outer(a, b)  # 3 x 3

# Matrix multiplication intuition
A = np.random.randn(3, 4)
B = np.random.randn(4, 3)
C = A @ B  # 3 x 3

# QKV attention pipeline
n, d, dk, dv = 3, 4, 4, 4
X = np.random.randn(n, d)
Wq = np.random.randn(d, dk)
Wk = np.random.randn(d, dk)
Wv = np.random.randn(d, dv)
Q = X @ Wq
K = X @ Wk
V = X @ Wv
scores = Q @ K.T

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(x)
    return exp / exp.sum(axis=-1, keepdims=True)

attn = softmax(scores)
context = attn @ V

# Why attention is O(n^2)
n = 5
comparisons = n * n
```
Matrix Multiplication Intuition
Rule:
(m x n)(n x p) -> (m x p)
The inner dimension n represents the feature space that interacts.
Example:
player stats matrix:
3 x 4
weight matrix:
4 x 3
Result:
3 x 3
Each output is a weighted combination of features.
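The player-stats example above as a quick shape check (the stats and weights are random placeholders; only the shapes matter):

```python
import numpy as np

# 3 players, 4 stats each: (m x n) = 3 x 4
stats = np.random.randn(3, 4)
# Weight matrix mapping 4 stats to 3 outputs: (n x p) = 4 x 3
weights = np.random.randn(4, 3)

# Inner dimensions (4 and 4) match, so the product is (3 x 3)
result = stats @ weights
```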
QKV Attention Pipeline
Transformers compute attention in a few steps.
- Create queries, keys, values
Q = XW_Q
K = XW_K
V = XW_V
- Compute similarity
scores = QK^T
This produces a token-to-token similarity matrix.
- Normalize
softmax(scores)
Now we get attention weights.
- Mix information
output = softmax(QK^T)V
This gives contextual embeddings.
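The four steps above as one runnable sketch, with one detail the notes omit: the standard Transformer divides the scores by sqrt(d_k) before the softmax (the "scaled" in scaled dot-product attention), which keeps scores from growing with the key dimension. Sizes and random inputs are illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, dk = 3, 4, 4
X = rng.standard_normal((n, d))               # n token embeddings
Wq, Wk, Wv = (rng.standard_normal((d, dk)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv              # step 1: queries, keys, values
weights = softmax(Q @ K.T / np.sqrt(dk))      # steps 2-3: similarity + normalize
output = weights @ V                          # step 4: mix -> contextual embeddings
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.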
Why Attention is O(n^2)
If there are n tokens, each token computes a similarity score against every token, including itself.
Similarity matrix size:
n x n
Example:
3 tokens
|  | t1 | t2 | t3 |
|---|---|---|---|
| t1 | score | score | score |
| t2 | score | score | score |
| t3 | score | score | score |
Total comparisons:
n^2
That is why attention becomes expensive for long sequences.
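A quick sanity check on the quadratic growth (sequence lengths are arbitrary examples):

```python
# Each token scores against every token, so work grows as n^2
seq_lens = [128, 256, 512, 1024]
comparisons = [n * n for n in seq_lens]

# Doubling the sequence length quadruples the comparison count
ratios = [b / a for a, b in zip(comparisons, comparisons[1:])]
```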
My AI Math Roadmap
If I want to get really strong in AI math, these are the areas to master:
Linear Algebra
- matrix multiplication
- eigenvectors
- SVD
- projections
Probability
- expectation
- variance
- Bayes rule
- likelihood
Optimization
- gradient descent
- backpropagation
- loss landscapes
Deep Learning Concepts
- CNNs
- RNNs
- Transformers
- attention
Systems Knowledge
- GPU computation
- batching
- memory layout
Personal Insight
Matrix multiplication is the backbone of deep learning.
Transformers are basically dynamic matrix multiplications where weights depend on the input.