Jun 6, 2023

AI Math Cheat Sheet for Transformers

These are notes I wrote while trying to understand the math behind transformers and ML.

Dot Product vs Inner Product vs Outer Product

Dot Product

Two vectors are multiplied element by element and the results summed, producing a single number.

a dot b = a1b1 + a2b2 + ... + anbn

Requirements:

  • vectors must have same length

Complexity:

O(n)

Used for:

  • similarity
  • attention scores
  • projections

Example in transformers:

score = Q dot K
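A minimal NumPy sketch of the definition above (the vectors are just illustrative values): multiplying element by element and summing gives the same result as `np.dot`.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Element-by-element products, then sum: a1*b1 + a2*b2 + a3*b3
manual = sum(x * y for x, y in zip(a, b))

print(manual)        # 32.0
print(np.dot(a, b))  # 32.0
```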

Inner Product

The inner product is the general mathematical concept; for ordinary real-valued vectors, the standard inner product equals the dot product.

Output:

scalar (single number)
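The equivalence for ordinary vectors can be checked directly in NumPy (values are illustrative): for 1-D arrays, `np.inner` and `np.dot` return the same scalar.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# For 1-D vectors, the inner product and the dot product coincide.
print(np.inner(a, b))  # 32.0
print(np.dot(a, b))    # 32.0
```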

Outer Product

The outer product multiplies every element of one vector with every element of the other, producing a matrix rather than a scalar.

If:

a length = m
b length = n

Result shape:

m x n

Example structure:

       b1     b2     b3
a1   a1b1   a1b2   a1b3
a2   a2b1   a2b2   a2b3

No dimension matching required.

Used in ML for:

  • neural network gradients
  • covariance matrices
  • matrix factorization
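As a sketch of the first bullet: for a simple linear layer y = W x, the gradient of the loss with respect to W is the outer product of the upstream gradient and the input, one entry per weight. The shapes and random values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(4)   # input, length 4
dy = rng.standard_normal(3)  # upstream gradient dL/dy, length 3

# dL/dW for y = W @ x is the outer product of dy and x:
# dW[i, j] = dy[i] * x[j], matching W's shape (3, 4).
dW = np.outer(dy, x)
print(dW.shape)  # (3, 4)
```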

Python Walkthrough (All Concepts)

import numpy as np

# Dot product (same as inner product for standard vectors)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = a @ b

# Outer product
outer = np.outer(a, b)  # 3 x 3

# Matrix multiplication intuition
A = np.random.randn(3, 4)
B = np.random.randn(4, 3)
C = A @ B               # 3 x 3

# QKV attention pipeline
n, d, dk, dv = 3, 4, 4, 4
X = np.random.randn(n, d)
Wq = np.random.randn(d, dk)
Wk = np.random.randn(d, dk)
Wv = np.random.randn(d, dv)

Q = X @ Wq
K = X @ Wk
V = X @ Wv

# Scale by sqrt(dk) so the softmax stays numerically well-behaved
# (this is the standard "scaled dot-product attention")
scores = Q @ K.T / np.sqrt(dk)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    exp = np.exp(x)
    return exp / exp.sum(axis=-1, keepdims=True)

attn = softmax(scores)
context = attn @ V

# Why attention is O(n^2)
n = 5
comparisons = n * n

Matrix Multiplication Intuition

Rule:

(m x n)(n x p) -> (m x p)

The inner dimension n represents the feature space that interacts.

Example:

player stats matrix:

3 x 4

weight matrix:

4 x 3

Result:

3 x 3

Each output is a weighted combination of features.
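The player-stats example can be checked directly in NumPy (the numbers are made up): each output entry is the dot product of one stats row with one weight column.

```python
import numpy as np

stats = np.arange(12.0).reshape(3, 4)  # 3 players x 4 features
weights = np.ones((4, 3))              # 4 features x 3 output scores

result = stats @ weights               # (3 x 4)(4 x 3) -> (3 x 3)
print(result.shape)  # (3, 3)

# Entry [0, 0] is the dot product of stats row 0 with weights column 0.
assert result[0, 0] == stats[0] @ weights[:, 0]
```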

QKV Attention Pipeline

Transformers compute attention in a few steps.

  1. Create queries, keys, values

Q = XW_Q
K = XW_K
V = XW_V

  2. Compute similarity

scores = QK^T

This produces a token-to-token similarity matrix.

  3. Normalize

softmax(scores)

Now we get attention weights.

  4. Mix information

output = softmax(QK^T)V

This gives contextual embeddings.
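One sanity check on step 3: after the row-wise softmax, every row of the attention matrix sums to 1, which is exactly why step 4 is a weighted average of value vectors. A sketch with random scores:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal((3, 3))  # token-to-token similarities

# Row-wise softmax: subtract the row max first for numerical stability.
shifted = scores - scores.max(axis=-1, keepdims=True)
attn = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(np.allclose(attn.sum(axis=-1), 1.0))  # True
```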

Why Attention is O(n^2)

If there are n tokens, each token compares with every other token.

Similarity matrix size:

n x n

Example:

3 tokens

       t1      t2      t3
t1   score   score   score
t2   score   score   score
t3   score   score   score

Total comparisons:

n^2

That is why attention becomes expensive for long sequences.
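The quadratic growth is easy to see numerically (the sequence lengths below are arbitrary): doubling the number of tokens quadruples the number of entries in the score matrix.

```python
# Number of pairwise comparisons = entries in the n x n score matrix
for n in [512, 1024, 2048]:
    print(n, n * n)

# Doubling n multiplies the comparison count by 4.
```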

My AI Math Roadmap

If I want to get really strong in AI math, these are the areas to master:

Linear Algebra

  • matrix multiplication
  • eigenvectors
  • SVD
  • projections

Probability

  • expectation
  • variance
  • Bayes rule
  • likelihood

Optimization

  • gradient descent
  • backpropagation
  • loss landscapes

Deep Learning Concepts

  • CNNs
  • RNNs
  • Transformers
  • attention

Systems Knowledge

  • GPU computation
  • batching
  • memory layout

Personal Insight

Matrix multiplication is the backbone of deep learning.

Transformers are basically dynamic matrix multiplications where weights depend on the input.


Thanks for reading! If you want to see future content, you can follow me on Twitter or connect with me on LinkedIn.


Support My Content

If you find my content helpful, consider supporting a humanitarian cause (building homes for elderly people in rural Terai region of Nepal) that I am planning with your donation:

Ethereum (ETH)

0xB62409A5B227D2aE7D8C66fdaA5EEf4eB4E37959

Thank you for your support!