Jun 6, 2023

AI Math Cheat Sheet for Transformers

These are notes I wrote while trying to understand the math behind transformers and ML.

Dot Product vs Inner Product vs Outer Product

Dot Product

Two vectors are multiplied element by element and the results summed, producing a single number.

a dot b = a1b1 + a2b2 + ... + anbn

Requirements:

  • vectors must have same length

Complexity:

O(n)

Used for:

  • similarity
  • attention scores
  • projections

Example in transformers:

score = Q dot K
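A minimal NumPy sketch of the definition above (the vectors are just illustrative values): multiplying element by element and summing gives the same result as `np.dot`.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Element-by-element products, then sum: a1*b1 + a2*b2 + a3*b3
manual = sum(x * y for x, y in zip(a, b))

print(manual)        # 32.0
print(np.dot(a, b))  # 32.0
```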

Inner Product

The inner product is the general mathematical concept; for ordinary real-valued vectors, the standard inner product equals the dot product.

Output:

scalar (single number)
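The equivalence for ordinary vectors can be checked directly in NumPy (values are illustrative): for 1-D arrays, `np.inner` and `np.dot` return the same scalar.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# For 1-D vectors, the inner product and the dot product coincide.
print(np.inner(a, b))  # 32.0
print(np.dot(a, b))    # 32.0
```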

Outer Product

The outer product multiplies every element of one vector with every element of the other, producing a matrix rather than a scalar.

If:

a length = m
b length = n

Result shape:

m x n

Example structure:

       b1     b2     b3
a1   a1b1   a1b2   a1b3
a2   a2b1   a2b2   a2b3

No dimension matching required.

Used in ML for:

  • neural network gradients
  • covariance matrices
  • matrix factorization
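As a sketch of the first bullet: for a simple linear layer y = W x, the gradient of the loss with respect to W is the outer product of the upstream gradient and the input, one entry per weight. The shapes and random values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(4)   # input, length 4
dy = rng.standard_normal(3)  # upstream gradient dL/dy, length 3

# dL/dW for y = W @ x is the outer product of dy and x:
# dW[i, j] = dy[i] * x[j], matching W's shape (3, 4).
dW = np.outer(dy, x)
print(dW.shape)  # (3, 4)
```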

Python Walkthrough (All Concepts)

import numpy as np

# Dot product (same as inner product for standard vectors)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = a @ b

# Outer product
outer = np.outer(a, b)  # 3 x 3

# Matrix multiplication intuition
A = np.random.randn(3, 4)
B = np.random.randn(4, 3)
C = A @ B               # 3 x 3

# QKV attention pipeline
n, d, dk, dv = 3, 4, 4, 4
X = np.random.randn(n, d)
Wq = np.random.randn(d, dk)
Wk = np.random.randn(d, dk)
Wv = np.random.randn(d, dv)

Q = X @ Wq
K = X @ Wk
V = X @ Wv

# Scale by sqrt(dk) so the softmax stays numerically well-behaved
# (this is the standard "scaled dot-product attention")
scores = Q @ K.T / np.sqrt(dk)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    exp = np.exp(x)
    return exp / exp.sum(axis=-1, keepdims=True)

attn = softmax(scores)
context = attn @ V

# Why attention is O(n^2)
n = 5
comparisons = n * n

Matrix Multiplication Intuition

Rule:

(m x n)(n x p) -> (m x p)

The inner dimension n represents the feature space that interacts.

Example:

player stats matrix:

3 x 4

weight matrix:

4 x 3

Result:

3 x 3

Each output is a weighted combination of features.
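The player-stats example can be checked directly in NumPy (the numbers are made up): each output entry is the dot product of one stats row with one weight column.

```python
import numpy as np

stats = np.arange(12.0).reshape(3, 4)  # 3 players x 4 features
weights = np.ones((4, 3))              # 4 features x 3 output scores

result = stats @ weights               # (3 x 4)(4 x 3) -> (3 x 3)
print(result.shape)  # (3, 3)

# Entry [0, 0] is the dot product of stats row 0 with weights column 0.
assert result[0, 0] == stats[0] @ weights[:, 0]
```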

QKV Attention Pipeline

Transformers compute attention in a few steps.

  1. Create queries, keys, values

Q = XW_Q
K = XW_K
V = XW_V

  2. Compute similarity

scores = QK^T

This produces a token-to-token similarity matrix.

  3. Normalize

softmax(scores)

Now we get attention weights.

  4. Mix information

output = softmax(QK^T)V

This gives contextual embeddings.
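One sanity check on step 3: after the row-wise softmax, every row of the attention matrix sums to 1, which is exactly why step 4 is a weighted average of value vectors. A sketch with random scores:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal((3, 3))  # token-to-token similarities

# Row-wise softmax: subtract the row max first for numerical stability.
shifted = scores - scores.max(axis=-1, keepdims=True)
attn = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(np.allclose(attn.sum(axis=-1), 1.0))  # True
```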

Why Attention is O(n^2)

If there are n tokens, each token compares with every other token.

Similarity matrix size:

n x n

Example:

3 tokens

       t1      t2      t3
t1   score   score   score
t2   score   score   score
t3   score   score   score

Total comparisons:

n^2

That is why attention becomes expensive for long sequences.
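The quadratic growth is easy to see numerically (the sequence lengths below are arbitrary): doubling the number of tokens quadruples the number of entries in the score matrix.

```python
# Number of pairwise comparisons = entries in the n x n score matrix
for n in [512, 1024, 2048]:
    print(n, n * n)

# Doubling n multiplies the comparison count by 4.
```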

My AI Math Roadmap

If I want to get really strong in AI math, these are the areas to master:

Linear Algebra

  • matrix multiplication
  • eigenvectors
  • SVD
  • projections

Probability

  • expectation
  • variance
  • Bayes rule
  • likelihood

Optimization

  • gradient descent
  • backpropagation
  • loss landscapes

Deep Learning Concepts

  • CNNs
  • RNNs
  • Transformers
  • attention

Systems Knowledge

  • GPU computation
  • batching
  • memory layout

Personal Insight

Matrix multiplication is the backbone of deep learning.

Transformers are basically dynamic matrix multiplications where weights depend on the input.


Thanks for reading! If you want to see future content, you can follow me on Twitter or connect with me on LinkedIn.


Support My Content

If you find my content helpful, consider supporting a humanitarian cause (building homes for elderly people in rural Terai region of Nepal) that I am planning with your donation:

Ethereum (ETH)

0xB62409A5B227D2aE7D8C66fdaA5EEf4eB4E37959

Thank you for your support!