May 12, 2023

Matrix Multiplication for AI: Notes and Intuition

While learning AI and transformers, I realized something interesting: most AI systems are basically doing vector-by-matrix and matrix-by-matrix multiplications over and over again. If you understand matrix multiplication deeply, a lot of ML suddenly becomes much easier.

Why Matrix Multiplication Feels Core to AI

Examples I keep seeing everywhere:

AI System             Formula
--------------------  -------------------
Linear regression     y = Xw
Neural network layer  y = Wx + b
Transformer queries   Q = XW_Q
Transformer keys      K = XW_K
Transformer values    V = XW_V
Attention scores      QK^T
Context mixing        softmax(QK^T)V

So most deep learning is basically:

matrix multiplication + nonlinear activation.
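As a minimal sketch of that claim (sizes and names are my own illustration, not from any particular architecture), a two-layer network really is just matmul + nonlinearity, repeated:

```python
import numpy as np

def relu(z):
    # elementwise nonlinearity: max(z, 0)
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))    # input: 3 features
W1 = rng.standard_normal((5, 3))   # layer 1 weights: 3 -> 5
b1 = rng.standard_normal((5, 1))
W2 = rng.standard_normal((2, 5))   # layer 2 weights: 5 -> 2
b2 = rng.standard_normal((2, 1))

h = relu(W1 @ x + b1)  # matrix multiplication + nonlinear activation
y = W2 @ h + b2        # another matmul (no activation on the output here)
```

Stacking more layers just repeats the same two-step pattern with new weight matrices.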

What Matrix Multiplication Really Means (My Intuition)

I try to think of matrix multiplication in a few ways:

View                Meaning
------------------  -----------------------------
Linear combination  weighted sum of features
Projection          move data to a new space
Feature mixing      combine information
Geometry            rotation / scaling of vectors
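One way to make the "linear combination" view concrete: A @ x is exactly a weighted sum of A's columns, with x supplying the weights. A small sanity check (toy sizes of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
x = np.array([2.0, 3.0])

# Matrix-vector product...
y = A @ x

# ...equals a weighted sum of A's columns, weighted by x's entries.
manual = x[0] * A[:, 0] + x[1] * A[:, 1]

assert np.allclose(y, manual)
```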

Example:

Embedding -> attention space

Q = XW_Q

Here the weight matrix W_Q decides how to mix the embedding features into query features.

Python Walkthrough (All Concepts)

import numpy as np

# 1) Linear regression: y = Xw
X = np.random.randn(4, 3)   # 4 samples, 3 features
w = np.random.randn(3, 1)
y = X @ w

# 2) Neural network layer: y = Wx + b
W = np.random.randn(5, 3)   # 5 outputs, 3 inputs
b = np.random.randn(5, 1)
x = np.random.randn(3, 1)
layer_out = W @ x + b

# 3) Linear combination (feature mixing)
v1 = np.array([[1.0], [0.0], [0.0]])
v2 = np.array([[0.0], [1.0], [0.0]])
basis = np.concatenate([v1, v2], axis=1)  # 3 x 2
coeffs = np.array([[2.0], [3.0]])         # 2 x 1
combo = basis @ coeffs                    # 3 x 1

# 4) Projection onto a subspace (orthonormal basis U)
U, _ = np.linalg.qr(np.random.randn(3, 2))  # 3 x 2, columns orthonormal
v = np.random.randn(3, 1)
proj = U @ (U.T @ v)

# 5) Geometry (2D rotation)
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
v2d = np.array([[1.0], [0.0]])
rotated = R @ v2d

# 6) Transformers: Q, K, V and attention
n, d, dk, dv = 3, 4, 4, 4
X = np.random.randn(n, d)
Wq = np.random.randn(d, dk)
Wk = np.random.randn(d, dk)
Wv = np.random.randn(d, dv)

Q = X @ Wq
K = X @ Wk
V = X @ Wv

def softmax(a):
    # subtract the row max for numerical stability before exponentiating
    a = a - a.max(axis=-1, keepdims=True)
    exp = np.exp(a)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = Q @ K.T       # (n, n); real transformers also divide by sqrt(dk)
A = softmax(scores)    # attention weights; each row sums to 1
context = A @ V        # (n, dv) mixture of value vectors

# 7) Shape reasoning example: (1024, 4096) @ (4096, 4096) -> (1024, 4096)
shape_example = (np.random.randn(1024, 4096) @ np.random.randn(4096, 4096)).shape

What AI Researchers Are Comfortable With

Strong ML engineers can immediately reason about shapes like:

(1024 x 4096) x (4096 x 4096)

They know instantly:

  • compute cost
  • memory cost
  • resulting tensor shape

Understanding shapes is extremely important in transformers.
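A rough back-of-the-envelope for that example, using the standard dense-matmul cost (2mkn FLOPs) and assuming float32 storage. This is a sketch of the mental arithmetic, not a profiler:

```python
m, k, n = 1024, 4096, 4096

# One multiply + one add per step of the inner loop -> 2*m*k*n FLOPs.
flops = 2 * m * k * n

# Memory to hold both inputs and the output, at 4 bytes per float32.
bytes_per_float = 4
memory = (m * k + k * n + m * n) * bytes_per_float

print(f"output shape : ({m}, {n})")
print(f"FLOPs        : {flops / 1e9:.1f} GFLOPs")   # ~34.4 GFLOPs
print(f"memory       : {memory / 1e6:.1f} MB")      # ~100.7 MB
```

Real-world numbers also depend on dtype (bf16 halves the memory) and on whether activations are cached, but the shape-driven estimate is the starting point.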

My Personal Take

If someone truly understands:

  • matrix multiplication
  • dot products
  • vector projections
  • gradients

they probably understand 60-70% of modern deep learning math already.

The rest is mainly:

  • optimization
  • probability
  • architecture design

Mental Model I Like

Matrix multiplication = information mixing.

Every neural network layer is basically mixing information between dimensions.

That simple idea explains most of deep learning.
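The "information mixing" idea can be made literal: in a matmul, every output dimension reads from every input dimension. A toy check (hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)       # 4 input dimensions
W = rng.standard_normal((3, 4))  # mixing matrix: 4 dims -> 3 dims

y = W @ x

# Each output dimension j is a weighted blend of ALL input dimensions.
for j in range(3):
    assert np.isclose(y[j], sum(W[j, i] * x[i] for i in range(4)))
```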


Thanks for reading! If you want to see future content, you can follow me on Twitter or connect with me on LinkedIn.


Support My Content

If you find my content helpful, consider supporting a humanitarian cause (building homes for elderly people in rural Terai region of Nepal) that I am planning with your donation:

Ethereum (ETH)

0xB62409A5B227D2aE7D8C66fdaA5EEf4eB4E37959

Thank you for your support!