May 24, 2026

Contextual Embeddings vs Static Embeddings: How Transformers Understand Meaning

Embeddings are how language models turn text into vectors. Once words or tokens become vectors, models can compare them, cluster them, retrieve them, classify them, and generate from them.

But there are two very different ideas hiding under the word "embedding":

Static embeddings: one word or token gets one vector, regardless of sentence.
Contextual embeddings: the vector changes depending on the surrounding words.

The Core Idea

The word bank can mean a financial institution:

The bank approved my loan.

Or it can mean the side of a river:

The fisherman sat on the river bank.

A static embedding gives bank the same vector in both sentences.

A contextual embedding gives bank different vectors because the surrounding words are different.

That is the core shift from older word embeddings to modern Transformer representations.

Static Embeddings

Static embeddings map each word to a single learned vector.

"bank" -> [0.12, -0.44, 0.91, ...]
"river" -> [-0.33, 0.71, 0.28, ...]
"loan" -> [0.84, -0.19, 0.42, ...]

Classic examples:

Word2Vec
GloVe
FastText

These models learn from co-occurrence patterns. If two words appear in similar contexts, their vectors become similar.

Example:

king - man + woman ~= queen

This kind of structure made static embeddings powerful. They gave NLP systems a dense semantic space instead of sparse one-hot word IDs.
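The analogy arithmetic can be sketched with a toy table. The 3-dimensional vectors below are hand-picked for illustration (assumed axes: royalty, male, female); real Word2Vec or GloVe vectors are learned from a corpus and have far more dimensions, and the helper names are hypothetical.

```python
import math

# Hypothetical toy vectors: [royalty, male, female]. Not learned values.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is a common convention in analogy evaluation; without it, the nearest neighbor is often one of the inputs.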

Why Static Embeddings Were a Breakthrough

Before embeddings, words were often represented as one-hot vectors:

apple  -> [1, 0, 0, 0, 0]
banana -> [0, 1, 0, 0, 0]
loan   -> [0, 0, 1, 0, 0]

One-hot vectors do not naturally encode similarity. apple and banana are as unrelated as apple and loan.

Static embeddings fix that:

cosine(apple, banana) > cosine(apple, loan)

They made it possible to represent semantic similarity, analogies, and word neighborhoods.
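A small sketch makes the contrast concrete. The one-hot vectors are the ones shown above; the dense vectors are made-up illustrative values, not output from a trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

one_hot = {
    "apple":  [1, 0, 0, 0, 0],
    "banana": [0, 1, 0, 0, 0],
    "loan":   [0, 0, 1, 0, 0],
}

# Hypothetical dense vectors, chosen so the two fruits point in a similar direction.
dense = {
    "apple":  [0.8, 0.6, 0.1],
    "banana": [0.7, 0.7, 0.0],
    "loan":   [0.0, 0.1, 0.9],
}

# Every pair of distinct one-hot vectors is orthogonal, so similarity is always 0.
print(cosine(one_hot["apple"], one_hot["banana"]))  # 0.0
print(cosine(one_hot["apple"], one_hot["loan"]))    # 0.0

# Dense vectors can place related words closer together.
print(cosine(dense["apple"], dense["banana"]) > cosine(dense["apple"], dense["loan"]))  # True
```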

The Static Embedding Limitation

The limitation is ambiguity.

The word bank has one vector:

embedding("bank") = same vector every time

So these sentences reuse the same bank representation:

The bank approved my loan.
The fisherman sat on the river bank.

The surrounding words can still help later models make a decision, but the original word vector itself does not change.
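As a minimal sketch, a static embedding behaves like a plain dictionary lookup. The table reuses the illustrative numbers from earlier (truncated to three dimensions); the whitespace tokenizer, the `UNKNOWN` fallback, and the function name are assumptions made for the example.

```python
# Illustrative static table; the values are examples, not learned weights.
STATIC_TABLE = {
    "bank":  [0.12, -0.44, 0.91],
    "river": [-0.33, 0.71, 0.28],
    "loan":  [0.84, -0.19, 0.42],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for out-of-vocabulary words

def embed_static(sentence):
    """Look up each lowercased word; the rest of the sentence is never consulted."""
    words = sentence.lower().rstrip(".").split()
    return [(word, STATIC_TABLE.get(word, UNKNOWN)) for word in words]

finance = dict(embed_static("The bank approved my loan."))
river = dict(embed_static("The fisherman sat on the river bank."))

# The lookup has no way to tell the two senses apart.
print(finance["bank"] == river["bank"])  # True
```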

That means static embeddings struggle with:

polysemy: one word with multiple meanings
context-specific meaning
syntax-sensitive roles
long-distance dependencies
words whose meaning changes by domain

Contextual Embeddings

Contextual embeddings address the ambiguity problem.

Instead of:

word -> vector

they do:

word + surrounding context -> vector

So bank gets different vectors:

bank in "approved my loan" -> finance-like vector
bank in "river bank"       -> geography-like vector

The token is the same. The output vector is different.

Contextual Embeddings Are Token Instances

This is the most important practical detail:

A contextual embedding is not just a vector for a word. It is a vector for a word instance in a specific context.

In static embeddings:

embedding("bank") = one reusable vector

In contextual embeddings:

embedding("bank" in "The bank approved my loan")       = finance-bank instance vector
embedding("bank" in "The river bank was muddy")        = river-bank instance vector
embedding("bank" in "The central bank raised rates")   = central-bank instance vector
embedding("bank" in "The plane will bank left")        = motion-bank instance vector

Think of it like this:

static embedding     = word type vector
contextual embedding = token instance vector

A word type is the abstract vocabulary entry, like bank.

A token instance is one occurrence of that word in one sequence, like the second token in:

The bank approved my loan.

If the same word appears twice in one sentence, a Transformer can produce two different contextual embeddings:

The bank near the river was owned by a bank.
    ^ river-side meaning              ^ finance meaning

Same spelling. Same vocabulary item. Two different token instances. Two different contextual vectors.

That is why contextual embeddings are so useful for:

entity disambiguation
word sense disambiguation
coreference
search reranking
question answering
retrieval-augmented generation
next-token prediction

Transformer View

A Transformer starts with token embeddings, but those are only the beginning.

The high-level flow:

text
  -> tokens
  -> token embeddings
  -> positional information
  -> self-attention
  -> contextual token vectors
  -> feed-forward layers
  -> final hidden states

The first embedding lookup is static inside the model. The same token ID maps to the same initial vector.

The contextual part happens after attention mixes information across tokens.
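The first two steps of the flow can be sketched as follows. This example assumes sinusoidal positional encodings, which is one option among several (many models use learned or rotary position schemes instead), and a hypothetical 4-dimensional embedding table with illustrative values.

```python
import math

# Hypothetical 4-dimensional token embedding table (illustrative values).
TOKEN_TABLE = {
    "the":   [0.10, 0.00, 0.05, 0.20],
    "bank":  [0.40, 0.12, -0.22, 0.71],
    "river": [-0.33, 0.71, 0.28, 0.05],
}

def positional_encoding(position, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def model_inputs(tokens):
    """Static lookup plus position: the vectors fed to the first attention layer."""
    inputs = []
    for position, token in enumerate(tokens):
        embedding = TOKEN_TABLE[token]
        pe = positional_encoding(position, len(embedding))
        inputs.append([e + p for e, p in zip(embedding, pe)])
    return inputs

tokens = ["the", "bank", "the", "river", "bank"]
inputs = model_inputs(tokens)

# Both "bank" tokens share one table entry, but their positions differ.
print(TOKEN_TABLE[tokens[1]] == TOKEN_TABLE[tokens[4]])  # True
print(inputs[1] == inputs[4])                            # False
```

At this stage the two bank vectors differ only by position. Neither one reflects the meaning of the surrounding words yet; that comes from attention.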

How Self-Attention Creates Context

Self-attention lets each token ask:

Which other tokens in this sentence matter for understanding me?

For the sentence:

The bank approved my loan.

the token bank should attend to:

approved
loan

For:

The fisherman sat on the river bank.

the token bank should attend to:

river
fisherman
sat

Attention changes the bank vector by mixing in information from the relevant tokens, wherever they sit in the sentence.

Q, K, V in Plain English

Inside a Transformer, each token vector is projected into:

Query (Q): what this token is looking for
Key (K): what this token offers for matching
Value (V): the information this token contributes if selected

For each token:

query = token_vector * Wq
key   = token_vector * Wk
value = token_vector * Wv

Attention scores, scaled by the square root of the key dimension d_k:

scores = QK^T / sqrt(d_k)

Normalize:

weights = softmax(scores)

Mix values:

context = weights * V

That context vector is why the final representation is contextual.
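The steps above can be sketched in plain Python as single-head scaled dot-product attention, where scores are divided by the square root of the key dimension before the softmax. The helper names are hypothetical, and the identity projection matrices are an assumption chosen to keep the numbers readable; a real model learns Wq, Wk, and Wv.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Return (context vectors, attention weights) for one attention head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights

# Three tokens with 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = self_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each output row of `context` is a weighted average of the value vectors, so every token's new vector depends on the other tokens in the sequence.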

Worked Example: `bank`

Static embedding:

bank -> [0.40, 0.12, -0.22, 0.71]

Finance sentence:

The bank approved my loan

Attention for bank might look like:

Token	Weight
bank	0.30
approved	0.25
loan	0.35
the/my	0.10

River sentence:

The fisherman sat on the river bank

Attention for bank might look like:

Token	Weight
bank	0.25
river	0.40
fisherman	0.20
sat/on/the	0.15

Same initial token embedding. Different attention weights. Different final contextual vector.
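To make the mixing step concrete, here is a small sketch that applies the two sets of attention weights above to toy value vectors. The 2-dimensional values (assumed axes: finance-ness, nature-ness) are invented for illustration, and the function words are collapsed into a single "other" entry.

```python
# Hypothetical 2-dimensional value vectors: [finance-ness, nature-ness].
VALUES = {
    "bank":      [0.5, 0.5],  # ambiguous starting point
    "approved":  [0.8, 0.0],
    "loan":      [1.0, 0.0],
    "river":     [0.0, 1.0],
    "fisherman": [0.1, 0.8],
    "other":     [0.0, 0.0],  # stands in for the/my and sat/on/the
}

def mix(weights):
    """Weighted sum of value vectors for one attention row."""
    dim = len(VALUES["bank"])
    return [sum(w * VALUES[token][i] for token, w in weights.items()) for i in range(dim)]

# Attention weights taken from the two tables above.
finance_weights = {"bank": 0.30, "approved": 0.25, "loan": 0.35, "other": 0.10}
river_weights = {"bank": 0.25, "river": 0.40, "fisherman": 0.20, "other": 0.15}

finance_bank = mix(finance_weights)
river_bank = mix(river_weights)

print([round(v, 3) for v in finance_bank])  # [0.7, 0.15]
print([round(v, 3) for v in river_bank])    # [0.145, 0.685]
```

The same starting value for bank ends up leaning toward finance in one sentence and toward nature in the other, purely because the weights differ.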

Static vs Contextual Embeddings

Dimension	Static embeddings	Contextual embeddings
Vector per word	One vector	Different vector per context
Handles ambiguity	Weakly	Strongly
Examples	Word2Vec, GloVe, FastText	ELMo, BERT, GPT-style Transformers
Uses surrounding words	During training only	During every forward pass
Good for	Simple similarity, classic NLP features, lightweight retrieval	Language understanding, generation, question answering, reranking
Cost	Cheap	More expensive

How Transformers Use Contextual Embeddings

Modern Transformer language models produce contextual hidden states at every layer.

For an encoder model like BERT:

input sentence
  -> contextual hidden states
  -> classification / token labeling / retrieval embedding

For a decoder model like GPT:

previous tokens
  -> contextual hidden state for next-token prediction
  -> probability distribution over next token

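In both cases, the hidden state for each token depends on other tokens in the sequence: tokens on both sides in a bidirectional encoder like BERT, and only the preceding tokens in a causal decoder like GPT.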
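One practical difference between the two model families is which tokens each position is allowed to see. A minimal sketch of the two attention masks, using a hypothetical helper and word-level tokens for readability:

```python
def attention_mask(num_tokens, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(num_tokens)]
            for i in range(num_tokens)]

tokens = ["The", "bank", "approved", "my", "loan"]
encoder_mask = attention_mask(len(tokens), causal=False)  # BERT-style: both directions
decoder_mask = attention_mask(len(tokens), causal=True)   # GPT-style: left context only

bank = tokens.index("bank")
print([t for t, ok in zip(tokens, encoder_mask[bank]) if ok])
# ['The', 'bank', 'approved', 'my', 'loan']
print([t for t, ok in zip(tokens, decoder_mask[bank]) if ok])
# ['The', 'bank']
```

In the causal case, the vector for bank cannot use approved or loan, because those tokens come later. Tokens after bank can still attend back to it.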

Example: Sentiment

Sentence:

The movie was not good.

A static embedding places good near other positive words, and that vector stays the same even when not comes right before it.

A contextual model can make good attend to not, changing the representation.

That is why contextual embeddings are better for:

negation
sarcasm
scope
entity disambiguation
grammar-sensitive meaning

Example: Retrieval

Static embedding search:

query: "apple stock price"

The word apple has one vector, which has to blend the fruit and company meanings.

Contextual retrieval model:

"apple" near "stock price" -> company meaning
"apple" near "pie recipe"  -> fruit meaning

This helps semantic search and retrieval-augmented generation.
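The ranking step itself is usually a similarity comparison between a query vector and document vectors. The sketch below assumes a contextual model has already produced the vectors; the 2-dimensional values (assumed axes: company-ness, fruit-ness), the document titles, and the `search` helper are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: [company-ness, fruit-ness].
DOCS = {
    "Apple shares rose after earnings": [0.9, 0.1],
    "Classic apple pie with cinnamon":  [0.1, 0.9],
}
QUERIES = {
    "apple stock price": [0.8, 0.2],
    "apple pie recipe":  [0.2, 0.8],
}

def search(query):
    """Return the document whose vector is most similar to the query vector."""
    return max(DOCS, key=lambda doc: cosine(QUERIES[query], DOCS[doc]))

print(search("apple stock price"))  # Apple shares rose after earnings
print(search("apple pie recipe"))   # Classic apple pie with cinnamon
```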

Embedding Levels

Embeddings can exist at several levels:

Token Embedding

One vector per vocabulary token, looked up without context.

bank -> vector

Contextual Token Embedding

One vector per token occurrence in a sentence.

bank in sentence A -> vector A
bank in sentence B -> vector B

Sentence Embedding

One vector for a whole sentence.

"The bank approved my loan." -> vector

Document Embedding

One vector for a larger chunk.

paragraph / page / document -> vector

Transformer systems often build sentence or document embeddings by pooling contextual token vectors.
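Mean pooling is one common way to do this; other options include using a dedicated classification token or max pooling. A minimal sketch, assuming the contextual vectors are already computed and that a mask marks padding positions with 0:

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens; padding positions (mask 0) are skipped."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical contextual vectors for three real tokens plus one padding slot.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(mean_pool(token_vectors, attention_mask))  # [1.0, 1.0]
```

Skipping masked positions matters when sentences are padded to a common length; otherwise the padding vectors would drag the average away from the real content.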

Practical Uses

Use static embeddings when:

the task is simple
latency matters more than nuance
you need lightweight similarity
vocabulary is stable
context is not central

Use contextual embeddings when:

meaning depends on context
words are ambiguous
you need ranking or generation quality
you are doing question answering
you need semantic retrieval
you are using Transformer models

Mental Model

Static embedding:

dictionary lookup

Contextual embedding:

dictionary lookup + sentence-aware attention

Transformer:

token embeddings
  -> self-attention mixes context
  -> contextual hidden states
  -> task head or next-token prediction

Final Takeaway

Static embeddings know that words have meaning.

Contextual embeddings know that meaning changes in context.

Transformers work because they start with token vectors, then repeatedly rewrite those vectors by letting each token attend to other tokens.
