Contextual Embeddings vs Static Embeddings: How Transformers Understand Meaning
Embeddings are how language models turn text into vectors. Once words or tokens become vectors, models can compare them, cluster them, retrieve them, classify them, and generate from them.
But there are two very different ideas hiding under the word "embedding":
- Static embeddings: each word or token gets one vector, regardless of the sentence it appears in.
- Contextual embeddings: the vector changes depending on the surrounding words.
The Core Idea
The word bank can mean a financial institution:
The bank approved my loan.
Or it can mean the side of a river:
The fisherman sat on the river bank.
A static embedding gives bank the same vector in both sentences.
A contextual embedding gives bank different vectors because the surrounding words are different.
That is the core shift from older word embeddings to modern Transformer representations.
Static Embeddings
Static embeddings map each word to a single learned vector.
"bank" -> [0.12, -0.44, 0.91, ...]
"river" -> [-0.33, 0.71, 0.28, ...]
"loan" -> [0.84, -0.19, 0.42, ...]
Classic examples:
- Word2Vec
- GloVe
- FastText
These models learn from co-occurrence patterns. If two words appear in similar contexts, their vectors become similar.
Example:
king - man + woman ~= queen
This kind of structure made static embeddings powerful. They gave NLP systems a dense semantic space instead of sparse one-hot word IDs.
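The analogy arithmetic can be sketched with a few hand-built vectors. The 2-D values below are illustrative assumptions (one axis loosely standing for royalty, the other for gender), not numbers taken from Word2Vec or GloVe, and `nearest` is a hypothetical helper.

```python
import numpy as np

# Hand-built toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude=()):
    """Return the word whose vector is most cosine-similar to target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# king - man + woman, then look for the closest remaining word.
target = vectors["king"] - vectors["man"] + vectors["woman"]
analogy_result = nearest(target, exclude=("king", "man", "woman"))
print(analogy_result)  # queen
```

The input words are excluded from the search, which is the usual convention when evaluating analogies. In trained embeddings the match is approximate rather than exact.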
Why Static Embeddings Were a Breakthrough
Before embeddings, words were often represented as one-hot vectors:
apple -> [1, 0, 0, 0, 0]
banana -> [0, 1, 0, 0, 0]
loan -> [0, 0, 1, 0, 0]
One-hot vectors do not naturally encode similarity. apple and banana are as unrelated as apple and loan.
Static embeddings fix that:
cosine(apple, banana) > cosine(apple, loan)
They made it possible to represent semantic similarity, analogies, and word neighborhoods.
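A small sketch of that comparison, assuming hand-picked dense vectors (the dense values are made up for illustration; only the one-hot vectors follow the example above):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words is orthogonal.
one_hot = {
    "apple":  np.array([1.0, 0.0, 0.0, 0.0, 0.0]),
    "banana": np.array([0.0, 1.0, 0.0, 0.0, 0.0]),
    "loan":   np.array([0.0, 0.0, 1.0, 0.0, 0.0]),
}

# Toy dense vectors (assumption): chosen so the two fruits sit close together.
dense = {
    "apple":  np.array([0.9, 0.8, 0.1]),
    "banana": np.array([0.8, 0.9, 0.0]),
    "loan":   np.array([0.1, 0.0, 0.9]),
}

one_hot_fruit = cosine(one_hot["apple"], one_hot["banana"])
one_hot_loan = cosine(one_hot["apple"], one_hot["loan"])
dense_fruit = cosine(dense["apple"], dense["banana"])
dense_loan = cosine(dense["apple"], dense["loan"])

print(one_hot_fruit, one_hot_loan)   # both 0.0: no notion of similarity
print(dense_fruit > dense_loan)      # True: the fruits are closer to each other
```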
The Static Embedding Limitation
The limitation is ambiguity.
The word bank has one vector:
embedding("bank") = same vector every time
So these sentences reuse the same bank representation:
The bank approved my loan.
The fisherman sat on the river bank.
The surrounding words can still help later models make a decision, but the original word vector itself does not change.
That means static embeddings struggle with:
- polysemy: one word with multiple meanings
- context-specific meaning
- syntax-sensitive roles
- long-distance dependencies
- words whose meaning changes by domain
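The limitation is easy to see in code. This is a minimal sketch assuming a three-word toy table; `static_table` and `embed_static` are hypothetical names, and the numbers are placeholders rather than trained values.

```python
import numpy as np

# Toy lookup table (assumption): placeholder values, truncated to 3 dimensions.
static_table = {
    "bank":  np.array([0.12, -0.44, 0.91]),
    "river": np.array([-0.33, 0.71, 0.28]),
    "loan":  np.array([0.84, -0.19, 0.42]),
}
unknown = np.zeros(3)  # stand-in vector for words missing from the toy table

def embed_static(sentence):
    """Look up one fixed vector per word; the rest of the sentence is ignored."""
    words = sentence.lower().rstrip(".").split()
    return [(w, static_table.get(w, unknown)) for w in words]

finance = dict(embed_static("The bank approved my loan."))
river = dict(embed_static("The fisherman sat on the river bank."))

# The lookup never sees the neighbors, so both sentences get the same vector.
same_vector = np.array_equal(finance["bank"], river["bank"])
print(same_vector)  # True
```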
Contextual Embeddings
Contextual embeddings solve the ambiguity problem.
Instead of:
word -> vector
they do:
word + surrounding context -> vector
So bank gets different vectors:
bank in "approved my loan" -> finance-like vector
bank in "river bank" -> geography-like vector
The token is the same. The output vector is different.
Contextual Embeddings Are Token Instances
This is the most important practical detail:
A contextual embedding is not just a vector for a word. It is a vector for a word instance in a specific context.
In static embeddings:
embedding("bank") = one reusable vector
In contextual embeddings:
embedding("bank" in "The bank approved my loan") = finance-bank instance vector
embedding("bank" in "The river bank was muddy") = river-bank instance vector
embedding("bank" in "The central bank raised rates") = central-bank instance vector
embedding("bank" in "The plane will bank left") = motion-bank instance vector
Think of it like this:
static embedding = word type vector
contextual embedding = token instance vector
A word type is the abstract vocabulary entry, like bank.
A token instance is one occurrence of that word in one sequence, like the second token in:
The bank approved my loan.
If the same word appears twice in one sentence, a Transformer can produce two different contextual embeddings:
The bank near the river was owned by a bank.
first bank -> river-side meaning
second bank -> finance meaning
Same spelling. Same vocabulary item. Two different token instances. Two different contextual vectors.
That is why contextual embeddings are so useful for:
- entity disambiguation
- word sense disambiguation
- coreference
- search reranking
- question answering
- retrieval-augmented generation
- next-token prediction
Transformer View
A Transformer starts with token embeddings, but those are only the beginning.
The high-level flow:
text
-> tokens
-> token embeddings
-> positional information
-> self-attention
-> contextual token vectors
-> feed-forward layers
-> final hidden states
The first embedding lookup is static inside the model. The same token ID maps to the same initial vector.
The contextual part happens after attention mixes information across tokens.
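The first two steps of that flow can be sketched as follows. The vocabulary, the random `embedding_table`, and the choice of sinusoidal position encoding are all assumptions for illustration; real models learn the table, and many use learned or rotary position schemes instead.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "bank": 1, "near": 2, "river": 3,
         "was": 4, "owned": 5, "by": 6, "a": 7}
d_model = 8
# Random stand-in for learned embedding weights (assumption).
embedding_table = rng.normal(size=(len(vocab), d_model))

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encoding, one common choice among several."""
    pos = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (even_dims / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

tokens = "the bank near the river was owned by a bank".split()
ids = np.array([vocab[t] for t in tokens])

token_vecs = embedding_table[ids]  # static lookup: same ID -> same row
inputs = token_vecs + sinusoidal_positions(len(ids), d_model)

same_lookup = np.array_equal(token_vecs[1], token_vecs[9])  # both are "bank"
same_after_position = np.allclose(inputs[1], inputs[9])
print(same_lookup, same_after_position)  # True False
```

At this point the two `bank` vectors differ only because of their positions. They do not yet reflect the surrounding words; that happens in the attention layers.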
How Self-Attention Creates Context
Self-attention lets each token ask:
Which other tokens in this sentence matter for understanding me?
For the sentence:
The bank approved my loan.
the token bank should attend to:
- approved
- loan
For:
The fisherman sat on the river bank.
the token bank should attend to:
- river
- fisherman
- sat
Attention changes the bank vector by mixing in information from relevant neighboring tokens.
Q, K, V in Plain English
Inside a Transformer, each token vector is projected into:
- Query (Q): what this token is looking for
- Key (K): what this token offers for matching
- Value (V): the information this token contributes if selected
For each token:
query = token_vector * Wq
key = token_vector * Wk
value = token_vector * Wv
Attention scores, scaled by the square root of the key dimension d_k:
scores = QK^T / sqrt(d_k)
Normalize:
weights = softmax(scores)
Mix values:
context = weights * V
That context vector is why the final representation is contextual.
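The steps above can be sketched as single-head self-attention in NumPy. The random inputs and projection matrices are stand-ins for learned weights, the sizes are arbitrary, and the scores are divided by sqrt(d_k) as in standard scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over X (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4              # arbitrary toy sizes (assumption)
X = rng.normal(size=(seq_len, d_model))      # stand-in token vectors
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = self_attention(X, Wq, Wk, Wv)
print(context.shape)         # (5, 4)
print(weights.sum(axis=-1))  # every row sums to 1
```

Row i of `weights` says how much token i draws from each other token, and row i of `context` is the resulting mixture of value vectors.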
Worked Example: bank
Static embedding:
bank -> [0.40, 0.12, -0.22, 0.71]
Finance sentence:
The bank approved my loan
Attention for bank might look like:
| Token | Weight |
|---|---|
| bank | 0.30 |
| approved | 0.25 |
| loan | 0.35 |
| the/my | 0.10 |
River sentence:
The fisherman sat on the river bank
Attention for bank might look like:
| Token | Weight |
|---|---|
| bank | 0.25 |
| river | 0.40 |
| fisherman | 0.20 |
| sat/on/the | 0.15 |
Same initial token embedding. Different attention weights. Different final contextual vector.
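To make the arithmetic concrete, the sketch below mixes toy value vectors with the weights from the two tables. The 2-D value vectors are assumptions (axis 0 loosely "finance", axis 1 loosely "river"), and function words are lumped into a single zero vector named `other`.

```python
import numpy as np

# Toy value vectors (assumption): axis 0 ~ "finance", axis 1 ~ "river".
values = {
    "bank":      np.array([0.5, 0.5]),
    "approved":  np.array([0.8, 0.0]),
    "loan":      np.array([1.0, 0.0]),
    "river":     np.array([0.0, 1.0]),
    "fisherman": np.array([0.0, 0.8]),
    "other":     np.array([0.0, 0.0]),  # function words such as the/my/sat/on
}

# Attention weights for the token "bank", copied from the two tables above.
finance_weights = {"bank": 0.30, "approved": 0.25, "loan": 0.35, "other": 0.10}
river_weights = {"bank": 0.25, "river": 0.40, "fisherman": 0.20, "other": 0.15}

def mix(weights):
    """Weighted sum of value vectors: one row of weights @ V."""
    return sum(w * values[tok] for tok, w in weights.items())

bank_finance = mix(finance_weights)
bank_river = mix(river_weights)
print(bank_finance)  # approximately [0.70, 0.15], leans toward the finance axis
print(bank_river)    # approximately [0.125, 0.685], leans toward the river axis
```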
Static vs Contextual Embeddings
| Dimension | Static embeddings | Contextual embeddings |
|---|---|---|
| Vector per word | One vector | Different vector per context |
| Handles ambiguity | Weakly | Strongly |
| Examples | Word2Vec, GloVe, FastText | ELMo, BERT, GPT-style Transformers |
| Uses surrounding words | During training only | During every forward pass |
| Good for | Simple similarity, classic NLP features, lightweight retrieval | Language understanding, generation, question answering, reranking |
| Cost | Cheap | More expensive |
How Transformers Use Contextual Embeddings
Modern Transformer language models produce contextual hidden states at every layer.
For an encoder model like BERT:
input sentence
-> contextual hidden states
-> classification / token labeling / retrieval embedding
For a decoder model like GPT:
previous tokens
-> contextual hidden state for next-token prediction
-> probability distribution over next token
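In both cases, the hidden state for each token depends on other tokens in the sequence: tokens on both sides in an encoder like BERT, and only the preceding tokens in a decoder like GPT.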
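One practical difference between the two families is which positions a token may attend to. A minimal sketch, assuming random scores in place of real Q and K projections; `attention_weights` is a hypothetical helper:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal=False):
    """Turn raw scores into weights, optionally hiding future positions."""
    if causal:
        seq_len = scores.shape[0]
        # True above the diagonal marks positions that lie in the future.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # stand-in for Q @ K.T / sqrt(d_k)

encoder_style = attention_weights(scores)               # every token sees all tokens
decoder_style = attention_weights(scores, causal=True)  # token i sees tokens 0..i

print(np.round(decoder_style, 2))  # zeros above the diagonal
```

With the causal mask, the first token can only attend to itself, so its weight row is `[1, 0, 0, 0]`.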
Example: Sentiment
Sentence:
The movie was not good.
Static embeddings know good is positive, but they do not change good because of not.
A contextual model can let good attend to not, so the vector for good shifts toward the negated meaning.
That is why contextual embeddings are better for:
- negation
- sarcasm
- scope
- entity disambiguation
- grammar-sensitive meaning
Example: Retrieval
Static embedding search:
query: "apple stock price"
The word apple has one vector, which blends the fruit and company senses seen during training, so it cannot favor the company sense for this query.
Contextual retrieval model:
"apple" near "stock price" -> company meaning
"apple" near "pie recipe" -> fruit meaning
This helps semantic search and retrieval-augmented generation.
Embedding Levels
Embeddings can exist at several levels:
Token Embedding
One vector per vocabulary token, looked up without context.
bank -> vector
Contextual Token Embedding
One vector per token occurrence, computed from the whole sentence.
bank in sentence A -> vector A
bank in sentence B -> vector B
Sentence Embedding
One vector for a whole sentence.
"The bank approved my loan." -> vector
Document Embedding
One vector for a larger chunk.
paragraph / page / document -> vector
Transformer systems often build sentence or document embeddings by pooling contextual token vectors.
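Mean pooling is one common way to do this (others include taking a special token's vector). The sketch below assumes random hidden states and a 0/1 padding mask; `mean_pool` is a hypothetical helper, not a library function.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average contextual token vectors, ignoring padding (mask == 0).

    token_vectors: (batch, seq_len, hidden)
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 3))  # stand-in for final hidden states
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])      # second sentence has two padding slots

sentence_vecs = mean_pool(hidden, mask)
print(sentence_vecs.shape)  # (2, 3): one vector per sentence
```

Masking matters because padded positions would otherwise drag the average toward whatever the model emits for padding tokens.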
Practical Uses
Use static embeddings when:
- the task is simple
- latency matters more than nuance
- you need lightweight similarity
- vocabulary is stable
- context is not central
Use contextual embeddings when:
- meaning depends on context
- words are ambiguous
- you need ranking or generation quality
- you are doing question answering
- you need semantic retrieval
- you are using Transformer models
Mental Model
Static embedding:
dictionary lookup
Contextual embedding:
dictionary lookup + sentence-aware attention
Transformer:
token embeddings
-> self-attention mixes context
-> contextual hidden states
-> task head or next-token prediction
Final Takeaway
Static embeddings know that words have meaning.
Contextual embeddings know that meaning changes in context.
Transformers work because they start with token vectors, then repeatedly rewrite those vectors by letting each token attend to other tokens.