May 24, 2026

Contextual Embeddings vs Static Embeddings: How Transformers Understand Meaning

Embeddings are how language models turn text into vectors. Once words or tokens become vectors, models can compare them, cluster them, retrieve them, classify them, and generate from them.

But there are two very different ideas hiding under the word "embedding":

  • Static embeddings: one word or token gets one vector, regardless of sentence.
  • Contextual embeddings: the vector changes depending on the surrounding words.


The Core Idea

The word bank can mean a financial institution:

The bank approved my loan.

Or it can mean the side of a river:

The fisherman sat on the river bank.

A static embedding gives bank the same vector in both sentences.

A contextual embedding gives bank different vectors because the surrounding words are different.

That is the whole shift from older word embeddings to modern Transformer representations.

Static Embeddings

Static embeddings map each word to a single learned vector.

"bank" -> [0.12, -0.44, 0.91, ...]
"river" -> [-0.33, 0.71, 0.28, ...]
"loan" -> [0.84, -0.19, 0.42, ...]

Classic examples:

  • Word2Vec
  • GloVe
  • FastText

These models learn from co-occurrence patterns. If two words appear in similar contexts, their vectors become similar.

Example:

king - man + woman ~= queen

This kind of structure made static embeddings powerful. They gave NLP systems a dense semantic space instead of sparse one-hot word IDs.
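The analogy arithmetic can be reproduced on a toy scale. The sketch below is only an illustration: the three-dimensional vectors are hand-picked (roughly "royalty", "gender", "food") rather than learned by Word2Vec or GloVe, and the `analogy` helper is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption: not from any trained model).
TOY = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", TOY)
print(result)
```

With real embeddings the match is approximate, which is why the relation is written with ~= rather than =.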

Why Static Embeddings Were a Breakthrough

Before embeddings, words were often represented as one-hot vectors:

apple  -> [1, 0, 0, 0, 0]
banana -> [0, 1, 0, 0, 0]
loan   -> [0, 0, 1, 0, 0]

One-hot vectors do not naturally encode similarity. apple and banana are as unrelated as apple and loan.

Static embeddings fix that:

cosine(apple, banana) > cosine(apple, loan)

They made it possible to represent semantic similarity, analogies, and word neighborhoods.
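To make the comparison concrete, here is a small sketch that computes cosine similarity for both representations. The one-hot vectors are the ones shown above; the dense vectors are made-up two-dimensional values chosen for illustration, not output from a trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors: every pair of distinct words is orthogonal.
one_hot = {
    "apple":  [1, 0, 0, 0, 0],
    "banana": [0, 1, 0, 0, 0],
    "loan":   [0, 0, 1, 0, 0],
}

# Toy dense vectors (assumed values): fruit words point in a similar direction.
dense = {
    "apple":  [0.9, 0.1],
    "banana": [0.8, 0.2],
    "loan":   [0.1, 0.9],
}

one_hot_fruit = cosine(one_hot["apple"], one_hot["banana"])
one_hot_loan = cosine(one_hot["apple"], one_hot["loan"])
dense_fruit = cosine(dense["apple"], dense["banana"])
dense_loan = cosine(dense["apple"], dense["loan"])

print(one_hot_fruit, one_hot_loan)   # both similarities are zero
print(dense_fruit > dense_loan)      # the dense space ranks banana closer to apple
```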

The Static Embedding Limitation

The limitation is ambiguity.

The word bank has one vector:

embedding("bank") = same vector every time

So these sentences reuse the same bank representation:

The bank approved my loan.
The fisherman sat on the river bank.

The surrounding words can still help later models make a decision, but the original word vector itself does not change.
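A minimal sketch of that behavior, assuming a toy lookup table: the three-number vectors are illustrative stand-ins, and `static_embedding` is a hypothetical helper that accepts the sentence only to show that it is ignored.

```python
# Toy static table: one fixed vector per word type (illustrative values).
STATIC_TABLE = {
    "bank":  [0.12, -0.44, 0.91],
    "river": [-0.33, 0.71, 0.28],
    "loan":  [0.84, -0.19, 0.42],
}

def static_embedding(word, sentence):
    """Return the vector for `word`; `sentence` is deliberately unused."""
    return STATIC_TABLE[word]

finance_bank = static_embedding("bank", "The bank approved my loan.")
river_bank = static_embedding("bank", "The fisherman sat on the river bank.")

# The lookup cannot tell the two senses apart.
print(finance_bank == river_bank)
```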

That means static embeddings struggle with:

  • polysemy: one word with multiple meanings
  • context-specific meaning
  • syntax-sensitive roles
  • long-distance dependencies
  • words whose meaning changes by domain

Contextual Embeddings

Contextual embeddings address the ambiguity problem.

Instead of:

word -> vector

they do:

word + surrounding context -> vector

So bank gets different vectors:

bank in "approved my loan" -> finance-like vector
bank in "river bank"       -> geography-like vector

The token is the same. The output vector is different.

Contextual Embeddings Are Token Instances

This is the most important practical detail:

A contextual embedding is not just a vector for a word. It is a vector for a word instance in a specific context.

In static embeddings:

embedding("bank") = one reusable vector

In contextual embeddings:

embedding("bank" in "The bank approved my loan")       = finance-bank instance vector
embedding("bank" in "The river bank was muddy")        = river-bank instance vector
embedding("bank" in "The central bank raised rates")   = central-bank instance vector
embedding("bank" in "The plane will bank left")        = motion-bank instance vector

Think of it like this:

static embedding     = word type vector
contextual embedding = token instance vector

A word type is the abstract vocabulary entry, like bank.

A token instance is one occurrence of that word in one sequence, like the second token in:

The bank approved my loan.

If the same word appears twice in one sentence, a Transformer can produce two different contextual embeddings:

The bank near the river was owned by a bank.
    ^ river-side meaning              ^ finance meaning

Same spelling. Same vocabulary item. Two different token instances. Two different contextual vectors.

That is why contextual embeddings are so useful for:

  • entity disambiguation
  • word sense disambiguation
  • coreference
  • search reranking
  • question answering
  • retrieval-augmented generation
  • next-token prediction

Transformer View

A Transformer starts with token embeddings, but those are only the beginning.

The high-level flow:

text
  -> tokens
  -> token embeddings
  -> positional information
  -> self-attention
  -> contextual token vectors
  -> feed-forward layers
  -> final hidden states

The first embedding lookup is static inside the model. The same token ID maps to the same initial vector.

The contextual part happens after attention mixes information across tokens.

How Self-Attention Creates Context

Self-attention lets each token ask:

Which other tokens in this sentence matter for understanding me?

For the sentence:

The bank approved my loan.

the token bank should attend to:

  • approved
  • loan

For:

The fisherman sat on the river bank.

the token bank should attend to:

  • river
  • fisherman
  • sat

Attention changes the bank vector by mixing in information from the relevant tokens, wherever they appear in the sentence.

Q, K, V in Plain English

Inside a Transformer, each token vector is projected into:

  • Query (Q): what this token is looking for
  • Key (K): what this token offers for matching
  • Value (V): the information this token contributes if selected

For each token:

query = token_vector * Wq
key   = token_vector * Wk
value = token_vector * Wv

Attention scores, scaled by the square root of the key dimension d_k so the softmax does not saturate:

scores = QK^T / sqrt(d_k)

Normalize:

weights = softmax(scores)

Mix values:

context = weights * V

That context vector is why the final representation is contextual.
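The steps above fit in a few lines of plain Python. This is a single-head sketch under simplifying assumptions: no batching, no masking, no multi-head split, and identity matrices standing in for the learned projections Wq, Wk, Wv. It also divides the scores by sqrt(d_k), the scaling used in standard scaled dot-product attention. The function names are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over x, a (tokens x d_model) matrix."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]            # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]      # each row sums to 1
    context = matmul(weights, v)                    # weighted mix of values
    return weights, context

# Three toy token vectors; identity projections are an assumption for clarity.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = self_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how much one token draws from every token in the sequence, and the matching row of `context` is that token's new, context-dependent vector.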

Worked Example: bank

Static embedding:

bank -> [0.40, 0.12, -0.22, 0.71]

Finance sentence:

The bank approved my loan

Attention for bank might look like:

Token     Weight
bank      0.30
approved  0.25
loan      0.35
the/my    0.10

River sentence:

The fisherman sat on the river bank

Attention for bank might look like:

Token       Weight
bank        0.25
river       0.40
fisherman   0.20
sat/on/the  0.15

Same initial token embedding. Different attention weights. Different final contextual vector.
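To see how different weights lead to different vectors, the sketch below applies the two sets of attention weights from this example to toy value vectors. The value vectors are assumptions made up for illustration: two dimensions, loosely "finance" and "geography", with function words set to zero. Only the weights come from the example above.

```python
def mix(weights, values):
    """Weighted sum of value vectors: one row of `weights * V`."""
    dim = len(next(iter(values.values())))
    out = [0.0] * dim
    for token, w in weights.items():
        for i, v in enumerate(values[token]):
            out[i] += w * v
    return out

# Toy value vectors as [finance, geography] (assumed, not learned).
values = {
    "bank":       [0.5, 0.5],
    "approved":   [0.8, 0.1],
    "loan":       [1.0, 0.0],
    "the/my":     [0.0, 0.0],
    "river":      [0.0, 1.0],
    "fisherman":  [0.1, 0.8],
    "sat/on/the": [0.0, 0.0],
}

finance_weights = {"bank": 0.30, "approved": 0.25, "loan": 0.35, "the/my": 0.10}
river_weights = {"bank": 0.25, "river": 0.40, "fisherman": 0.20, "sat/on/the": 0.15}

finance_bank = mix(finance_weights, values)
river_bank = mix(river_weights, values)

print([round(v, 3) for v in finance_bank])  # leans toward the finance dimension
print([round(v, 3) for v in river_bank])    # leans toward the geography dimension
```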

Static vs Contextual Embeddings

Dimension | Static embeddings | Contextual embeddings
--- | --- | ---
Vector per word | One vector | Different vector per context
Handles ambiguity | Weakly | Strongly
Examples | Word2Vec, GloVe, FastText | ELMo, BERT, GPT-style Transformers
Uses surrounding words | During training only | During every forward pass
Good for | Simple similarity, classic NLP features, lightweight retrieval | Language understanding, generation, question answering, reranking
Cost | Cheap | More expensive

How Transformers Use Contextual Embeddings

Modern Transformer language models produce contextual hidden states at every layer.

For an encoder model like BERT:

input sentence
  -> contextual hidden states
  -> classification / token labeling / retrieval embedding

For a decoder model like GPT:

previous tokens
  -> contextual hidden state for next-token prediction
  -> probability distribution over next token

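In both cases, the hidden state for each token depends on other tokens in the sequence: tokens on both sides in a bidirectional encoder like BERT, and only the preceding tokens in a causal decoder like GPT.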
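One detail worth making concrete is which tokens each position may attend to. In a BERT-style encoder, every token can look at the whole sentence; in a GPT-style decoder, a causal mask restricts each token to itself and earlier positions. The sketch below uses uniform toy scores so that only the mask affects the result; `attention_weights` is an illustrative helper, not a library function.

```python
import math

def softmax(row):
    """Stable softmax; entries of -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Turn a square score matrix into attention weights, optionally masked."""
    n = len(scores)
    rows = []
    for i in range(n):
        # With a causal mask, position i may not look at positions j > i.
        row = [scores[i][j] if (not causal or j <= i) else float("-inf")
               for j in range(n)]
        rows.append(softmax(row))
    return rows

scores = [[0.0] * 3 for _ in range(3)]       # uniform toy scores

encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)

print(encoder_style[0])   # first token spreads weight over all three tokens
print(decoder_style[0])   # first token can only attend to itself
```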

Example: Sentiment

Sentence:

The movie was not good.

Static embeddings know good is positive, but they do not change good because of not.

A contextual model can make good attend to not, changing the representation.

That is why contextual embeddings are better for:

  • negation
  • sarcasm
  • scope
  • entity disambiguation
  • grammar-sensitive meaning

Example: Retrieval

Static embedding search:

query: "apple stock price"

The word apple has one vector, and that single vector blends the fruit and company senses, so the query can match documents about either.

Contextual retrieval model:

"apple" near "stock price" -> company meaning
"apple" near "pie recipe"  -> fruit meaning

This helps semantic search and retrieval-augmented generation.

Embedding Levels

Embeddings can exist at several levels:

Token Embedding

One fixed vector per vocabulary token, the same in every sentence.

bank -> vector

Contextual Token Embedding

One vector per token occurrence in a specific sentence.

bank in sentence A -> vector A
bank in sentence B -> vector B

Sentence Embedding

One vector for a whole sentence.

"The bank approved my loan." -> vector

Document Embedding

One vector for a larger chunk.

paragraph / page / document -> vector

Transformer systems often build sentence or document embeddings by pooling contextual token vectors.
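Mean pooling is one common way to do this. The sketch below averages the contextual token vectors, skipping padding positions. The `mask` argument and the toy vectors are assumptions for illustration; real systems may instead use a special token's hidden state or a learned pooling layer.

```python
def mean_pool(token_vectors, mask=None):
    """Average token vectors; `mask` marks real tokens (1) vs padding (0)."""
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("cannot pool an empty sequence")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens plus one padding position (toy values).
contextual_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
sentence_vector = mean_pool(contextual_tokens, mask=[1, 1, 0])

print(sentence_vector)  # the padding vector does not affect the average
```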

Practical Uses

Use static embeddings when:

  • the task is simple
  • latency matters more than nuance
  • you need lightweight similarity
  • vocabulary is stable
  • context is not central

Use contextual embeddings when:

  • meaning depends on context
  • words are ambiguous
  • you need ranking or generation quality
  • you are doing question answering
  • you need semantic retrieval
  • you are using Transformer models

Mental Model

Static embedding:

dictionary lookup

Contextual embedding:

dictionary lookup + sentence-aware attention

Transformer:

token embeddings
  -> self-attention mixes context
  -> contextual hidden states
  -> task head or next-token prediction

Final Takeaway

Static embeddings know that words have meaning.

Contextual embeddings know that meaning changes in context.

Transformers work because they start with token vectors, then repeatedly rewrite those vectors by letting each token attend to other tokens.
