Tokenization in Modern NLP and LLMs
Tokenization converts raw text into discrete units (tokens) and maps them to integer IDs. This step determines how efficiently a model uses its context window, how large embeddings need to be, and how well the system generalizes to new inputs.
Introduction
Neural networks cannot process raw text directly. A tokenizer transforms text like:
"I love Nepal"
into a sequence of IDs such as:
[1045, 2293, 8224, ...]
Tokenization affects:
- Context window usage
- Memory cost
- Model size and capacity
- Generalization to rare or new words
Evolution of Tokenization
Word-level
"I love Nepal" -> ["I", "love", "Nepal"]
Problems:
- Large vocabulary
- Out-of-vocabulary (OOV) words
- Poor generalization
Subword tokenization
observability -> observ + ability
Benefits:
- Smaller vocabulary
- Handles unseen words
- Better compression
Byte-level tokenization
Tokenization can operate directly on UTF-8 bytes.
Advantages:
- No unknown tokens
- Handles emojis and rare scripts
- Language-agnostic behavior
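As a quick illustration of why byte-level schemes never hit unknown tokens, any string decomposes into UTF-8 bytes with plain Python (the example string is arbitrary):

text = "Nepal 🙂"
print(list(text.encode("utf-8")))  # the emoji alone contributes four bytes: 240, 159, 153, 130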
Major Algorithms
Byte Pair Encoding (BPE)
Training algorithm:
- Start with characters
- Count most frequent adjacent pairs
- Merge the most frequent pair
- Repeat until the target vocabulary size is reached
Example merges:
Initial: l o w l o w e r l o w e s t
Merge (l, o) -> lo
Merge (lo, w) -> low
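The merge loop can be sketched in a few lines of plain Python. The word frequencies below are hypothetical; the procedure mirrors the steps above: count adjacent pairs, merge the most frequent one, and repeat.

from collections import Counter

# Toy corpus: word (as a tuple of symbols) -> frequency (hypothetical counts)
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "w", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    # Rewrite every word with the chosen pair fused into one symbol
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(2):  # two merge steps: (l, o) -> lo, then (lo, w) -> low
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print("merged", pair)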
WordPiece
Used in BERT-style models.
Differences from BPE:
- BPE merges the most frequent pairs
- WordPiece merges the pair that maximizes likelihood improvement
Example:
playing -> play + ##ing
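One commonly described approximation of the WordPiece criterion scores a candidate merge by pair frequency divided by the product of its parts' frequencies, which tracks the likelihood gain from merging. The counts below are hypothetical; the sketch shows how an informative pair can beat a merely frequent one.

# Hypothetical counts: ("th", "##e") is far more frequent, but ("play", "##ing") scores higher
pair_freq = {("play", "##ing"): 30, ("th", "##e"): 500}
unit_freq = {"play": 40, "##ing": 60, "th": 900, "##e": 1200}

def score(pair):
    a, b = pair
    return pair_freq[pair] / (unit_freq[a] * unit_freq[b])

best = max(pair_freq, key=score)
print(best, score(best))  # ('play', '##ing') wins despite the lower raw frequency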
Unigram language model
Used in T5-style models.
Process:
- Start with a large candidate vocabulary
- Iteratively remove the tokens whose removal hurts the corpus likelihood the least
- Keep the most probable tokens
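The pruning loop is handled internally by libraries. A minimal Unigram training sketch with the Hugging Face tokenizers library, assuming a local data.txt file; the library runs the likelihood estimation and pruning for you:

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())        # starts empty; the trainer builds and prunes the vocab
tokenizer.pre_tokenizer = Whitespace()
trainer = UnigramTrainer(vocab_size=8000, unk_token="<unk>", special_tokens=["<unk>"])
tokenizer.train(["data.txt"], trainer)
print(tokenizer.encode("observability").tokens)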
SentencePiece
SentencePiece supports both BPE and Unigram and avoids pre-splitting on whitespace.
Example:
I love Nepal -> ▁I ▁love ▁Nepal (the ▁ metasymbol marks whitespace)
Tokenization Pipeline
Raw Text
-> Normalizer
-> PreTokenizer
-> Tokenization Model (BPE / WordPiece / Unigram)
-> PostProcessor
-> Token IDs
-> Embedding Lookup
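With the Hugging Face tokenizers library, each stage maps onto an attribute of the Tokenizer object. A sketch of wiring the stages together; the special-token IDs are placeholders and the model still needs training before it can encode anything:

from tokenizers import Tokenizer, normalizers, pre_tokenizers, processors
from tokenizers.models import WordPiece

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))      # tokenization model
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]
)                                                        # normalizer
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()    # pre-tokenizer
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],         # placeholder IDs
)                                                        # post-processor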
Architecture Diagrams
Where tokenization sits in the LLM stack
Raw Text
-> Normalizer
-> Tokenizer
-> Token IDs
-> Embedding Lookup
-> Transformer Blocks
-> Logits
-> Sampling / Decoding
-> Token IDs
-> Detokenizer
-> Output Text
Tokenizer training lifecycle
Corpus
-> Cleaning / Normalization
-> Pre-tokenization
-> Trainer (BPE / WordPiece / Unigram)
-> Vocab + Merges
-> Tokenizer Config
-> Evaluation
-> Release
Key Design Choices
Normalization
Common choices:
- Unicode normalization (NFC or NFKC)
- Lowercasing
- Accent stripping
- Standardizing whitespace
- Normalizing punctuation and quotes
These decisions change the vocabulary and can improve compression, but they may discard information that matters for names, code identifiers, or multilingual text.
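For example, a BERT-style chain (decompose, strip accents, lowercase) can be inspected directly with the tokenizers library's normalizers module; a minimal sketch:

from tokenizers import normalizers

# Decompose accented characters, drop the accents, then lowercase
normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.StripAccents(), normalizers.Lowercase()]
)
print(normalizer.normalize_str("Café Münster"))  # "cafe munster"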
Pre-tokenization
Typical strategies:
- Whitespace splitting
- Regex-based segmentation
- Byte-level splitting for Unicode safety
Pre-tokenization determines which character sequences are eligible for merges and heavily influences how tokens align to words.
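The difference is easy to see by running pre-tokenizers directly on a string; a small sketch with the tokenizers library:

from tokenizers import pre_tokenizers

text = "I love Nepal!"
print(pre_tokenizers.Whitespace().pre_tokenize_str(text))
# [('I', (0, 1)), ('love', (2, 6)), ('Nepal', (7, 12)), ('!', (12, 13))]
print(pre_tokenizers.ByteLevel().pre_tokenize_str(text))
# byte-level pieces, with 'Ġ' standing in for a leading space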
Vocabulary size
Tradeoffs:
- Smaller vocab: more tokens per sentence, but smaller embeddings
- Larger vocab: fewer tokens, but larger embeddings and more memory
A practical sweet spot often balances token count reduction with model size and training stability.
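A back-of-the-envelope calculation makes the tradeoff concrete (the sizes below are hypothetical): the input embedding matrix alone costs vocab_size x hidden_dim parameters, and an untied output projection roughly doubles that.

hidden_dim = 4096
for vocab_size in (32_000, 128_000):
    params = vocab_size * hidden_dim
    print(f"{vocab_size}: ~{params / 1e6:.0f}M embedding parameters")
# 32000 -> ~131M, 128000 -> ~524M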
Special tokens
Common tokens and why they exist:
"[PAD]"for batch padding"[UNK]"for unknown tokens (rare in byte-level models)"[CLS]"and"[SEP]"for classifier or segment boundaries"<BOS>"and"<EOS>"for generation control
Measuring Tokenizer Quality
Useful diagnostics:
- Fertility: average number of tokens per word on a representative corpus
- Percentage of unknown tokens (should be near zero for byte-level models)
- Compression ratio vs. raw text
- Error analysis on domain terms (code, biomedical, legal)
Better tokenization is not just fewer tokens. It is about stable segmentation, preserving meaning, and minimizing pathological splits for your domain.
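A minimal sketch of the first two diagnostics, assuming a pretrained tokenizer and a toy evaluation corpus; in practice the corpus should be a representative sample of your domain:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
corpus = ["Tokenization matters for LLMs.", "Observability improves reliability."]

total_tokens = total_words = total_chars = 0
for text in corpus:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    total_tokens += len(ids)
    total_words += len(text.split())
    total_chars += len(text)

print("fertility (tokens per word):", total_tokens / total_words)
print("compression (chars per token):", total_chars / total_tokens)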
Libraries and Ecosystem
- Hugging Face Tokenizers: Rust implementation, fast, modular, supports BPE/WordPiece/Unigram
- SentencePiece: C++ implementation, multilingual friendly, used widely in Google models
- tiktoken: optimized byte-level BPE for GPT-style models
Practical Python Examples
Inspect tokens, ids, and offsets
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization matters for LLMs."
encoding = tokenizer(
    text,
    return_offsets_mapping=True,  # requires a fast (Rust-backed) tokenizer
    add_special_tokens=False,
)
print(encoding.tokens())           # subword strings
print(encoding["input_ids"])       # integer IDs
print(encoding["offset_mapping"])  # character spans in the original text
Train BPE (Hugging Face)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace and punctuation before merging
trainer = BpeTrainer(vocab_size=5000)
tokenizer.train(["data.txt"], trainer)
output = tokenizer.encode("I love observability")
print(output.tokens)
print(output.ids)
Train WordPiece (Hugging Face tokenizers)
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # pre-split so merges stay within words
trainer = WordPieceTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(["data.txt"], trainer)
print(tokenizer.encode("playing football").tokens)
Load a pretrained tokenizer (Transformers)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I love Nepal")
print(encoded["input_ids"])
SentencePiece training
import sentencepiece as spm
spm.SentencePieceTrainer.train(
    input="data.txt",
    model_prefix="m",      # writes m.model and m.vocab
    vocab_size=8000,
    model_type="unigram",  # "bpe" is also supported
)
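Once training finishes, the resulting m.model file can be loaded and used for encoding; a short usage sketch:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("I love Nepal", out_type=str))  # pieces, e.g. ['▁I', '▁love', '▁Nepal']
print(sp.encode("I love Nepal"))                # corresponding IDs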
tiktoken for token counts
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("I love Nepal")
print(tokens)
print("num_tokens:", len(tokens))
Add special tokens and resize embeddings
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
specials = {"additional_special_tokens": ["<TOOL>", "</TOOL>"]}
num_added = tokenizer.add_special_tokens(specials)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to cover the new tokens
print("added:", num_added)
Pros and Cons
| Method | Pros | Cons |
|---|---|---|
| Word-level | Simple | Huge vocab, OOV issues |
| BPE | Efficient, stable | Greedy merging |
| WordPiece | Probabilistic | Slightly slower |
| Unigram | Flexible | Training complexity |
| Byte-level BPE | No OOV, universal | Longer sequences, especially for rare scripts |
Tokenization and Transformers
Attention cost
Self-attention complexity is O(n^2) in the number of tokens n. More tokens increase compute cost, so better compression directly improves efficiency.
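A rough illustration with hypothetical token counts for the same document under two tokenizers: a 1.5x reduction in tokens cuts attention work by roughly 2.25x.

n_weak, n_strong = 1200, 800            # hypothetical token counts for the same document
print((n_weak ** 2) / (n_strong ** 2))  # ~2.25x more attention compute for the weaker tokenizer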
Embedding impact
Vocabulary size influences:
- Embedding matrix size
- Model parameter count
- Memory usage
Common Pitfalls
- Training the tokenizer on a dataset that does not match the deployment domain
- Over-normalizing and losing meaningful casing or punctuation
- Very large vocabularies that bloat embeddings without reducing tokens enough
- Inconsistent pre-tokenization between training and inference
- Ignoring multilingual scripts or mixed-language inputs
Practical Heuristics
- Measure tokens per document before and after any tokenizer change
- Check segmentation of domain-specific words and code identifiers
- Keep a small reserved space for new special tokens
- Prefer byte-level models when unknown tokens are unacceptable
- Re-evaluate tokenizer choices when the product domain shifts
Future of Tokenization
- Token-free models
- Dynamic segmentation
- Neural compression
- Joint tokenizer and model learning
- Longer context windows
Final Takeaways
Tokenization is the foundation of:
- Compression
- Segmentation
- Vocabulary construction
- Transformer memory efficiency
Choosing the right tokenizer affects the entire model pipeline.