Tokenization in Modern NLP and LLMs
Tokenization converts raw text into discrete units (tokens) and maps them to integer IDs. This step determines how efficiently a model uses its context window, how large embeddings need to be, and how well the system generalizes to new inputs.
Introduction
Neural networks cannot process raw text directly. A tokenizer transforms text like:
"I love Nepal"
into a sequence of IDs such as:
[1045, 2293, 8224, ...]
Tokenization affects:
- Context window usage
- Memory cost
- Model size and capacity
- Generalization to rare or new words
Evolution of Tokenization
Word-level
"I love Nepal" -> ["I", "love", "Nepal"]
Problems:
- Large vocabulary
- Out-of-vocabulary (OOV) words
- Poor generalization
Subword tokenization
observability -> observ + ability
Benefits:
- Smaller vocabulary
- Handles unseen words
- Better compression
Byte-level tokenization
Tokenization can operate directly on UTF-8 bytes.
Advantages:
- No unknown tokens
- Handles emojis and rare scripts
- Language-agnostic behavior
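As a quick illustration of why byte-level schemes never hit unknown tokens, any string decomposes into UTF-8 bytes with plain Python (the example string is arbitrary):

text = "Nepal 🙂"
print(list(text.encode("utf-8")))  # the emoji alone contributes four bytes: 240, 159, 153, 130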
Major Algorithms
Byte Pair Encoding (BPE)
Training algorithm:
- Start with characters
- Count most frequent adjacent pairs
- Merge the most frequent pair
- Repeat until the target vocabulary size is reached
Example merges:
Initial: l o w l o w e r l o w e s t
Merge (l, o) -> lo
Merge (lo, w) -> low
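The merge loop can be sketched in a few lines of plain Python. The word frequencies below are hypothetical; the procedure mirrors the steps above: count adjacent pairs, merge the most frequent one, and repeat.

from collections import Counter

# Toy corpus: word (as a tuple of symbols) -> frequency (hypothetical counts)
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "w", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    # Rewrite every word with the chosen pair fused into one symbol
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(2):  # two merge steps: (l, o) -> lo, then (lo, w) -> low
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print("merged", pair)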
WordPiece
Used in BERT-style models.
Differences from BPE:
- BPE merges the most frequent pairs
- WordPiece merges the pair that maximizes likelihood improvement
Example:
playing -> play + ##ing
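One commonly described approximation of the WordPiece criterion scores a candidate merge by pair frequency divided by the product of its parts' frequencies, which tracks the likelihood gain from merging. The counts below are hypothetical; the sketch shows how an informative pair can beat a merely frequent one.

# Hypothetical counts: ("th", "##e") is far more frequent, but ("play", "##ing") scores higher
pair_freq = {("play", "##ing"): 30, ("th", "##e"): 500}
unit_freq = {"play": 40, "##ing": 60, "th": 900, "##e": 1200}

def score(pair):
    a, b = pair
    return pair_freq[pair] / (unit_freq[a] * unit_freq[b])

best = max(pair_freq, key=score)
print(best, score(best))  # ('play', '##ing') wins despite the lower raw frequency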
Unigram language model
Used in T5-style models.
Process:
- Start with a large candidate vocabulary
- Iteratively remove the tokens whose removal hurts the corpus likelihood the least
- Keep the most probable tokens
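The pruning loop is handled internally by libraries. A minimal Unigram training sketch with the Hugging Face tokenizers library, assuming a local data.txt file; the library runs the likelihood estimation and pruning for you:

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())        # starts empty; the trainer builds and prunes the vocab
tokenizer.pre_tokenizer = Whitespace()
trainer = UnigramTrainer(vocab_size=8000, unk_token="<unk>", special_tokens=["<unk>"])
tokenizer.train(["data.txt"], trainer)
print(tokenizer.encode("observability").tokens)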
SentencePiece
SentencePiece supports both BPE and Unigram and avoids pre-splitting on whitespace.
Example:
I love Nepal -> ▁I ▁love ▁Nepal (the ▁ metasymbol marks whitespace)
Tokenization Pipeline
Raw Text
-> Normalizer
-> PreTokenizer
-> Tokenization Model (BPE / WordPiece / Unigram)
-> PostProcessor
-> Token IDs
-> Embedding Lookup
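With the Hugging Face tokenizers library, each stage maps onto an attribute of the Tokenizer object. A sketch of wiring the stages together; the special-token IDs are placeholders and the model still needs training before it can encode anything:

from tokenizers import Tokenizer, normalizers, pre_tokenizers, processors
from tokenizers.models import WordPiece

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))      # tokenization model
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]
)                                                        # normalizer
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()    # pre-tokenizer
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],         # placeholder IDs
)                                                        # post-processor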
Architecture Diagrams
Where tokenization sits in the LLM stack
Raw Text
-> Normalizer
-> Tokenizer
-> Token IDs
-> Embedding Lookup
-> Transformer Blocks
-> Logits
-> Sampling / Decoding
-> Token IDs
-> Detokenizer
-> Output Text
Tokenizer training lifecycle
Corpus
-> Cleaning / Normalization
-> Pre-tokenization
-> Trainer (BPE / WordPiece / Unigram)
-> Vocab + Merges
-> Tokenizer Config
-> Evaluation
-> Release
Key Design Choices
Normalization
Common choices:
- Unicode normalization (NFC or NFKC)
- Lowercasing
- Accent stripping
- Standardizing whitespace
- Normalizing punctuation and quotes
These decisions change the vocabulary and can improve compression, but they may discard information that matters for names, code identifiers, or multilingual text.
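For example, a BERT-style chain (decompose, strip accents, lowercase) can be inspected directly with the tokenizers library's normalizers module; a minimal sketch:

from tokenizers import normalizers

# Decompose accented characters, drop the accents, then lowercase
normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.StripAccents(), normalizers.Lowercase()]
)
print(normalizer.normalize_str("Café Münster"))  # "cafe munster"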
Pre-tokenization
Typical strategies:
- Whitespace splitting
- Regex-based segmentation
- Byte-level splitting for Unicode safety
Pre-tokenization determines which character sequences are eligible for merges and heavily influences how tokens align to words.
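The difference is easy to see by running pre-tokenizers directly on a string; a small sketch with the tokenizers library:

from tokenizers import pre_tokenizers

text = "I love Nepal!"
print(pre_tokenizers.Whitespace().pre_tokenize_str(text))
# [('I', (0, 1)), ('love', (2, 6)), ('Nepal', (7, 12)), ('!', (12, 13))]
print(pre_tokenizers.ByteLevel().pre_tokenize_str(text))
# byte-level pieces, with 'Ġ' standing in for a leading space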
Vocabulary size
Tradeoffs:
- Smaller vocab: more tokens per sentence, but smaller embeddings
- Larger vocab: fewer tokens, but larger embeddings and more memory
A practical sweet spot often balances token count reduction with model size and training stability.
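A back-of-the-envelope calculation makes the tradeoff concrete (the sizes below are hypothetical): the input embedding matrix alone costs vocab_size x hidden_dim parameters, and an untied output projection roughly doubles that.

hidden_dim = 4096
for vocab_size in (32_000, 128_000):
    params = vocab_size * hidden_dim
    print(f"{vocab_size}: ~{params / 1e6:.0f}M embedding parameters")
# 32000 -> ~131M, 128000 -> ~524M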
Special tokens
Common tokens and why they exist:
"[PAD]"for batch padding"[UNK]"for unknown tokens (rare in byte-level models)"[CLS]"and"[SEP]"for classifier or segment boundaries"<BOS>"and"<EOS>"for generation control
Measuring Tokenizer Quality
Useful diagnostics:
- Fertility: average number of tokens per word on a representative corpus
- Percentage of unknown tokens (should be near zero for byte-level models)
- Compression ratio vs. raw text
- Error analysis on domain terms (code, biomedical, legal)
Better tokenization is not just fewer tokens. It is about stable segmentation, preserving meaning, and minimizing pathological splits for your domain.
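A minimal sketch of the first two diagnostics, assuming a pretrained tokenizer and a toy evaluation corpus; in practice the corpus should be a representative sample of your domain:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
corpus = ["Tokenization matters for LLMs.", "Observability improves reliability."]

total_tokens = total_words = total_chars = 0
for text in corpus:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    total_tokens += len(ids)
    total_words += len(text.split())
    total_chars += len(text)

print("fertility (tokens per word):", total_tokens / total_words)
print("compression (chars per token):", total_chars / total_tokens)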
Libraries and Ecosystem
- Hugging Face Tokenizers: Rust implementation, fast, modular, supports BPE/WordPiece/Unigram
- SentencePiece: C++ implementation, multilingual friendly, used widely in Google models
- tiktoken: optimized byte-level BPE for GPT-style models
Practical Python Examples
Inspect tokens, ids, and offsets
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization matters for LLMs."
encoding = tokenizer(
    text,
    return_offsets_mapping=True,  # requires a fast (Rust-backed) tokenizer
    add_special_tokens=False,
)
print(encoding.tokens())           # subword strings
print(encoding["input_ids"])       # integer IDs
print(encoding["offset_mapping"])  # character spans in the original text
Train BPE (Hugging Face)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace and punctuation before merging
trainer = BpeTrainer(vocab_size=5000)
tokenizer.train(["data.txt"], trainer)
output = tokenizer.encode("I love observability")
print(output.tokens)
print(output.ids)
Train WordPiece (Hugging Face tokenizers)
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # pre-split so merges stay within words
trainer = WordPieceTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(["data.txt"], trainer)
print(tokenizer.encode("playing football").tokens)
Load a pretrained tokenizer (Transformers)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I love Nepal")
print(encoded["input_ids"])
SentencePiece training
import sentencepiece as spm
spm.SentencePieceTrainer.train(
    input="data.txt",
    model_prefix="m",      # writes m.model and m.vocab
    vocab_size=8000,
    model_type="unigram",  # "bpe" is also supported
)
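Once training finishes, the resulting m.model file can be loaded and used for encoding; a short usage sketch:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("I love Nepal", out_type=str))  # pieces, e.g. ['▁I', '▁love', '▁Nepal']
print(sp.encode("I love Nepal"))                # corresponding IDs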
tiktoken for token counts
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("I love Nepal")
print(tokens)
print("num_tokens:", len(tokens))
Add special tokens and resize embeddings
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
specials = {"additional_special_tokens": ["<TOOL>", "</TOOL>"]}
num_added = tokenizer.add_special_tokens(specials)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to cover the new tokens
print("added:", num_added)
Pros and Cons
| Method | Pros | Cons |
|---|---|---|
| Word-level | Simple | Huge vocab, OOV issues |
| BPE | Efficient, stable | Greedy merging |
| WordPiece | Probabilistic | Slightly slower |
| Unigram | Flexible | Training complexity |
| Byte-level BPE | No OOV, universal | Longer sequences, especially for rare scripts |
Tokenization and Transformers
Attention cost
Self-attention complexity is O(n^2) in the number of tokens n. More tokens increase compute cost, so better compression directly improves efficiency.
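A rough illustration with hypothetical token counts for the same document under two tokenizers: a 1.5x reduction in tokens cuts attention work by roughly 2.25x.

n_weak, n_strong = 1200, 800            # hypothetical token counts for the same document
print((n_weak ** 2) / (n_strong ** 2))  # ~2.25x more attention compute for the weaker tokenizer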
Embedding impact
Vocabulary size influences:
- Embedding matrix size
- Model parameter count
- Memory usage
Common Pitfalls
- Training the tokenizer on a dataset that does not match the deployment domain
- Over-normalizing and losing meaningful casing or punctuation
- Very large vocabularies that bloat embeddings without reducing tokens enough
- Inconsistent pre-tokenization between training and inference
- Ignoring multilingual scripts or mixed-language inputs
Practical Heuristics
- Measure tokens per document before and after any tokenizer change
- Check segmentation of domain-specific words and code identifiers
- Keep a small reserved space for new special tokens
- Prefer byte-level models when unknown tokens are unacceptable
- Re-evaluate tokenizer choices when the product domain shifts
Future of Tokenization
- Token-free models
- Dynamic segmentation
- Neural compression
- Joint tokenizer and model learning
- Longer context windows
Final Takeaways
Tokenization is the foundation of:
- Compression
- Segmentation
- Vocabulary construction
- Transformer memory efficiency
Choosing the right tokenizer affects the entire model pipeline.