Feb 4, 2026

Tokenization in Modern NLP and LLMs

Tokenization converts raw text into discrete units (tokens) and maps them to integer IDs. This step determines how efficiently a model uses its context window, how large embeddings need to be, and how well the system generalizes to new inputs.

Introduction

Neural networks cannot process raw text directly. A tokenizer transforms text like:

"I love Nepal"

into a sequence of IDs such as:

[1045, 2293, 8224, ...]

Tokenization affects:

  • Context window usage
  • Memory cost
  • Model size and capacity
  • Generalization to rare or new words

Evolution of Tokenization

Word-level

"I love Nepal" -> ["I", "love", "Nepal"]

Problems:

  • Large vocabulary
  • Out-of-vocabulary (OOV) words
  • Poor generalization
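
A minimal sketch of a word-level tokenizer (with a tiny, made-up vocabulary) makes the OOV problem concrete:

vocab = {"I": 0, "love": 1, "Nepal": 2}  # hypothetical word-level vocabulary

def word_tokenize(text):
    # Any word missing from the vocabulary has no ID: the OOV problem
    return [vocab.get(word, "<OOV>") for word in text.split()]

print(word_tokenize("I love Nepal"))      # [0, 1, 2]
print(word_tokenize("I love Kathmandu"))  # [0, 1, '<OOV>']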

Subword tokenization

observability -> observ + ability

Benefits:

  • Smaller vocabulary
  • Handles unseen words
  • Better compression

Byte-level tokenization

Tokenization can operate directly on UTF-8 bytes.

Advantages:

  • No unknown tokens
  • Handles emojis and rare scripts
  • Language-agnostic behavior
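
A quick check in plain Python shows why: UTF-8 turns any string, including emoji, into bytes in the range 0-255, so a byte-level vocabulary can always represent the input:

text = "I love Nepal 🇳🇵"

# Every character, including the flag emoji, becomes one or more bytes in 0-255
byte_values = list(text.encode("utf-8"))
print(byte_values)
print("num_bytes:", len(byte_values))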

Major Algorithms

Byte Pair Encoding (BPE)

Training algorithm:

  1. Start with characters
  2. Count most frequent adjacent pairs
  3. Merge the most frequent pair
  4. Repeat until the vocab size is reached

Example merges:

Initial: l o w l o w e r l o w e s t

Merge (l, o) -> lo
Merge (lo, w) -> low
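
A minimal sketch of one training step on this toy corpus, counting adjacent pairs and applying the most frequent merge (real implementations also track word frequencies and record the merge table):

from collections import Counter

# Toy corpus as lists of symbols, starting from characters
words = [list("low"), list("lower"), list("lowest")]

# Count adjacent symbol pairs across all words
pair_counts = Counter()
for word in words:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

best = pair_counts.most_common(1)[0][0]
print("most frequent pair:", best)  # ('l', 'o'), tied with ('o', 'w') at count 3

# Apply the merge: replace every occurrence of the pair with a single symbol
def merge(word, pair):
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

words = [merge(w, best) for w in words]
print(words)  # [['lo', 'w'], ['lo', 'w', 'e', 'r'], ['lo', 'w', 'e', 's', 't']]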

WordPiece

Used in BERT-style models.

Differences from BPE:

  • BPE merges the most frequent pairs
  • WordPiece merges the pair that maximizes likelihood improvement

Example:

playing -> play + ##ing
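
The selection rule is commonly described as score(pair) = freq(pair) / (freq(first) * freq(second)). A toy sketch with made-up counts shows how a strongly associated pair can beat a merely frequent one:

# Hypothetical corpus counts, purely for illustration
pair_freq = {("play", "##ing"): 40, ("th", "##e"): 500}
unit_freq = {"play": 50, "##ing": 60, "th": 5000, "##e": 9000}

def wordpiece_score(pair):
    a, b = pair
    # Frequent pairs made of otherwise rare units score highest
    return pair_freq[pair] / (unit_freq[a] * unit_freq[b])

for pair in pair_freq:
    print(pair, wordpiece_score(pair))
# ("play", "##ing") wins even though ("th", "##e") occurs more often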

Unigram language model

Used in T5-style models.

Process:

  1. Start with a large candidate vocabulary
  2. Remove tokens that reduce likelihood the least
  3. Keep the most probable tokens
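
At segmentation time, a unigram tokenizer scores every possible split by the product of its token probabilities and keeps the best one. A minimal sketch with a hypothetical vocabulary:

import math

# Hypothetical token probabilities, purely for illustration
vocab = {"observ": 0.02, "ability": 0.03, "observability": 0.0001,
         "o": 0.05, "b": 0.05, "s": 0.05, "e": 0.05, "r": 0.05, "v": 0.05,
         "a": 0.05, "i": 0.05, "l": 0.05, "t": 0.05, "y": 0.05}

def best_segmentation(text):
    # Dynamic programming: best[i] holds (log-prob, tokens) for text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][1] is not None:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1]

score, tokens = best_segmentation("observability")
print(tokens, score)  # ['observ', 'ability'] beats the whole-word and character-level splits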

SentencePiece

SentencePiece supports both BPE and Unigram models and treats the input as a raw character stream instead of pre-splitting on whitespace, which makes it practical for languages that do not use spaces.

Example:

I love Nepal -> ▁I ▁love ▁Nepal (the ▁ character, U+2581, marks a preceding space)

Tokenization Pipeline

Raw Text
  -> Normalizer
  -> PreTokenizer
  -> Tokenization Model (BPE / WordPiece / Unigram)
  -> PostProcessor
  -> Token IDs
  -> Embedding Lookup
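
In Hugging Face tokenizers, each stage of this pipeline maps to an attribute on the Tokenizer object. A minimal sketch of wiring the stages together (untrained, for illustration only):

from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalizer: Unicode normalization plus lowercasing
tokenizer.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])

# PreTokenizer: split on whitespace and punctuation before the model runs
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# The model (BPE here) applies learned merges once trained; a post-processor,
# e.g. one that adds [CLS]/[SEP], can be attached via tokenizer.post_processor.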

Architecture Diagrams

Where tokenization sits in the LLM stack

Raw Text
  -> Normalizer
  -> Tokenizer
  -> Token IDs
  -> Embedding Lookup
  -> Transformer Blocks
  -> Logits
  -> Sampling / Decoding
  -> Token IDs
  -> Detokenizer
  -> Output Text

Tokenizer training lifecycle

Corpus
  -> Cleaning / Normalization
  -> Pre-tokenization
  -> Trainer (BPE / WordPiece / Unigram)
  -> Vocab + Merges
  -> Tokenizer Config
  -> Evaluation
  -> Release

Key Design Choices

Normalization

Common choices:

  • Unicode normalization (NFC or NFKC)
  • Lowercasing
  • Accent stripping
  • Standardizing whitespace
  • Normalizing punctuation and quotes

These decisions change the vocabulary and can improve compression, but they may also discard information that matters for names, code, or multilingual text.
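
The normalizers module in Hugging Face tokenizers can be applied directly to strings, which makes it easy to see what a given normalization chain throws away. A small sketch (the exact output depends on the chain you pick):

from tokenizers import normalizers

normalizer = normalizers.Sequence([
    normalizers.NFKD(),          # compatibility decomposition (also splits accents off)
    normalizers.StripAccents(),  # drop the combining accent marks
    normalizers.Lowercase(),     # drop casing information
])

print(normalizer.normalize_str("Café ＮＬＰ"))  # -> "cafe nlp"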

Pre-tokenization

Typical strategies:

  • Whitespace splitting
  • Regex-based segmentation
  • Byte-level splitting for Unicode safety

Pre-tokenization determines which character sequences are eligible for merges and heavily influences how tokens align to words.
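
Pre-tokenizers can be inspected on raw strings as well. The sketch below compares whitespace splitting with byte-level splitting on the same input:

from tokenizers import pre_tokenizers

text = "Let's tokenize naïve_code()"

# Whitespace() splits on whitespace and punctuation boundaries
print(pre_tokenizers.Whitespace().pre_tokenize_str(text))

# ByteLevel() first maps the text onto a byte alphabet, so any Unicode input stays representable
print(pre_tokenizers.ByteLevel().pre_tokenize_str(text))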

Vocabulary size

Tradeoffs:

  • Smaller vocab: more tokens per sentence, but smaller embeddings
  • Larger vocab: fewer tokens, but larger embeddings and more memory

A practical sweet spot often balances token count reduction with model size and training stability.
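
One way to feel this tradeoff is to count tokens for the same sentence under two byte-level BPE vocabularies of different sizes, for example GPT-2's roughly 50k vocabulary versus the roughly 100k cl100k_base vocabulary:

import tiktoken

text = "Tokenization tradeoffs: bigger vocabularies usually mean fewer tokens."

small_vocab = tiktoken.get_encoding("gpt2")         # ~50k entries
large_vocab = tiktoken.get_encoding("cl100k_base")  # ~100k entries

print("gpt2:", len(small_vocab.encode(text)))
print("cl100k_base:", len(large_vocab.encode(text)))
# The larger vocabulary typically needs fewer tokens for the same text,
# at the cost of a larger embedding matrix in the model.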

Special tokens

Common tokens and why they exist:

  • "[PAD]" for batch padding
  • "[UNK]" for unknown tokens (rare in byte-level models)
  • "[CLS]" and "[SEP]" for classifier or segment boundaries
  • "<BOS>" and "<EOS>" for generation control

Measuring Tokenizer Quality

Useful diagnostics:

  • Fertility: average number of tokens per word on a representative corpus
  • Percentage of unknown tokens (should be near zero for byte-level models)
  • Compression ratio (characters or bytes per token) vs. raw text
  • Error analysis on domain terms (code, biomedical, legal)

Better tokenization is not just fewer tokens. It is about stable segmentation, preserving meaning, and minimizing pathological splits for your domain.
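
A small helper like the sketch below (shown with bert-base-uncased, but any Hugging Face tokenizer works) reports fertility and a characters-per-token compression ratio for a sample corpus, which is usually enough for a first comparison between candidate tokenizers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Replace with a representative sample of your own domain text
corpus = [
    "Tokenization matters for LLMs.",
    "Kubernetes observability dashboards emit structured logs.",
]

def tokenizer_stats(texts):
    n_words = sum(len(t.split()) for t in texts)
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    return {
        "fertility (tokens/word)": n_tokens / n_words,
        "compression (chars/token)": n_chars / n_tokens,
    }

print(tokenizer_stats(corpus))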

Libraries and Ecosystem

  • Hugging Face Tokenizers: Rust implementation, fast, modular, supports BPE/WordPiece/Unigram
  • SentencePiece: C++ implementation, multilingual friendly, used widely in Google models
  • tiktoken: optimized byte-level BPE for GPT-style models

Practical Python Examples

Inspect tokens, ids, and offsets

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization matters for LLMs."

encoding = tokenizer(
    text,
    return_offsets_mapping=True,
    add_special_tokens=False,
)

print(encoding.tokens())
print(encoding["input_ids"])
print(encoding["offset_mapping"])

Train BPE (Hugging Face)

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation before learning merges
trainer = BpeTrainer(vocab_size=5000)

# "data.txt" stands in for your own plain-text training corpus
tokenizer.train(["data.txt"], trainer)

output = tokenizer.encode("I love observability")
print(output.tokens)
print(output.ids)

Train WordPiece (Hugging Face tokenizers)

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # same whitespace pre-tokenization as above
trainer = WordPieceTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

tokenizer.train(["data.txt"], trainer)
print(tokenizer.encode("playing football").tokens)

Load a pretrained tokenizer (Transformers)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I love Nepal")
print(encoded["input_ids"])

SentencePiece training

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data.txt",
    model_prefix="m",
    vocab_size=8000,
    model_type="unigram",
)
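
Loading the trained model (model_prefix="m" writes m.model and m.vocab) and encoding a sentence is then one more call:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("I love Nepal", out_type=str))  # pieces, with ▁ marking word starts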

tiktoken for token counts

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("I love Nepal")
print(tokens)
print("num_tokens:", len(tokens))

Add special tokens and resize embeddings

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

specials = {"additional_special_tokens": ["<TOOL>", "</TOOL>"]}
num_added = tokenizer.add_special_tokens(specials)
model.resize_token_embeddings(len(tokenizer))

print("added:", num_added)

Pros and Cons

Method          | Pros               | Cons
Word-level      | Simple             | Huge vocab, OOV issues
BPE             | Efficient, stable  | Greedy merging
WordPiece       | Probabilistic      | Slightly slower
Unigram         | Flexible           | Training complexity
Byte-level BPE  | No OOV, universal  | More tokens for clean text

Tokenization and Transformers

Attention cost

Self-attention scales as O(n^2) in the number of tokens, so every extra token increases compute cost more than linearly; better compression directly improves efficiency.
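
Because the cost is quadratic, even modest compression gains compound. A quick calculation with made-up token counts:

# Hypothetical token counts for the same document under two tokenizers
tokens_a = 1000
tokens_b = 750  # a tokenizer with ~25% better compression

# Self-attention cost scales with n^2, so the cost ratio is (n_b / n_a)^2
relative_cost = (tokens_b / tokens_a) ** 2
print(f"relative attention cost: {relative_cost:.2f}")  # 0.56 -> roughly 44% less compute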

Embedding impact

Vocabulary size influences:

  • Embedding matrix size
  • Model parameter count
  • Memory usage
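
As a rough arithmetic check, the input embedding matrix alone has vocab_size x hidden_size parameters; for GPT-2 small (vocabulary 50,257, hidden size 768) that is already close to 39M parameters:

# Embedding parameters = vocabulary size x hidden dimension
vocab_size = 50_257  # GPT-2 byte-level BPE vocabulary
hidden_size = 768    # GPT-2 small hidden dimension

embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,} parameters")  # 38,597,376 (~38.6M)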

Common Pitfalls

  • Training the tokenizer on a dataset that does not match the deployment domain
  • Over-normalizing and losing meaningful casing or punctuation
  • Very large vocabularies that bloat embeddings without reducing tokens enough
  • Inconsistent pre-tokenization between training and inference
  • Ignoring multilingual scripts or mixed-language inputs

Practical Heuristics

  • Measure tokens per document before and after any tokenizer change
  • Check segmentation of domain-specific words and code identifiers
  • Keep a small reserved space for new special tokens
  • Prefer byte-level models when unknown tokens are unacceptable
  • Re-evaluate tokenizer choices when the product domain shifts

Future of Tokenization

  • Token-free models
  • Dynamic segmentation
  • Neural compression
  • Joint tokenizer and model learning
  • Longer context windows

Final Takeaways

Tokenization is the foundation of:

  • Compression
  • Segmentation
  • Vocabulary construction
  • Transformer memory efficiency

Choosing the right tokenizer affects the entire model pipeline.


Thanks for reading! If you want to see future content, you can follow me on Twitter or connect with me on LinkedIn.


Support My Content

If you find my content helpful, consider supporting a humanitarian cause I am planning: building homes for elderly people in the rural Terai region of Nepal. You can donate here:

Ethereum (ETH)

0xB62409A5B227D2aE7D8C66fdaA5EEf4eB4E37959

Thank you for your support!