Tokenization for LLMs: BPE, Unigram, SentencePiece, WordPiece
Tokenization governs vocabulary efficiency, multilingual coverage, and sequence length – often the hidden lever on cost and accuracy. The engineering challenge involves balancing vocabulary size against sequence length, handling diverse languages and scripts uniformly, managing out-of-vocabulary terms gracefully, and ensuring tokenization doesn't introduce biases or inefficiencies that cascade through model training and inference.
Explained for People Without an AI Background
- Tokenization is like deciding how to break a book into chunks for processing - you could use letters (too many pieces), whole words (too many unique items), or something in between like syllables. Each choice affects how much the model needs to read, how well it handles new words, and how expensive it is to process different languages.
Why Tokenization Matters More Than Expected
Tokenization silently determines model economics and capabilities. Poor tokenization can double sequence lengths, directly doubling compute costs. A tokenizer treating "don't" as ["don", "'", "t"] wastes capacity and breaks semantic units. Multilingual tokenizers that over-fragment non-English text create systematic disadvantages. The same 1000-word English document might be 1,500 tokens while identical Chinese content becomes 3,000 tokens, doubling processing costs for the same information.
Byte Pair Encoding (BPE) Fundamentals
BPE builds vocabularies by iteratively merging frequent byte pairs. Starting with individual bytes, it identifies the most common adjacent pair, merges it into a new token, and repeats until reaching the target vocabulary size. GPT models use BPE variants, with vocabularies ranging from roughly 50k (GPT-2, GPT-3) to 100k+ in newer models. The algorithm naturally handles morphology - "running" might tokenize as ["run", "ning"], capturing root and suffix separately, enabling generalization to "runner" or "runs".
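To make the merge mechanics concrete, here is a minimal Python sketch that applies an already-learned merge list to a single word. The merge list is hypothetical and simplified; real tokenizers operate on byte sequences with tens of thousands of ranked merges, but the mechanism is the same.

```python
# Minimal sketch of applying an already-learned BPE merge list to one word.
# The merge list below is hypothetical; real tokenizers work on bytes with thousands of merges.
def bpe_encode(word, merges):
    symbols = list(word)                           # start from individual characters
    for left, right in merges:                     # earlier merges have higher priority
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge the adjacent pair in place
            else:
                i += 1
    return symbols

merges = [("r", "u"), ("ru", "n"), ("n", "i"), ("ni", "n"), ("nin", "g")]
print(bpe_encode("running", merges))               # ['run', 'ning']
```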
BPE Training Process
Training BPE requires a representative corpus and frequency statistics. The algorithm counts byte pair frequencies, merges the highest, updates frequencies, and iterates. A 50k vocabulary requires roughly 50k merge operations (less the 256-entry base byte vocabulary and any special tokens). The merge order defines the vocabulary - early merges capture common character patterns, while later merges assemble whole frequent words. Pre-tokenization (splitting on whitespace) affects learned vocabularies significantly, with space-handling determining how models see word boundaries.
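A training run of this kind can be sketched with the Hugging Face `tokenizers` library; the corpus path, vocabulary size, and special tokens below are placeholders, not recommendations.

```python
# Sketch of training a BPE tokenizer with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # pre-tokenization defines word boundaries

trainer = trainers.BpeTrainer(vocab_size=50_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # counts pairs, merges, iterates
tokenizer.save("bpe-50k.json")
```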
Unigram Language Model Tokenization
Unigram tokenization treats tokenization as probabilistic language modeling. Starting with a large candidate vocabulary, it iteratively prunes tokens that minimally impact likelihood. Each token receives a probability, and tokenization selects the most probable segmentation. This principled approach often yields better compression than BPE. T5 and mT5 use Unigram, showing particular strength on morphologically rich languages.
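The segmentation step can be illustrated with a small Viterbi-style dynamic program that picks the split maximizing the sum of token log-probabilities. The vocabulary and probabilities below are invented for illustration; real Unigram training also re-estimates these probabilities with EM while pruning the vocabulary.

```python
import math

# Toy Unigram segmentation: choose the split that maximizes the sum of token log-probs.
# Vocabulary and probabilities are invented; real models learn them with EM.
vocab_logp = {"run": math.log(0.04), "ning": math.log(0.01),
              "runn": math.log(0.001), "ing": math.log(0.03)}
vocab_logp.update({c: math.log(0.001) for c in "rungi"})

def segment(text):
    # best[i] = (score, tokens) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab_logp and best[start][1] is not None:
                score = best[start][0] + vocab_logp[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(segment("running"))   # ['run', 'ning'] under these made-up probabilities
```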
SentencePiece: Language-Agnostic Processing
SentencePiece treats text as a raw stream of Unicode characters, whitespace included, eliminating language-specific preprocessing. It encodes spaces explicitly with the ▁ meta symbol, enabling true language independence. Both BPE and Unigram algorithms work within SentencePiece. The library includes critical features: reversible tokenization preserving original text, handling of control codes, and robust unknown token management. Most multilingual models use SentencePiece for its consistency across scripts.
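A typical workflow with the `sentencepiece` Python package looks roughly like this; the corpus path, vocabulary size, and coverage setting are illustrative rather than recommendations.

```python
# Sketch using the `sentencepiece` package; file names and settings are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp_unigram",
    vocab_size=32_000, model_type="unigram",   # "bpe" is also supported
    character_coverage=0.9995,                 # common setting for multilingual corpora
)

sp = spm.SentencePieceProcessor(model_file="sp_unigram.model")
pieces = sp.encode("Hello world", out_type=str)   # e.g. ['▁Hello', '▁world']
text = sp.decode(pieces)                          # reversible: recovers "Hello world"
```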
WordPiece and BERT-Style Tokenization
WordPiece, popularized by BERT, merges subwords much like BPE but selects the merge that maximizes training-data likelihood rather than raw pair frequency. The distinctive ## prefix marks non-initial subwords - "running" becomes ["run", "##ning"]. WordPiece vocabularies typically range from 30k to 50k tokens. While effective for BERT-style models, WordPiece sees less use in modern autoregressive LLMs, with most preferring BPE or Unigram variants.
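At inference time, WordPiece segmentation is a greedy longest-prefix match. The sketch below uses a tiny, hypothetical vocabulary to show how the ## continuation marker appears.

```python
# Sketch of WordPiece inference: greedy longest-prefix matching with "##" marking
# non-initial pieces. The tiny vocabulary is hypothetical.
vocab = {"run", "##ning", "##ner", "runs", "[UNK]"}

def wordpiece_encode(word):
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:                        # try the longest remaining prefix first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece              # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                      # the whole word falls back to [UNK]
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_encode("running"))                # ['run', '##ning']
```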
Vocabulary Size Tradeoffs
Vocabulary size creates fundamental tradeoffs. Large vocabularies (100k+) reduce sequence lengths but increase model parameters and softmax computation. Small vocabularies (~10k) minimize parameters but create longer sequences. The embedding table size equals vocab_size × hidden_dim - at 768 dimensions, a 100k vocabulary needs roughly 300MB for fp32 embeddings alone. Many models settle in the 30-50k range as a practical balance of efficiency and coverage.
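The embedding-memory arithmetic is easy to check directly (fp32 assumed; halve for fp16/bf16, and double again if input and output embeddings are untied):

```python
# Back-of-the-envelope embedding memory: vocab_size x hidden_dim x bytes per parameter.
def embedding_mb(vocab_size, hidden_dim, bytes_per_param=4):  # 4 bytes for fp32
    return vocab_size * hidden_dim * bytes_per_param / 1e6

print(embedding_mb(100_000, 768))   # ~307 MB
print(embedding_mb(32_000, 768))    # ~98 MB
```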
Character and Byte-Level Approaches
Character-level tokenization uses individual characters as tokens, typically yielding 100-300 token vocabularies. While parameter-efficient and handling any text, character tokenization creates very long sequences - 5-10x word-level length. Byte-level tokenization goes further, using raw bytes (256 tokens), guaranteeing no out-of-vocabulary issues. Recent models like ByT5 explore byte-level processing, trading sequence length for ultimate flexibility.
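A quick comparison shows how sequence length grows as the unit shrinks, and why byte-level handling of non-ASCII text is more expensive still:

```python
# Sequence length at different granularities for the same string.
text = "Tokenization governs cost."
words = text.split()                    # word-level: 3 pieces, but an unbounded vocabulary
chars = list(text)                      # character-level: 26 pieces
byte_seq = list(text.encode("utf-8"))   # byte-level: 26 pieces here (ASCII), never out-of-vocabulary

print(len(words), len(chars), len(byte_seq))          # 3 26 26
print(len("日本語"), len("日本語".encode("utf-8")))   # 3 characters become 9 bytes in UTF-8
```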
Multilingual Tokenization Challenges
Multilingual tokenizers must balance coverage across scripts. A tokenizer optimized for English might fall back to byte-level pieces for Chinese characters, creating sequences roughly 3x longer. Conversely, dedicating tokens to rare scripts wastes vocabulary space. Modern approaches allocate vocabulary proportionally to training data, accept some length disparities, and use script-specific preprocessing. Models like mT5 and BLOOM show relatively balanced efficiency across languages.
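One practical sanity check is to compare token counts ("fertility") for parallel sentences. The sketch below uses the `tiktoken` library purely as an example encoder; the exact counts depend entirely on the vocabulary in question.

```python
# Rough fertility check: tokens per sentence for parallel text in two languages.
# `tiktoken` serves only as an example encoder; counts vary by vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {"en": "The weather is very nice today.",
           "zh": "今天天气很好。"}
for lang, text in samples.items():
    print(lang, len(enc.encode(text)))
```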
Tokenization and Model Performance
Tokenization directly impacts model capabilities. Arithmetic suffers when numbers split unpredictably - "1234" might become ["12", "34"] while "1235" becomes ["1", "235"]. Code understanding degrades when syntax tokens fragment. Chemical formulas and URLs often tokenize poorly. These issues compound during training, with models learning spurious patterns from tokenization artifacts rather than semantic content.
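It is worth inspecting how your tokenizer actually splits numbers and other structured strings, again using `tiktoken` only as an example encoder:

```python
# Inspect how numbers split; token boundaries often shift between nearby values.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for n in ["1234", "1235", "123456"]:
    ids = enc.encode(n)
    print(n, [enc.decode([i]) for i in ids])   # prints each number's constituent pieces
```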
Special Tokens and Control Codes
Production tokenizers reserve tokens for special purposes: [PAD] for batching, [CLS]/[SEP] for structure, [MASK] for training, and task-specific markers. Chat models use tokens like <|im_start|> for conversation structure. These tokens must be handled carefully - accidentally training on padding tokens or exposing control tokens to users causes failures. Proper special token management requires coordination across training, fine-tuning, and inference.
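With Hugging Face `transformers`, registering new control tokens and resizing the embedding table to match usually looks something like the following sketch (the model and token names are illustrative):

```python
# Sketch of registering chat-control tokens and keeping the embedding table in sync.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"],
                              "pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))   # vocab size and embedding rows must match
```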
Tokenizer Training Best Practices
Training robust tokenizers requires diverse, representative data - not just web crawls but code, scientific text, and multilingual content. Normalization choices (NFC/NFD Unicode forms, lowercasing) permanently affect model behavior. Pre-tokenization rules determine boundary handling. Vocabulary size should account for embedding table memory. Coverage analysis ensures important domains aren't fragmented. Many production failures trace back to tokenizer training oversights.
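With the `tokenizers` library, normalization can be made explicit so the choice is at least documented and reproducible; the example below contrasts NFC with the folding that NFKC would perform.

```python
# Making normalization explicit with the Hugging Face `tokenizers` library.
from tokenizers import normalizers

normalizer = normalizers.Sequence([normalizers.NFC(), normalizers.Lowercase()])
print(normalizer.normalize_str("Ｃａｆé ＡＢＣ"))  # full-width letters survive NFC; NFKC() would fold them to ASCII
```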
Fast Tokenization for Production
Production tokenizers must handle streaming text at high throughput. Rust implementations (such as HuggingFace tokenizers) can be up to ~100x faster than pure-Python tokenizers. Techniques include finite automata for matching, caching common sequences, and parallel processing. Batch tokenization amortizes overhead. For real-time applications, tokenization latency can dominate model inference time, making optimization critical.
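In practice this usually means loading a Rust-backed "fast" tokenizer and batching inputs, for example with `transformers` (the model name is just an example):

```python
# Batch tokenization with a Rust-backed "fast" tokenizer amortizes per-call overhead.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
batch = ["first document", "a much longer second document", "third"]
encoded = tokenizer(batch, padding=True, truncation=True, max_length=512)
print(len(encoded["input_ids"]))   # 3: one padded id sequence per input
```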
Common Tokenization Pitfalls
- Training the tokenizer on different data than the model sees: domain mismatch causes inefficient, fragmented tokenization.
- Ignoring tokenization in cost estimates: token counts rarely equal word counts.
- Inconsistent special-token handling: mismatches between training and inference cause errors.
- Poor multilingual balance: systematic cost and quality disadvantages for non-English users.
Related Deep Dives in This Series
- [LLM Foundations & Architecture: Overview](/llm-foundations-architecture-overview)
- [Training & Tuning LLMs: Pretraining, SFT, RLHF, DPO](/training-tuning-llms-pretraining-sft-rlhf-dpo)
- [Running LLMs in Production: Inference, Cost, Latency, Reliability](/llm-production-inference-cost-latency)