
Tokenization

Tokenization is the process of converting text into numerical tokens that language models can understand. LLMBuilder provides comprehensive tokenization tools with support for multiple algorithms and customization options.

🎯 Overview

Tokenization bridges the gap between human text and machine learning:

graph LR
    A["Hello world!"] --> B[Tokenizer]
    B --> C["[15496, 995, 0]"]
    C --> D[Model]
    D --> E[Predictions]
    E --> F[Detokenizer]
    F --> G["Generated text"]

    style A fill:#e1f5fe
    style C fill:#fff3e0
    style G fill:#e8f5e8
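
Concretely, the round trip looks like this (a minimal sketch using the Tokenizer API shown later on this page; the exact token IDs depend on the tokenizer you train):

from llmbuilder.tokenizer import Tokenizer

# Load a previously trained tokenizer (see the Quick Start below for how to train one)
tokenizer = Tokenizer.from_pretrained("./tokenizer")

text = "Hello world!"
tokens = tokenizer.encode(text)     # e.g. [15496, 995, 0] - IDs depend on your vocabulary
decoded = tokenizer.decode(tokens)  # "Hello world!"

print(tokens)
print(decoded)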

🔀 Tokenization Algorithms

LLMBuilder supports several tokenization algorithms:

Byte Pair Encoding (BPE)

BPE is the most popular algorithm for language models:

from llmbuilder.tokenizer import TokenizerTrainer
from llmbuilder.config import TokenizerConfig

# Configure BPE tokenizer
config = TokenizerConfig(
    vocab_size=16000,
    model_type="bpe",
    character_coverage=1.0,
    max_sentence_length=4096
)

# Train tokenizer
trainer = TokenizerTrainer(config=config)
trainer.train(
    input_file="training_data.txt",
    output_dir="./tokenizer",
    model_prefix="tokenizer"
)

Advantages:

  • Balances vocabulary size and coverage
  • Handles out-of-vocabulary words well (see the sketch after this list)
  • Works across multiple languages
  • Standard for most modern LLMs
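
For example, a word the tokenizer has never seen as a whole is split into known subword pieces rather than collapsing to <unk>. The sketch below assumes the BPE tokenizer trained above has been loaded; the exact split depends on your training data:

# Encode a rare word and inspect the subword pieces it falls back to
rare_word = "untokenizable"
token_ids = tokenizer.encode(rare_word)

# Decode each ID individually to see the pieces
pieces = [tokenizer.decode([tid]) for tid in token_ids]
print(pieces)  # e.g. ['un', 'token', 'izable'] - actual pieces vary with the trained vocabulary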

Unigram Language Model

A statistical approach that builds its vocabulary by optimizing a unigram language model over the training corpus:

config = TokenizerConfig(
    vocab_size=16000,
    model_type="unigram",
    character_coverage=0.9995,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>"
)

Advantages:

  • Probabilistically optimal vocabulary
  • Better handling of morphologically rich languages
  • Flexible subword boundaries

Word-Level Tokenization

Simple word-based tokenization:

config = TokenizerConfig(
    vocab_size=50000,
    model_type="word",
    lowercase=True,
    remove_accents=True
)

Use cases:

  • Small, domain-specific datasets
  • Languages with clear word boundaries
  • When interpretability is important

Character-Level Tokenization

Character-by-character tokenization:

config = TokenizerConfig(
    model_type="char",
    vocab_size=256,  # Usually small for character-level
    normalize=True
)

Use cases:

  • Very small datasets
  • Morphologically complex languages
  • When dealing with noisy text

🚀 Quick Start

CLI Training

# Train a BPE tokenizer
llmbuilder data tokenizer \
  --input training_data.txt \
  --output ./tokenizer \
  --vocab-size 16000 \
  --model-type bpe

Python API

from llmbuilder.tokenizer import train_tokenizer

# Train tokenizer with default settings
tokenizer = train_tokenizer(
    input_file="training_data.txt",
    output_dir="./tokenizer",
    vocab_size=16000,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"]
)

# Test the tokenizer
text = "Hello, world! This is a test."
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)

print(f"Original: {text}")
print(f"Tokens: {tokens}")
print(f"Decoded: {decoded}")

βš™οΈ Configuration Options

Basic Configuration

from llmbuilder.config import TokenizerConfig

config = TokenizerConfig(
    # Core settings
    vocab_size=16000,           # Vocabulary size
    model_type="bpe",           # Algorithm: bpe, unigram, word, char

    # Text preprocessing
    lowercase=False,            # Convert to lowercase
    remove_accents=False,       # Remove accent marks
    normalize=True,             # Unicode normalization

    # Coverage and quality
    character_coverage=1.0,     # Character coverage (0.0-1.0)
    max_sentence_length=4096,   # Maximum sentence length
    min_frequency=2,            # Minimum token frequency

    # Special tokens
    unk_token="<unk>",         # Unknown token
    bos_token="<s>",           # Beginning of sequence
    eos_token="</s>",          # End of sequence
    pad_token="<pad>",         # Padding token
    mask_token="<mask>",       # Masking token (for MLM)
)

Advanced Configuration

config = TokenizerConfig(
    # Algorithm-specific settings
    bpe_dropout=0.1,           # BPE dropout for regularization
    split_digits=True,         # Split numbers into digits
    split_by_whitespace=True,  # Pre-tokenize by whitespace
    split_by_punctuation=True, # Pre-tokenize by punctuation

    # Vocabulary control
    max_token_length=16,       # Maximum token length
    vocab_threshold=1e-6,      # Vocabulary pruning threshold
    shrinking_factor=0.75,     # Unigram shrinking factor

    # Training control
    num_threads=8,             # Number of training threads
    seed=42,                   # Random seed
    verbose=True               # Verbose training output
)

πŸŽ›οΈ Special Tokens

Special tokens serve specific purposes in language models:

Standard Special Tokens

special_tokens = {
    "<pad>": "Padding token for batch processing",
    "<unk>": "Unknown/out-of-vocabulary token",
    "<s>": "Beginning of sequence token",
    "</s>": "End of sequence token",
    "<mask>": "Masking token for masked language modeling"
}

config = TokenizerConfig(
    vocab_size=16000,
    special_tokens=list(special_tokens.keys())
)

Custom Special Tokens

# Add domain-specific special tokens
custom_tokens = [
    "<code>", "</code>",       # Code blocks
    "<math>", "</math>",       # Mathematical expressions
    "<user>", "<assistant>",   # Chat/dialogue markers
    "<system>",                # System messages
    "<tool_call>", "</tool_call>"  # Function calls
]

config = TokenizerConfig(
    vocab_size=16000,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"] + custom_tokens
)

Token ID Management

# Access special token IDs
tokenizer = train_tokenizer("data.txt", "./tokenizer", config=config)

print(f"PAD token ID: {tokenizer.pad_token_id}")
print(f"UNK token ID: {tokenizer.unk_token_id}")
print(f"BOS token ID: {tokenizer.bos_token_id}")
print(f"EOS token ID: {tokenizer.eos_token_id}")

# Custom token IDs
code_start_id = tokenizer.token_to_id("<code>")
code_end_id = tokenizer.token_to_id("</code>")

πŸ” Tokenizer Analysis

Vocabulary Analysis

from llmbuilder.tokenizer import analyze_tokenizer

# Analyze trained tokenizer
analysis = analyze_tokenizer("./tokenizer")

print(f"πŸ“Š Tokenizer Analysis:")
print(f"  Vocabulary size: {analysis.vocab_size:,}")
print(f"  Average token length: {analysis.avg_token_length:.2f}")
print(f"  Character coverage: {analysis.character_coverage:.4f}")
print(f"  Compression ratio: {analysis.compression_ratio:.2f}")

# Most common tokens
print(f"\nπŸ”€ Most common tokens:")
for token, freq in analysis.top_tokens[:10]:
    print(f"  '{token}': {freq:,}")

# Token length distribution
print(f"\nπŸ“ Token length distribution:")
for length, count in analysis.length_distribution.items():
    print(f"  {length} chars: {count:,} tokens")

Text Coverage Analysis

# Test tokenizer on sample texts
test_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence and machine learning.",
    "Code: print('Hello, world!')",
    "Mathematical equation: E = mcΒ²"
]

for text in test_texts:
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)

    print(f"\nText: {text}")
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"Decoded: {decoded}")
    print(f"Perfect reconstruction: {text == decoded}")

Compression Analysis

def analyze_compression(tokenizer, text_file):
    """Analyze tokenization compression."""
    with open(text_file, 'r', encoding='utf-8') as f:
        text = f.read()

    # Character count
    char_count = len(text)

    # Token count
    tokens = tokenizer.encode(text)
    token_count = len(tokens)

    # Compression ratio
    compression_ratio = char_count / token_count

    print(f"πŸ“Š Compression Analysis:")
    print(f"  Characters: {char_count:,}")
    print(f"  Tokens: {token_count:,}")
    print(f"  Compression ratio: {compression_ratio:.2f}")
    print(f"  Tokens per 1000 chars: {1000 / compression_ratio:.1f}")

    return compression_ratio

# Analyze compression
ratio = analyze_compression(tokenizer, "training_data.txt")

🔄 Tokenizer Usage

Basic Encoding/Decoding

from llmbuilder.tokenizer import Tokenizer

# Load trained tokenizer
tokenizer = Tokenizer.from_pretrained("./tokenizer")

# Encode text to tokens
text = "Hello, how are you today?"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode tokens back to text
decoded_text = tokenizer.decode(tokens)
print(f"Decoded: {decoded_text}")

# Encode with special tokens
tokens_with_special = tokenizer.encode(
    text,
    add_bos_token=True,  # Add beginning-of-sequence token
    add_eos_token=True   # Add end-of-sequence token
)
print(f"With special tokens: {tokens_with_special}")

Batch Processing

# Encode multiple texts
texts = [
    "First example text.",
    "Second example text.",
    "Third example text."
]

# Batch encode
batch_tokens = tokenizer.encode_batch(texts)
for i, tokens in enumerate(batch_tokens):
    print(f"Text {i+1}: {tokens}")

# Batch decode
batch_decoded = tokenizer.decode_batch(batch_tokens)
for i, text in enumerate(batch_decoded):
    print(f"Decoded {i+1}: {text}")

Padding and Truncation

# Encode with padding and truncation
tokens = tokenizer.encode(
    text,
    max_length=512,      # Maximum sequence length
    padding="max_length", # Pad to max_length
    truncation=True,     # Truncate if longer
    return_attention_mask=True  # Return attention mask
)

print(f"Tokens: {tokens['input_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")

🎯 Domain-Specific Tokenization

Code Tokenization

# Configure tokenizer for code
code_config = TokenizerConfig(
    vocab_size=32000,
    model_type="bpe",
    split_by_whitespace=True,
    split_by_punctuation=False,  # Keep operators together
    special_tokens=[
        "<pad>", "<unk>", "<s>", "</s>",
        "<code>", "</code>",
        "<comment>", "</comment>",
        "<string>", "</string>",
        "<number>", "</number>"
    ]
)

# Train on code data
code_tokenizer = train_tokenizer(
    input_file="code_dataset.txt",
    output_dir="./code_tokenizer",
    config=code_config
)

Multilingual Tokenization

# Configure for multiple languages
multilingual_config = TokenizerConfig(
    vocab_size=64000,  # Larger vocab for multiple languages
    model_type="unigram",  # Better for morphologically rich languages
    character_coverage=0.9995,  # High coverage for diverse scripts
    normalize=True,
    special_tokens=[
        "<pad>", "<unk>", "<s>", "</s>",
        "<en>", "<es>", "<fr>", "<de>",  # Language tags
        "<zh>", "<ja>", "<ar>", "<hi>"
    ]
)

Scientific Text Tokenization

# Configure for scientific text
scientific_config = TokenizerConfig(
    vocab_size=24000,
    model_type="bpe",
    split_digits=False,  # Keep numbers together
    special_tokens=[
        "<pad>", "<unk>", "<s>", "</s>",
        "<math>", "</math>",
        "<formula>", "</formula>",
        "<citation>", "</citation>",
        "<table>", "</table>",
        "<figure>", "</figure>"
    ]
)

🔧 Advanced Features

Tokenizer Merging

# Merge multiple tokenizers
from llmbuilder.tokenizer import merge_tokenizers

base_tokenizer = Tokenizer.from_pretrained("./base_tokenizer")
domain_tokenizer = Tokenizer.from_pretrained("./domain_tokenizer")

merged_tokenizer = merge_tokenizers(
    [base_tokenizer, domain_tokenizer],
    output_dir="./merged_tokenizer",
    vocab_size=32000
)

Vocabulary Extension

# Extend existing tokenizer vocabulary
new_tokens = ["<new_token_1>", "<new_token_2>", "domain_term"]

extended_tokenizer = tokenizer.extend_vocabulary(
    new_tokens=new_tokens,
    output_dir="./extended_tokenizer"
)

print(f"Original vocab size: {len(tokenizer)}")
print(f"Extended vocab size: {len(extended_tokenizer)}")

Tokenizer Adaptation

# Adapt tokenizer to new domain
adapted_tokenizer = tokenizer.adapt_to_domain(
    domain_data="new_domain_data.txt",
    adaptation_ratio=0.1,  # 10% of vocab from new domain
    output_dir="./adapted_tokenizer"
)

📊 Performance Optimization

Fast Tokenization

# Enable fast tokenization
tokenizer = Tokenizer.from_pretrained(
    "./tokenizer",
    use_fast=True,      # Use fast Rust implementation
    num_threads=8       # Parallel processing
)

# Benchmark tokenization speed
import time

text = "Your text here" * 1000  # Large text
start_time = time.time()
tokens = tokenizer.encode(text)
end_time = time.time()

print(f"Tokenization speed: {len(tokens) / (end_time - start_time):.0f} tokens/sec")

Memory Optimization

# Optimize for memory usage
tokenizer = Tokenizer.from_pretrained(
    "./tokenizer",
    low_memory=True,    # Reduce memory usage
    mmap=True          # Memory-map vocabulary files
)

Caching

# Enable tokenization caching
tokenizer.enable_cache(
    cache_dir="./tokenizer_cache",
    max_cache_size="1GB"
)

# Tokenization results will be cached for repeated texts

🚨 Troubleshooting

Common Issues

1. Poor Tokenization Quality

# Increase vocabulary size
config.vocab_size = 32000

# Improve character coverage
config.character_coverage = 0.9999

# Adjust minimum frequency
config.min_frequency = 1

2. Out-of-Memory During Training

# Reduce training data size
config.max_sentence_length = 1024

# Use fewer threads
config.num_threads = 4

# Process in chunks
trainer.train_chunked(
    input_file="large_data.txt",
    chunk_size=1000000  # 1M characters per chunk
)

3. Slow Tokenization

# Use fast tokenizer
tokenizer = Tokenizer.from_pretrained("./tokenizer", use_fast=True)

# Enable parallel processing
tokenizer.enable_parallelism(True)

# Batch process texts
tokens = tokenizer.encode_batch(texts, batch_size=1000)

Validation and Testing

# Validate tokenizer quality
def validate_tokenizer(tokenizer, test_texts):
    """Validate tokenizer on test texts."""
    issues = []

    for text in test_texts:
        tokens = tokenizer.encode(text)
        decoded = tokenizer.decode(tokens)

        if text != decoded:
            issues.append({
                'original': text,
                'decoded': decoded,
                'tokens': tokens
            })

    if issues:
        print(f"⚠️  Found {len(issues)} reconstruction issues:")
        for issue in issues[:5]:  # Show first 5
            print(f"  Original: {issue['original']}")
            print(f"  Decoded:  {issue['decoded']}")
            print()
    else:
        print("βœ… All texts reconstructed perfectly!")

    return len(issues) == 0

# Test tokenizer
test_texts = [
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "Special characters: @#$%^&*()",
    "Numbers: 123 456.789",
    "Unicode: cafΓ© naΓ―ve rΓ©sumΓ©"
]

is_valid = validate_tokenizer(tokenizer, test_texts)

📚 Best Practices

1. Vocabulary Size Selection

  • Small datasets (< 1M tokens): 8K - 16K vocabulary
  • Medium datasets (1M - 100M tokens): 16K - 32K vocabulary
  • Large datasets (> 100M tokens): 32K - 64K vocabulary
  • Multilingual: 64K - 128K vocabulary (see the helper sketch after this list)
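
As a rough illustration of the ranges above, here is a small helper (hypothetical, not part of the LLMBuilder API) that suggests a starting vocabulary size from an approximate corpus token count:

def suggest_vocab_size(corpus_tokens: int, multilingual: bool = False) -> int:
    """Hypothetical helper: pick a starting vocab_size from the ranges above; tune for your data."""
    if multilingual:
        return 64_000                    # multilingual: 64K - 128K
    if corpus_tokens < 1_000_000:
        return 16_000                    # small dataset: 8K - 16K
    if corpus_tokens < 100_000_000:
        return 32_000                    # medium dataset: 16K - 32K
    return 64_000                        # large dataset: 32K - 64K

print(suggest_vocab_size(5_000_000))  # 32000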

2. Algorithm Selection

  • BPE: General purpose, most compatible
  • Unigram: Better for morphologically rich languages
  • Word: Simple datasets, interpretability needed
  • Character: Very small datasets, noisy text

3. Special Token Strategy

  • Always include <pad>, <unk>, <s>, </s>
  • Add domain-specific tokens for better performance
  • Reserve 5-10% of vocabulary for special tokens (see the quick check after this list)
  • Use consistent special token formats
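
For instance, with a 16K vocabulary the 5-10% guideline leaves roughly 800-1,600 slots for special and domain-specific tokens (a quick back-of-the-envelope check):

vocab_size = 16000
low, high = int(vocab_size * 0.05), int(vocab_size * 0.10)
print(f"Reserve between {low} and {high} vocabulary slots for special tokens")  # 800 and 1600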

4. Training Data Quality

  • Use the same data distribution as your target task
  • Include diverse text types and styles
  • Clean data but preserve important patterns
  • Ensure sufficient data size (> 1M characters recommended; see the check after this list)
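
A quick pre-flight check for the last point, using only the Python standard library (adjust the path and threshold to your setup):

from pathlib import Path

# Warn if the training corpus is below the ~1M character recommendation
corpus_path = Path("training_data.txt")
num_chars = len(corpus_path.read_text(encoding="utf-8"))

print(f"Corpus size: {num_chars:,} characters")
if num_chars < 1_000_000:
    print("⚠️  Corpus may be too small for stable tokenizer training")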

5. Validation and Testing

  • Test on held-out data
  • Verify perfect reconstruction for important text types
  • Monitor compression ratios
  • Check coverage of domain-specific terms

Tokenization Tips

  • Train your tokenizer on the same type of data you'll use for model training
  • Always validate tokenizer quality before training your model
  • Consider the trade-off between vocabulary size and model size
  • Save tokenizer training logs and configurations for reproducibility
  • Test tokenization on edge cases and special characters