Tokenization

Tokenization converts raw text into sequences of integer token IDs that a language model can process.

Overview

Tokenization bridges the gap between human text and machine learning:

graph LR
    A["Hello world!"] --> B[Tokenizer]
    B --> C["[15496, 995, 0]"]
    C --> D[Model]
    D --> E[Predictions]
    E --> F[Detokenizer]
    F --> G["Generated text"]
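
The round trip below illustrates the same idea with a toy word-level vocabulary. This is a conceptual sketch only, not LLMBuilder's actual tokenizer:

# Conceptual sketch: a toy word-level vocabulary, not LLMBuilder's tokenizer
vocab = {"Hello": 0, "world": 1, "!": 2}
inverse = {i: w for w, i in vocab.items()}

ids = [vocab[w] for w in ["Hello", "world", "!"]]   # "encode"
words = [inverse[i] for i in ids]                   # "decode"

print(ids)    # [0, 1, 2]
print(words)  # ['Hello', 'world', '!']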

Tokenization Algorithms

LLMBuilder supports several tokenization algorithms:

Byte Pair Encoding (BPE)

BPE is the most widely used algorithm for language-model tokenizers. It builds its vocabulary bottom-up, repeatedly merging the most frequent adjacent pair of symbols until the target vocabulary size is reached:

from llmbuilder.tokenizer import TokenizerTrainer
from llmbuilder.config import TokenizerConfig

# Configure BPE tokenizer
config = TokenizerConfig(
    vocab_size=16000,
    model_type="bpe"
)

# Train tokenizer
trainer = TokenizerTrainer(config=config)
trainer.train(
    input_file="training_data.txt",
    output_dir="./tokenizer"
)
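
To see what training does under the hood, the sketch below runs one round of the classic BPE merge step on a tiny hand-made corpus. It is illustrative only and independent of LLMBuilder's implementation:

from collections import Counter

# Toy corpus: words as symbol tuples, with occurrence counts (made up for illustration)
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for symbols, count in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged = {}
    for symbols, count in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = count
    return merged

pair = most_frequent_pair(corpus)   # e.g. ('l', 'o'), seen 7 times
corpus = merge_pair(corpus, pair)
print(corpus)   # {('lo', 'w'): 5, ('lo', 'w', 'e', 'r'): 2, ('n', 'e', 'w'): 3}

Real BPE training repeats this merge step until the vocabulary reaches vocab_size.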

Unigram Language Model

A statistical approach that starts from a large candidate vocabulary and prunes it to maximize the likelihood of the training corpus under a unigram language model:

config = TokenizerConfig(
    vocab_size=16000,
    model_type="unigram"
)
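
At inference time the unigram model picks the segmentation with the highest total log-probability. Below is a minimal Viterbi-style segmenter over a hand-written vocabulary; the tokens and log-probabilities are made up for illustration:

import math

# Hypothetical token log-probabilities, made up for illustration
log_prob = {"un": -3.0, "i": -4.0, "gram": -3.5, "u": -5.0,
            "n": -5.0, "ig": -4.5, "ram": -4.0}

def segment(word):
    # best[i] = (score, tokens) for the best segmentation of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_prob and best[start][1] is not None:
                score = best[start][0] + log_prob[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1]

print(segment("unigram"))   # (-10.5, ['un', 'i', 'gram'])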

Quick Start

CLI Training

# Train a BPE tokenizer
llmbuilder tokenizer train \
  --input training_data.txt \
  --output ./tokenizer \
  --vocab-size 16000 \
  --algorithm bpe

Python API

from llmbuilder.tokenizer import train_tokenizer

# Train a tokenizer (defaults everywhere except vocab_size)
tokenizer = train_tokenizer(
    input_file="training_data.txt",
    output_dir="./tokenizer",
    vocab_size=16000
)

# Test the tokenizer
text = "Hello, world! This is a test."
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)

print(f"Original: {text}")
print(f"Tokens: {tokens}")
print(f"Decoded: {decoded}")

Configuration Options

Basic Configuration

from llmbuilder.config import TokenizerConfig

config = TokenizerConfig(
    # Core settings
    vocab_size=16000,           # Vocabulary size
    model_type="bpe",           # Algorithm: bpe, unigram, word, char

    # Special tokens
    unk_token="<unk>",         # Unknown token
    bos_token="<s>",           # Beginning of sequence
    eos_token="</s>",          # End of sequence
    pad_token="<pad>",         # Padding token
)
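
The pad token exists so that variable-length sequences can be batched together. A minimal sketch with a placeholder pad ID (look up the real <pad> ID from your trained tokenizer):

# pad_id = 0 is a placeholder; fetch the actual <pad> ID from your tokenizer
pad_id = 0

batch = [[15, 42, 7], [99, 3]]
max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
print(padded)   # [[15, 42, 7], [99, 3, 0]]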

Next Steps

Start with BPE tokenization for most use cases.