# Migration Guide
This guide helps you migrate from legacy LLMBuilder scripts and configurations to the new package structure. Whether you're upgrading from older versions or transitioning from standalone scripts, this guide covers all the changes you need to know.
## 🎯 Overview
The LLMBuilder package represents a major evolution from the original standalone scripts. Key improvements include:
- **Unified API**: Single import for all functionality
- **Better Configuration**: Structured, validated configurations
- **CLI Interface**: Comprehensive command-line tools
- **Modular Design**: Use only what you need
- **Better Documentation**: Complete guides and examples
## 📋 Migration Checklist
- [ ] Update Python version to 3.8+
- [ ] Install new LLMBuilder package
- [ ] Convert configuration files
- [ ] Update import statements
- [ ] Migrate training scripts
- [ ] Update CLI commands
- [ ] Test functionality
- [ ] Update deployment scripts
## 🔄 Configuration Migration

### Legacy Configuration Format

Old format (`config.json`):

```json
{
  "n_layer": 12,
  "n_head": 12,
  "n_embd": 768,
  "block_size": 1024,
  "dropout": 0.1,
  "bias": true,
  "vocab_size": 16000,
  "device": "cuda",
  "batch_size": 16,
  "learning_rate": 3e-4,
  "max_iters": 10000,
  "eval_interval": 500,
  "eval_iters": 100
}
```
### New Configuration Format

New format (structured):

```json
{
  "model": {
    "vocab_size": 16000,
    "num_layers": 12,
    "num_heads": 12,
    "embedding_dim": 768,
    "max_seq_length": 1024,
    "dropout": 0.1,
    "bias": true,
    "model_type": "gpt"
  },
  "training": {
    "batch_size": 16,
    "learning_rate": 3e-4,
    "num_epochs": 10,
    "warmup_steps": 1000,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "save_every": 1000,
    "eval_every": 500,
    "log_every": 100
  },
  "system": {
    "device": "cuda",
    "seed": 42,
    "deterministic": false
  }
}
```
### Automatic Migration

LLMBuilder provides automatic migration for legacy configurations:

```python
import llmbuilder as lb

# Load legacy configuration (automatically migrated)
config = lb.load_config("legacy_config.json")

# Or migrate explicitly
from llmbuilder.config import migrate_legacy_config

new_config = migrate_legacy_config("legacy_config.json")
new_config.save("new_config.json")
```
### Configuration Key Mapping

| Legacy Key | New Key | Notes |
|---|---|---|
| `n_layer` | `model.num_layers` | Number of transformer layers |
| `n_head` | `model.num_heads` | Number of attention heads |
| `n_embd` | `model.embedding_dim` | Embedding dimension |
| `block_size` | `model.max_seq_length` | Maximum sequence length |
| `dropout` | `model.dropout` | Dropout rate |
| `bias` | `model.bias` | Use bias in linear layers |
| `vocab_size` | `model.vocab_size` | Vocabulary size |
| `device` | `system.device` | Training device |
| `batch_size` | `training.batch_size` | Batch size |
| `learning_rate` | `training.learning_rate` | Learning rate |
| `max_iters` | `training.max_steps` | Maximum training steps |
| `eval_interval` | `training.eval_every` | Evaluation frequency |
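The table above is exactly the mapping that `migrate_legacy_config` applies for you. If you ever need to migrate a flat config by hand (for example, inside a larger build script), the same mapping can be expressed directly. The sketch below is illustrative only; `KEY_MAP` and `migrate_flat_config` are hypothetical names, not part of the package:

```python
import json

# Mapping from legacy flat keys to (section, new_key), per the table above.
KEY_MAP = {
    "n_layer": ("model", "num_layers"),
    "n_head": ("model", "num_heads"),
    "n_embd": ("model", "embedding_dim"),
    "block_size": ("model", "max_seq_length"),
    "dropout": ("model", "dropout"),
    "bias": ("model", "bias"),
    "vocab_size": ("model", "vocab_size"),
    "device": ("system", "device"),
    "batch_size": ("training", "batch_size"),
    "learning_rate": ("training", "learning_rate"),
    "max_iters": ("training", "max_steps"),
    "eval_interval": ("training", "eval_every"),
}

def migrate_flat_config(path: str) -> dict:
    """Convert a flat legacy config file into the new nested layout."""
    with open(path) as f:
        legacy = json.load(f)
    migrated: dict = {}
    for key, value in legacy.items():
        if key in KEY_MAP:
            section, new_key = KEY_MAP[key]
            migrated.setdefault(section, {})[new_key] = value
        # Keys with no new equivalent (e.g. eval_iters) are dropped here;
        # handle them explicitly if your project relies on them.
    return migrated

if __name__ == "__main__":
    print(json.dumps(migrate_flat_config("legacy_config.json"), indent=2))
```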
## 🐍 Code Migration

### Legacy Import Pattern

Old way:

```python
# Legacy imports
from model import GPTConfig, GPT
from train import train
from sample import sample
import tiktoken

# Legacy usage
config = GPTConfig(
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024,
    vocab_size=50257
)
model = GPT(config)
```
### New Import Pattern

New way:

```python
# New unified import
import llmbuilder as lb

# New usage
config = lb.load_config(preset="gpu_medium")
model = lb.build_model(config.model)
```
### Training Script Migration

Legacy training script:

```python
# train.py (legacy)
import torch
from model import GPTConfig, GPT
from dataset import get_batch

# Configuration
config = GPTConfig(...)
model = GPT(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Training loop
for iter in range(max_iters):
    X, Y = get_batch('train')
    logits, loss = model(X, Y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if iter % eval_interval == 0:
        evaluate()
```
New training script:

```python
# train.py (new)
import llmbuilder as lb
from llmbuilder.data import TextDataset

# Configuration
config = lb.load_config(preset="gpu_medium")
model = lb.build_model(config.model)

# Dataset
dataset = TextDataset("data.txt", block_size=config.model.max_seq_length)

# Training (handled automatically)
results = lb.train_model(model, dataset, config.training)
```
### Generation Script Migration

Legacy generation:

```python
# sample.py (legacy)
import torch
from model import GPT
import tiktoken

# Load model
model = GPT.from_pretrained('model.pt')
enc = tiktoken.get_encoding("gpt2")

# Generate
context = enc.encode("Hello")
generated = model.generate(context, max_new_tokens=100)
text = enc.decode(generated)
```

New generation:

```python
# generate.py (new)
import llmbuilder as lb

# Generate (much simpler)
text = lb.generate_text(
    model_path="model.pt",
    tokenizer_path="tokenizer/",
    prompt="Hello",
    max_new_tokens=100
)
```
## 🖥️ CLI Migration

### Legacy CLI Commands

Old commands:

```bash
# Legacy training
python train.py --config config.json --data data.txt

# Legacy generation
python sample.py --model model.pt --prompt "Hello" --num_samples 5

# Legacy data preparation
python prepare_data.py --input raw/ --output data.txt
```

### New CLI Commands

New commands:

```bash
# New training
llmbuilder train model --config config.json --data data.txt --output model/

# New generation
llmbuilder generate text --model model.pt --tokenizer tokenizer/ --prompt "Hello"

# New data preparation
llmbuilder data load --input raw/ --output data.txt --clean
```
### CLI Command Mapping

| Legacy Command | New Command | Notes |
|---|---|---|
| `python train.py` | `llmbuilder train model` | Unified training interface |
| `python sample.py` | `llmbuilder generate text` | Enhanced generation options |
| `python prepare_data.py` | `llmbuilder data load` | Better data processing |
| `python eval.py` | `llmbuilder model evaluate` | Built-in evaluation |
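If other scripts or CI jobs still call the legacy entry points, a thin shim can translate them to the new CLI while you finish migrating. This is a hypothetical sketch: it maps script names using the table above and passes the remaining flags through unchanged, which only works where old and new flag names coincide (extend the mapping for flags that changed, such as `--num_samples`):

```python
#!/usr/bin/env python3
"""Hypothetical shim: forward legacy script invocations to the new CLI."""
import subprocess
import sys

# Command mapping from the table above.
LEGACY_TO_NEW = {
    "train.py": ["llmbuilder", "train", "model"],
    "sample.py": ["llmbuilder", "generate", "text"],
    "prepare_data.py": ["llmbuilder", "data", "load"],
    "eval.py": ["llmbuilder", "model", "evaluate"],
}

def main() -> int:
    if len(sys.argv) < 2 or sys.argv[1] not in LEGACY_TO_NEW:
        print(f"usage: {sys.argv[0]} <legacy_script> [flags...]", file=sys.stderr)
        return 2
    # Remaining flags are passed through unchanged.
    new_cmd = LEGACY_TO_NEW[sys.argv[1]] + sys.argv[2:]
    print("Running:", " ".join(new_cmd))
    return subprocess.run(new_cmd).returncode

if __name__ == "__main__":
    raise SystemExit(main())
```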
## 📊 Data Format Migration
### Legacy Data Format
Old format:
- Single text file with raw concatenated text
- Manual tokenization required
- No metadata or structure
### New Data Format
New format:
- Multiple input formats supported (PDF, DOCX, etc.)
- Automatic cleaning and preprocessing
- Structured dataset with metadata
- Built-in train/validation splitting
Migration example:

```python
# Legacy data preparation
with open('data.txt', 'r') as f:
    text = f.read()

# Tokenize manually
import tiktoken
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode(text)

# New data preparation
from llmbuilder.data import DataLoader, TextDataset

# Load and clean automatically
loader = DataLoader(clean_text=True, remove_duplicates=True)
texts = loader.load_directory("raw_data/")

# Create structured dataset
dataset = TextDataset(
    texts,
    block_size=1024,
    stride=512,
    train_split=0.9
)
```
## 🔧 Model Architecture Migration

### Legacy Model Definition

Old way:

```python
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Manual layer definition...
        ...

    def forward(self, idx, targets=None):
        # Manual forward pass...
        ...
```
### New Model Definition

New way:

```python
# Use built-in model builder
import llmbuilder as lb

config = lb.load_config(preset="gpu_medium")
model = lb.build_model(config.model)

# Or customize if needed
from llmbuilder.model import GPTModel
from llmbuilder.config import ModelConfig

custom_config = ModelConfig(
    vocab_size=32000,
    num_layers=16,
    num_heads=16,
    embedding_dim=1024
)
model = GPTModel(custom_config)
```
## 🚀 Deployment Migration

### Legacy Deployment

Old way:

```python
# Manual model saving/loading
import torch

torch.save({
    'model': model.state_dict(),
    'config': config,
    'iter_num': iter_num,
}, 'model.pt')

# Manual inference setup
checkpoint = torch.load('model.pt')
model.load_state_dict(checkpoint['model'])
```
### New Deployment

New way:

```python
# Automatic model management
from llmbuilder.model import save_model, load_model

# Save with metadata
save_model(model, "model.pt", metadata={
    "training_config": config,
    "performance_metrics": results
})

# Load with validation
model = load_model("model.pt")

# Export for deployment
from llmbuilder.export import export_gguf

export_gguf("model.pt", "model.gguf", quantization="q4_0")
```
## 🔍 Testing Migration

### Verify Migration Success

1. **Configuration Test**:

```python
import llmbuilder as lb

# Test configuration loading
config = lb.load_config("migrated_config.json")
print(f"Config loaded: {config.model.num_layers} layers")

# Test model building
model = lb.build_model(config.model)
print(f"Model built: {sum(p.numel() for p in model.parameters()):,} parameters")
```
2. **Training Test**:

```python
# Test training pipeline
from llmbuilder.data import TextDataset

dataset = TextDataset("test_data.txt", block_size=512)
results = lb.train_model(model, dataset, config.training)
print(f"Training completed: {results.final_loss:.4f} final loss")
```
3. **Generation Test**:

```python
# Test generation
text = lb.generate_text(
    model_path="model.pt",
    tokenizer_path="tokenizer/",
    prompt="Test prompt",
    max_new_tokens=50
)
print(f"Generated: {text}")
```
## 🚨 Common Migration Issues

### Issue 1: Configuration Validation Errors

**Problem**: Legacy configs fail validation.

**Solution**:

```python
import llmbuilder as lb
from llmbuilder.config import validate_config, fix_config

# Validate and fix automatically
config = lb.load_config("legacy_config.json")
is_valid, errors = validate_config(config)

if not is_valid:
    fixed_config = fix_config(config, errors)
    fixed_config.save("fixed_config.json")
```
### Issue 2: Model Size Mismatch

**Problem**: The tokenizer's vocab size doesn't match the model's.

**Solution**:

```python
# Check and fix vocab size mismatch
from llmbuilder.tokenizer import Tokenizer

tokenizer = Tokenizer.from_pretrained("tokenizer/")
config.model.vocab_size = len(tokenizer)
```
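If the model was built before applying this fix, rebuild it so the embedding and output layers pick up the corrected size. A minimal sketch, reusing the `lb.build_model` call and the `config`/`tokenizer` names from the snippet above:

```python
import llmbuilder as lb

# Rebuild so the embedding matrix matches the corrected vocab size.
model = lb.build_model(config.model)
assert config.model.vocab_size == len(tokenizer), "vocab sizes still out of sync"
```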
### Issue 3: Performance Regression

**Problem**: The new version runs slower than the legacy scripts.

**Solution**:

```python
# Enable performance optimizations
config.system.compile = True                 # PyTorch 2.0 compilation
config.training.mixed_precision = "fp16"     # Mixed precision
config.training.dataloader_num_workers = 4   # Parallel data loading
```
### Issue 4: Generation Quality Differences

**Problem**: Generated text quality differs from the legacy output.

**Solution**:

```python
# Match legacy generation parameters
import llmbuilder as lb
from llmbuilder.inference import GenerationConfig

legacy_config = GenerationConfig(
    temperature=0.8,
    top_k=200,               # Legacy often used higher top_k
    do_sample=True,
    repetition_penalty=1.0   # Legacy default
)

text = lb.generate_text(
    model_path="model.pt",
    tokenizer_path="tokenizer/",
    prompt=prompt,
    config=legacy_config
)
```
## 📚 Migration Examples

### Complete Migration Script

```python
#!/usr/bin/env python3
"""
Complete migration script from legacy LLMBuilder to the new package.
"""
import shutil
from pathlib import Path

import llmbuilder as lb


def migrate_project(legacy_dir, new_dir):
    """Migrate an entire legacy project to the new structure."""
    legacy_path = Path(legacy_dir)
    new_path = Path(new_dir)
    new_path.mkdir(exist_ok=True)

    print(f"Migrating {legacy_dir} -> {new_dir}")

    # 1. Migrate configuration
    legacy_config = legacy_path / "config.json"
    if legacy_config.exists():
        print("Migrating configuration...")
        config = lb.load_config(str(legacy_config))
        config.save(str(new_path / "config.json"))

    # 2. Copy and organize data
    print("Organizing data...")
    data_dir = new_path / "data"
    data_dir.mkdir(exist_ok=True)
    for data_file in legacy_path.glob("*.txt"):
        if "data" in data_file.name.lower():
            shutil.copy2(data_file, data_dir / data_file.name)

    # 3. Migrate model if it exists
    model_files = list(legacy_path.glob("*.pt"))
    if model_files:
        print("Migrating model...")
        model_dir = new_path / "model"
        model_dir.mkdir(exist_ok=True)
        for model_file in model_files:
            shutil.copy2(model_file, model_dir / model_file.name)

    # 4. Create new training script
    print("Creating new training script...")
    create_training_script(new_path)

    print("Migration completed!")


def create_training_script(project_dir):
    """Write a new training script that uses the LLMBuilder package."""
    script_content = '''#!/usr/bin/env python3
"""
Migrated training script using the LLMBuilder package.
"""
import llmbuilder as lb
from llmbuilder.data import TextDataset


def main():
    # Load configuration
    config = lb.load_config("config.json")

    # Build model
    model = lb.build_model(config.model)

    # Prepare dataset
    dataset = TextDataset(
        "data/training_data.txt",
        block_size=config.model.max_seq_length
    )

    # Train model
    results = lb.train_model(model, dataset, config.training)
    print(f"Training completed: {results.final_loss:.4f} final loss")


if __name__ == "__main__":
    main()
'''
    script_path = project_dir / "train.py"
    with open(script_path, 'w') as f:
        f.write(script_content)


# Usage
if __name__ == "__main__":
    migrate_project("legacy_project/", "new_project/")
```
## ✅ Post-Migration Checklist
After migration, verify these items (the script sketch after this list automates most of them):
- [ ] Configuration loads without errors
- [ ] Model builds with correct architecture
- [ ] Training runs successfully
- [ ] Generation produces expected quality
- [ ] Performance meets expectations
- [ ] All CLI commands work
- [ ] Export functionality works
- [ ] Tests pass
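Most of these checks can be scripted. The sketch below simply strings together the calls from the Testing Migration section; the file paths are placeholders, and `smoke_test` is just an illustrative name:

```python
import llmbuilder as lb
from llmbuilder.data import TextDataset

def smoke_test():
    # 1. Configuration loads without errors
    config = lb.load_config("config.json")

    # 2. Model builds with the expected architecture
    model = lb.build_model(config.model)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Model built with {n_params:,} parameters")

    # 3. A short training run completes
    dataset = TextDataset("data/test_data.txt", block_size=512)
    results = lb.train_model(model, dataset, config.training)
    print(f"Final loss: {results.final_loss:.4f}")

    # 4. Generation produces output
    text = lb.generate_text(
        model_path="model.pt",
        tokenizer_path="tokenizer/",
        prompt="Test prompt",
        max_new_tokens=50,
    )
    assert text, "generation returned empty output"

if __name__ == "__main__":
    smoke_test()
```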
## 🆘 Getting Help
If you encounter issues during migration:
- **Check the FAQ**: Common migration issues are covered
- **GitHub Issues**: Search for similar migration problems
- **GitHub Discussions**: Ask the community for help
- **Documentation**: Review the complete guides
- **Examples**: Look at working migration examples
**Migration Tips**
- Start with a small test project before migrating everything
- Keep backups of your legacy code and models
- Test thoroughly after migration
- Take advantage of new features like automatic data cleaning
- Update your deployment scripts to use the new export functionality