Quick Start

Get up and running with LLMBuilder in just 5 minutes! This guide will walk you through training your first language model.

🚀 5-Minute Setup

Step 1: Install LLMBuilder

pip install llmbuilder

Step 2: Prepare Your Data

Create a small text file with some training data (a few sentences is enough to smoke-test the pipeline, though a useful model needs far more text):

# Create a sample data file
echo "Artificial intelligence is transforming the world. Machine learning enables computers to learn from data. Deep learning uses neural networks to solve complex problems." > sample_data.txt

Step 3: Train Your First Model

Use the interactive CLI to train a model:

llmbuilder welcome

Or use the direct command:

# Train a small model (perfect for testing)
llmbuilder train model \
  --data sample_data.txt \
  --tokenizer ./tokenizer \
  --output ./my_first_model \
  --epochs 5 \
  --batch-size 2

Step 3 (Alternative): Use the bundled example (Model_Test)

If you cloned the repository, it includes a tiny cybersecurity dataset under Model_Test/Data/ and a ready-to-run script at docs/train_model.py. Note: if you pass --data_dir ./Model_Test, the script auto-detects the nested Data/ folder.

# Train using the example dataset (explicit Data/ path)
python docs/train_model.py --data_dir ./Model_Test/Data --output_dir ./Model_Test/output \
  --epochs 5 --batch_size 1 --block_size 64 --embed_dim 256 --layers 4 --heads 8 \
  --prompt "Cybersecurity is important because"

# After training, generate again any time without retraining
# (forward slashes in the paths work on Windows as well)
python -c "import llmbuilder; print(llmbuilder.generate_text( \
  model_path='./Model_Test/output/checkpoints/latest_checkpoint.pt', \
  tokenizer_path='./Model_Test/output/tokenizer', \
  prompt='what is Cybersecurity', max_new_tokens=80, temperature=0.8, top_p=0.9))"

Outputs are created in Model_Test/output/:

  • Model_Test/output/tokenizer/ – trained tokenizer
  • Model_Test/output/checkpoints/ – model checkpoints (latest, epoch_*.pt)

Step 4: Generate Text

# Generate text with your trained model
llmbuilder generate text \
  --model ./my_first_model/model.pt \
  --tokenizer ./tokenizer \
  --prompt "Artificial intelligence" \
  --max-tokens 50

🎉 Congratulations! You've just trained and used your first language model with LLMBuilder!

🐍 Python API Quick Start

Prefer Python code? Here's the same workflow using the Python API:

import llmbuilder as lb

# 1. Load configuration
cfg = lb.load_config(preset="cpu_small")

# 2. Build model
model = lb.build_model(cfg.model)

# 3. Prepare data
from llmbuilder.data import TextDataset
dataset = TextDataset("sample_data.txt", block_size=cfg.model.max_seq_length)

# 4. Train model
results = lb.train_model(model, dataset, cfg.training)

# 5. Generate text
text = lb.generate_text(
    model_path="./checkpoints/model.pt",  # adjust to where your config saved the model
    tokenizer_path="./tokenizer",
    prompt="Artificial intelligence",
    max_new_tokens=50
)
print(text)

📊 Understanding the Output

When training completes, you'll see output like:

✅ Training completed successfully!
📊 Final Results:
  • Training Loss: 2.45
  • Validation Loss: 2.52
  • Training Time: 3m 24s
  • Model Parameters: 2.1M
  • Model Size: 8.4MB

💾 Outputs saved to:
  • Model: ./my_first_model/model.pt
  • Tokenizer: ./tokenizer/
  • Logs: ./my_first_model/training.log
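
A quick sanity check on those numbers: 2.1M parameters stored as 32-bit floats take about 2.1M × 4 bytes ≈ 8.4 MB, which matches the reported model size.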

🎯 What Just Happened?

Let's break down what LLMBuilder did:

  1. Data Processing: Loaded and cleaned your text data
  2. Tokenization: Created a vocabulary and tokenized the text
  3. Model Creation: Built a GPT-style transformer model
  4. Training: Trained the model to predict the next token (see the sketch after this list)
  5. Saving: Saved the model and tokenizer for later use
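
Step 4 is the heart of the process. Here is a minimal illustration of next-token training (generic PyTorch, not LLMBuilder's internal code): the token sequence is shifted by one so the model learns to predict each token from the ones before it, and cross-entropy scores the predictions.

import torch
import torch.nn.functional as F

# Toy "tokenized" text: each number is a token id
tokens = torch.tensor([5, 9, 2, 7, 3, 1])

# Inputs are tokens[:-1]; targets are the same sequence shifted by one
inputs, targets = tokens[:-1], tokens[1:]

# Stand-in for a model's output: one row of logits per input position
vocab_size = 10
logits = torch.randn(len(inputs), vocab_size)

# Cross-entropy between predicted distributions and the true next tokens
loss = F.cross_entropy(logits, targets)
print(f"loss = {loss.item():.2f}")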

🔧 Customization Options

Different Model Sizes

# Tiny model (fastest, least memory)
llmbuilder train model --config-preset tiny --data data.txt --output model/

# Small model (balanced)
llmbuilder train model --config-preset cpu_small --data data.txt --output model/

# Medium model (better quality, needs more resources)
llmbuilder train model --config-preset gpu_medium --data data.txt --output model/

Different Data Formats

LLMBuilder supports multiple input formats:

# Process PDF files
llmbuilder data load --input documents/ --output clean_text.txt --format pdf

# Process DOCX files
llmbuilder data load --input documents/ --output clean_text.txt --format docx

# Process all supported formats
llmbuilder data load --input documents/ --output clean_text.txt --format all

Interactive Mode

For a guided experience:

# Interactive training setup
llmbuilder train model --interactive

# Interactive text generation
llmbuilder generate text --setup

📈 Next Steps

Now that you have a working model, here are some things to try:

1. Improve Your Model

# Train for more epochs
llmbuilder train model --data data.txt --output model/ --epochs 20

# Use a larger model
llmbuilder train model --data data.txt --output model/ --layers 12 --dim 768

# Add more training data
llmbuilder data load --input more_documents/ --output bigger_dataset.txt

2. Fine-tune on Specific Data

# Fine-tune your model on domain-specific data
llmbuilder finetune model \
  --model ./my_first_model/model.pt \
  --dataset domain_specific_data.txt \
  --output ./fine_tuned_model

3. Export for Production

# Export to GGUF format for llama.cpp
llmbuilder export gguf ./my_first_model/model.pt --output model.gguf

# Export to ONNX for mobile/edge deployment
llmbuilder export onnx ./my_first_model/model.pt --output model.onnx

4. Advanced Generation

import llmbuilder as lb

# Interactive chat-like generation
lb.interactive_cli(
    model_path="./my_first_model/model.pt",
    tokenizer_path="./tokenizer",
    temperature=0.8,
    top_k=50
)

🛠️ Configuration Presets

LLMBuilder comes with several built-in presets:

Preset       Use Case                  Memory    Training Time
tiny         Testing, debugging        ~1GB      Minutes
cpu_small    CPU training, learning    ~2GB      Hours
gpu_medium   Single GPU training       ~8GB      Hours
gpu_large    High-end GPU training     ~16GB+    Days
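
You can compare presets programmatically with the same load_config call from the Python quick start (the exact fields inside cfg.model may vary between versions, so treat this as a sketch):

import llmbuilder as lb

# Load each preset and print its model configuration for comparison
for preset in ["tiny", "cpu_small", "gpu_medium", "gpu_large"]:
    cfg = lb.load_config(preset=preset)
    print(preset, cfg.model)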

🔍 Monitoring Training

Real-time Monitoring

# Monitor training progress
tail -f ./my_first_model/training.log

# Or use the built-in progress display
llmbuilder train model --data data.txt --output model/ --verbose

Training Metrics

LLMBuilder tracks important metrics:

  • Loss: How well the model is learning
  • Perplexity: Model confidence, computed as exp(loss); lower is better (see the example below)
  • Learning Rate: Training speed
  • Memory Usage: Resource consumption

🚨 Common Issues & Solutions

Out of Memory

# Reduce batch size
llmbuilder train model --data data.txt --output model/ --batch-size 1

# Use CPU-only mode
llmbuilder train model --data data.txt --output model/ --device cpu

Slow Training

# Use GPU if available
llmbuilder train model --data data.txt --output model/ --device cuda

# Reduce model size
llmbuilder train model --data data.txt --output model/ --layers 4 --dim 256

Poor Generation Quality

# Train for more epochs
llmbuilder train model --data data.txt --output model/ --epochs 50

# Use more training data
# Add more text files to your dataset

# Adjust generation parameters
llmbuilder generate text --model model.pt --tokenizer tokenizer/ \
  --prompt "Your prompt" --temperature 0.7 --top-k 40
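
If you're curious what --temperature and --top-k actually do, here is a generic sketch of the sampling step (illustrative only, not LLMBuilder's internal code): temperature rescales the logits before softmax, and top-k keeps only the k most likely tokens.

import numpy as np

def sample_next(logits, temperature=0.7, top_k=40):
    # Lower temperature sharpens the distribution; higher flattens it
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None and top_k < len(logits):
        # Mask everything below the k-th largest logit
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    # Softmax over the surviving logits, then sample a token id
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

print(sample_next(np.random.randn(100)))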

📚 Learn More

Ready to dive deeper? Explore the rest of the documentation for guides on data processing, fine-tuning, and export/deployment.


You're Ready!

You now have a working LLMBuilder setup! The model you just trained might be small, but you've learned the complete workflow. Try experimenting with different data, model sizes, and generation parameters to see what works best for your use case.