
🤖 LLMBuilder

A comprehensive toolkit for building, training, and deploying language models

Available on PyPI · Python 3.8+ · MIT License


What is LLMBuilder?

LLMBuilder is a production-ready framework for training and fine-tuning Large Language Models (LLMs) — not a model itself. Designed for developers, researchers, and AI engineers, LLMBuilder provides a complete pipeline to go from raw multi-format documents to deployable, optimized LLMs, with advanced data processing capabilities and support for both CPU and GPU training.

🎯 Key Features

🚀 Easy to Use

  • One-line training: llmbuilder train model --data data.txt --output model/
  • Interactive CLI: Guided setup with llmbuilder welcome (both commands are shown after this list)
  • Python API: Simple import llmbuilder as lb interface
  • CPU-friendly: Optimized for local development
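
For instance, the two commands named above already cover a minimal end-to-end run. Only the flags shown here come from this page; anything beyond them should be checked against the CLI's built-in help:

# Guided, interactive setup
llmbuilder welcome

# One-line training run from a plain-text corpus
llmbuilder train model --data data.txt --output model/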

🔧 Comprehensive

  • Multi-Format Ingestion: HTML, Markdown, EPUB, PDF, TXT processing
  • Advanced Deduplication: Exact and semantic duplicate detection
  • Flexible Tokenization: BPE, SentencePiece, Hugging Face tokenizers
  • Full Training Pipeline: GPT-style transformer training with checkpointing
  • Fine-tuning: LoRA and full parameter fine-tuning (the LoRA idea is sketched after this list)
  • Model Export: GGUF conversion with multiple quantization levels
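
The LoRA bullet deserves a concrete picture. The sketch below is not LLMBuilder's implementation; it is a minimal, self-contained PyTorch illustration of the idea itself: freeze a pretrained weight matrix and train only a small low-rank update on top of it.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay fixed
        # Effective weight is W + (alpha/rank) * B @ A; only A and B are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train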

⚡ Performance

  • Memory efficient: Gradient checkpointing and mixed precision (see the sketch after this list)
  • Scalable: Single GPU to multi-GPU training
  • Fast inference: Optimized text generation
  • Quantization: 8-bit and 16-bit model compression
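
Both techniques map onto standard PyTorch mechanisms. The fragment below is a generic illustration of gradient checkpointing combined with automatic mixed precision, not a view into LLMBuilder's actual training loop:

import torch
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)]).to(device)
opt = torch.optim.AdamW(blocks.parameters(), lr=3e-4)

x = torch.randn(4, 1024, device=device)
with torch.autocast(device, dtype=torch.bfloat16):
    # Recompute this block's activations during backward instead of storing them
    h = checkpoint(blocks[0], x, use_reentrant=False)
    loss = blocks[1:](h).mean()
loss.backward()
opt.step()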

🛠️ Developer Friendly

  • Modular design: Use only what you need
  • Extensive docs: Complete API reference and examples
  • Testing: Comprehensive test suite
  • Migration: Easy upgrade from legacy scripts

Quick Example

import llmbuilder as lb

# 1. Process multi-format documents
from llmbuilder.data.ingest import IngestionPipeline
pipeline = IngestionPipeline()
pipeline.process_directory("./raw_docs", "./processed.txt")

# 2. Deduplicate content
from llmbuilder.data.dedup import DeduplicationPipeline
dedup = DeduplicationPipeline()
dedup.process_file("./processed.txt", "./clean.txt")

# 3. Train custom tokenizer
from llmbuilder.tokenizer import TokenizerTrainer
trainer = TokenizerTrainer(algorithm="sentencepiece", vocab_size=16000)
trainer.train("./clean.txt", "./tokenizers")

# 4. Load configuration and build model
cfg = lb.load_config(preset="cpu_small")
model = lb.build_model(cfg.model)

# 5. Train the model
from llmbuilder.data import TextDataset
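# block_size = tokens per training example; tying it to max_seq_length fills the context window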
dataset = TextDataset("./clean.txt", block_size=cfg.model.max_seq_length)
results = lb.train_model(model, dataset, cfg.training)

# 6. Convert to GGUF format
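# "Q8_0" selects 8-bit quantization; GGUF also offers e.g. 16-bit (F16) at larger file size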
from llmbuilder.tools.convert_to_gguf import GGUFConverter
converter = GGUFConverter()
converter.convert_model("./checkpoints/model.pt", "./model.gguf", "Q8_0")

# 7. Generate text
text = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="The future of AI is",
    max_new_tokens=50
)
print(text)

Architecture Overview

graph TB
    A[Multi-Format Documents<br/>HTML, Markdown, EPUB, PDF, TXT] --> B[Ingestion Pipeline]
    B --> C[Text Normalization]
    C --> D[Deduplication<br/>Exact & Semantic]
    D --> E[Tokenizer Training<br/>BPE, SentencePiece, HF]
    E --> F[Dataset Creation]
    F --> G[Model Training<br/>GPT Architecture]
    G --> H[Checkpoints & Validation]
    H --> I[Text Generation]
    H --> J[GGUF Conversion<br/>Multiple Quantization Levels]

    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style E fill:#e8f5e8
    style I fill:#e8f5e8
    style J fill:#fff3e0

Getting Started

Choose your path to get started with LLMBuilder:

📚 Documentation Sections

  • Quick Start - Get up and running in 5 minutes
  • Installation - Install LLMBuilder and set up your environment
  • First Model - Train your first language model step by step
  • User Guide - Comprehensive guides for all features and capabilities

Use Cases

Research & Experimentation

Perfect for researchers who need to quickly prototype and experiment with different model architectures, training strategies, and datasets.

Educational Projects

Ideal for students and educators learning about transformer models, with clear examples and comprehensive documentation.

Production Deployment

Ready for production use with model export, quantization, and optimization features for deployment at scale.

Domain-Specific Models

Fine-tune models on your specific domain data for improved performance on specialized tasks.

Community & Support


Built with ❤️ by Qub△se

Empowering developers to create amazing AI applications