Skip to content

LLMBuilder CLI Guide

This document explains how to use the LLMBuilder command-line interface. It lists all available commands, their inputs/outputs, and Windows-friendly examples.

  • Executable: llmbuilder
  • Show help: llmbuilder --help
  • Version: llmbuilder --version

Tip for Windows: wrap prompts in double quotes, and use full paths when in doubt.


Global

  • llmbuilder --help — Show top-level help.
  • llmbuilder --version — Show package version.
  • llmbuilder -v — Enable verbose mode for top-level command (shows extra output in some subcommands).
  • llmbuilder welcome — Friendly welcome and quick actions.
  • llmbuilder info — Display package information and module overview.

Examples:

  • llmbuilder info
  • llmbuilder welcome

Data Commands

Group: llmbuilder data

1) Load and preprocess text

llmbuilder data load — Load text from files/directories, optionally clean, and save to a single file.

Inputs:

  • --input, -i (file or directory)
  • --output, -o (output .txt file)
  • --format [txt|pdf|docx|all] (default: all)
  • --clean (flag to clean text)
  • --min-length (minimum text length to keep; default: 50)
  • --interactive (guided mode)

Outputs:

  • A single text file with processed content; progress messages.

Examples:

  • Directory to single file: llmbuilder data load -i D:\data -o D:\out\combined.txt --clean --min-length 80
  • Single file: llmbuilder data load -i D:\docs\paper.pdf -o D:\out\paper.txt
  • Interactive: llmbuilder data load --interactive

2) Ingest multi-format documents

llmbuilder data ingest — Batch-process multiple formats with optional OCR fallback.

Inputs:

  • --input, -i (file or directory)
  • --output, -o (output directory)
  • --formats (multiple) [html|markdown|epub|pdf|all] (default: all)
  • --batch-size (default: 100)
  • --workers (default: 4)
  • --ocr-fallback (flag for PDFs)
  • --verbose, -v

Outputs:

  • Processed files written into the output directory and a summary (processed/succeeded/failed).

Example:

  • llmbuilder data ingest -i D:\raw_docs -o D:\processed --formats html --formats pdf --ocr-fallback -v

3) Deduplicate text

llmbuilder data deduplicate — Remove exact/semantic duplicates from text.

Inputs:

  • --input, -i (file or directory containing .txt files)
  • --output, -o (output file for deduplicated lines)
  • --method [exact|semantic|both] (default: both)
  • --similarity-threshold (0.0–1.0, default: 0.85)
  • --batch-size (default: 1000)
  • --embedding-model (default: all-MiniLM-L6-v2)
  • --verbose, -v

Outputs:

  • Deduplicated text file, with stats printed.

Example:

  • llmbuilder data deduplicate -i D:\text_corpus -o D:\out\dedup.txt --method both --similarity-threshold 0.9 -v

Tokenizer Commands

Group: llmbuilder tokenizer

1) Train tokenizer

llmbuilder tokenizer train

Inputs:

  • --input, -i (text file or directory)
  • --output, -o (output directory)
  • --vocab-size (default: 16000)
  • --algorithm [bpe|unigram|wordpiece|sentencepiece] (default: bpe)
  • --special-tokens (multiple) e.g. <pad> <unk>
  • --min-frequency (default: 2)
  • --coverage (SentencePiece only, default: 0.9995)
  • --validate (flag)
  • --verbose, -v

Outputs:

  • Tokenizer model and vocab files saved under the output directory with training stats.

Examples:

  • BPE: llmbuilder tokenizer train -i D:\out\combined.txt -o D:\out\tokenizer --vocab-size 8000 --algorithm bpe --validate
  • SentencePiece: llmbuilder tokenizer train -i D:\out\combined.txt -o D:\out\sp_tokenizer --algorithm sentencepiece --coverage 0.999

2) Test tokenizer

llmbuilder tokenizer test

Inputs:

  • tokenizer_path (positional; directory containing tokenizer)
  • One of:
  • --text, -t (quick test string)
  • --file, -f (file to tokenize)
  • --interactive, -i (interactive loop)

Outputs:

  • Encoded tokens, decoded text, and token counts.

Examples:

  • Single text: llmbuilder tokenizer test D:\out\tokenizer -t "Hello world"
  • Interactive: llmbuilder tokenizer test D:\out\tokenizer -i

Training Commands

Group: llmbuilder train

1) Train a model from scratch

llmbuilder train model

Inputs:

  • --data, -d (path to tokenized dataset or compatible data file)
  • --tokenizer, -t (tokenizer directory)
  • --output, -o (checkpoint output directory)
  • --epochs (default: 10)
  • --batch-size (default: 32)
  • --lr (default: 3e-4)
  • --vocab-size (default: 16000)
  • --layers (default: 8)
  • --heads (default: 8)
  • --dim (embedding dim; default: 512)
  • --config, -c (optional config file, if supported)

Outputs:

  • Training progress, checkpoints in output directory.

Example:

  • llmbuilder train model -d D:\LLM\Model_Test\output\tokenized_data.pt -t D:\LLM\Model_Test\output\tokenizer -o D:\LLM\Model_Test\output\checkpoints --epochs 5 --batch-size 1 --lr 6e-4 --vocab-size 8000 --layers 4 --heads 8 --dim 256

2) Resume training

llmbuilder train resume

Inputs:

  • --checkpoint, -c (path to existing checkpoint)
  • --data, -d (dataset path)
  • --output, -o (optional; defaults to checkpoint directory)

Outputs:

  • Training resumed; new checkpoints written.

Example:

  • llmbuilder train resume -c D:\LLM\Model_Test\output\checkpoints\checkpoint_epoch_2.pt -d D:\LLM\Model_Test\output\tokenized_data.pt

Fine-tuning Commands

Group: llmbuilder finetune

Fine-tune a pre-trained model

llmbuilder finetune model

Inputs:

  • --model, -m (path to pre-trained model)
  • --dataset, -d (dataset path)
  • --output, -o (output directory)
  • --epochs (default: 3)
  • --lr (default: 5e-5)
  • --batch-size (default: 4)
  • --use-lora (flag)
  • --lora-rank (default: 4)

Outputs:

  • Fine-tuned model checkpoints and summary (best validation loss etc.).

Example:

  • llmbuilder finetune model -m D:\models\base.pt -d D:\data\fine_tune.pt -o D:\out\ft --epochs 3 --use-lora

Generation Commands

Group: llmbuilder generate

Generate text

llmbuilder generate text

Inputs:

  • --model, -m (checkpoint path). Use latest_checkpoint.pt or a specific checkpoint_epoch_X.pt.
  • --tokenizer, -t (tokenizer directory)
  • --prompt, -p (text prompt)
  • --interactive, -i (interactive chat-like mode)
  • --max-tokens (default: 100)
  • --temperature (default: 0.8)
  • --top-k (default: 50)
  • --top-p (default: 0.9)
  • --device (cpu|cuda; optional)
  • --setup (guided setup)

Outputs:

  • Generated text printed to console; in interactive mode, a live session.

Examples:

  • One-shot: llmbuilder generate text -m D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt -t D:\LLM\Model_Test\output\tokenizer -p "Cybersecurity is important because" --max-tokens 120 --temperature 0.8 --top-k 50 --top-p 0.9
  • Interactive: llmbuilder generate text -m D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt -t D:\LLM\Model_Test\output\tokenizer --interactive
  • Guided setup: llmbuilder generate text --setup

Notes:

  • If best_checkpoint.pt is missing, use latest_checkpoint.pt.
  • Ensure your tokenizer matches the model used for training.

Model Management Commands

Group: llmbuilder model

1) Create a new empty model

llmbuilder model create

Inputs:

  • --output, -o (path to save)
  • --vocab-size (default: 16000)
  • --layers (default: 8)
  • --heads (default: 8)
  • --dim (default: 512)
  • --config, -c (optional)

Outputs:

  • Saved model file and parameter counts printed.

Example:

  • llmbuilder model create -o D:\models\new.pt --vocab-size 8000 --layers 4 --heads 8 --dim 256

2) Show model info

llmbuilder model info <model_path>

Outputs:

  • Total/trainable parameters and architecture details (if available).

Example:

  • llmbuilder model info D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt

3) Evaluate a model

llmbuilder model evaluate

Inputs:

  • --dataset, -d (evaluation dataset path)
  • --batch-size (default: 32)

Outputs:

  • Evaluation metrics (loss, perplexity if available).

Example:

  • llmbuilder model evaluate D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt -d D:\LLM\Model_Test\output\tokenized_data.pt

Export Commands

Group: llmbuilder export

Export to GGUF (llama.cpp compatibility)

llmbuilder export gguf

Inputs:

  • model_path (positional; path to source model/checkpoint)
  • --output, -o (output GGUF file path)
  • --quantization [Q8_0|Q4_0|Q4_1|Q5_0|Q5_1|F16|F32] (default: Q8_0)
  • --validate (flag; run post-conversion validation if available)
  • --verbose, -v (flag; print extra diagnostics and script discovery)

Outputs:

  • A .gguf model file written to the output path. Summary includes size, time, and quantization details. On --validate, prints validation result.

Examples:

  • Convert with default Q8_0: llmbuilder export gguf D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt -o D:\LLM\Model_Test\output\models\latest_q8.gguf
  • Convert with Q4_0 and validate: llmbuilder export gguf D:\models\my_model.pt -o D:\models\my_model_q4.gguf --quantization Q4_0 --validate -v

Notes:

  • Ensure sufficient disk space for the converted file.
  • Quantization level affects size and speed/quality tradeoffs.

Tips & Troubleshooting

  • No best checkpoint: If validation is disabled or failed to split, only latest_checkpoint.pt is produced. Use that for generation. After upgrading to a version with robust splits and retraining, best_checkpoint.pt will appear.
  • Colors on Windows: Colors are enabled automatically via colorama. If you don’t see colors, ensure llmbuilder>=0.4.5 is installed.
  • Progress bars: Data loading, training, and validation show tqdm bars when console output is enabled.
  • Paths: Prefer absolute paths on Windows to avoid confusion.
  • Help: Every command supports --help.

Quick Start Workflow

1) Prepare data

llmbuilder data load -i D:\data -o D:\LLM\Model_Test\output\processed_data.txt --clean

2) Train tokenizer

llmbuilder tokenizer train -i D:\LLM\Model_Test\output\processed_data.txt -o D:\LLM\Model_Test\output\tokenizer --vocab-size 8000 --algorithm bpe --validate

3) Train model

llmbuilder train model -d D:\LLM\Model_Test\output\tokenized_data.pt -t D:\LLM\Model_Test\output\tokenizer -o D:\LLM\Model_Test\output\checkpoints --epochs 5 --batch-size 1 --lr 6e-4 --vocab-size 8000 --layers 4 --heads 8 --dim 256

4) Generate text

llmbuilder generate text -m D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt -t D:\LLM\Model_Test\output\tokenizer -p "Cybersecurity is important because" --max-tokens 120

Cross‑platform Quick Commands

Below are ready-to-run snippets for Windows PowerShell and Linux/macOS Bash/Zsh.

Data: Load

PowerShell

llmbuilder data load -i "D:\data" -o "D:\LLM\Model_Test\output\processed_data.txt" --clean --min-length 80

Bash/Zsh

llmbuilder data load -i "/data" -o "/mnt/llm/Model_Test/output/processed_data.txt" --clean --min-length 80

Data: Ingest

PowerShell

llmbuilder data ingest -i "D:\raw_docs" -o "D:\processed" --formats html --formats pdf --ocr-fallback -v

Bash/Zsh

llmbuilder data ingest -i "/data/raw_docs" -o "/data/processed" --formats html --formats pdf --ocr-fallback -v

Data: Deduplicate

PowerShell

llmbuilder data deduplicate -i "D:\text_corpus" -o "D:\out\dedup.txt" --method both --similarity-threshold 0.9 -v

Bash/Zsh

llmbuilder data deduplicate -i "/data/text_corpus" -o "/data/out/dedup.txt" --method both --similarity-threshold 0.9 -v

Tokenizer: Train (BPE)

PowerShell

llmbuilder tokenizer train -i "D:\LLM\Model_Test\output\processed_data.txt" -o "D:\LLM\Model_Test\output\tokenizer" --vocab-size 8000 --algorithm bpe --validate

Bash/Zsh

llmbuilder tokenizer train -i "/mnt/llm/Model_Test/output/processed_data.txt" -o "/mnt/llm/Model_Test/output/tokenizer" --vocab-size 8000 --algorithm bpe --validate

Tokenizer: Test (interactive)

PowerShell

llmbuilder tokenizer test "D:\LLM\Model_Test\output\tokenizer" -i

Bash/Zsh

llmbuilder tokenizer test "/mnt/llm/Model_Test/output/tokenizer" -i

Train: From scratch

PowerShell

llmbuilder train model -d "D:\LLM\Model_Test\output\tokenized_data.pt" -t "D:\LLM\Model_Test\output\tokenizer" -o "D:\LLM\Model_Test\output\checkpoints" --epochs 5 --batch-size 1 --lr 6e-4 --vocab-size 8000 --layers 4 --heads 8 --dim 256

Bash/Zsh

llmbuilder train model -d "/mnt/llm/Model_Test/output/tokenized_data.pt" -t "/mnt/llm/Model_Test/output/tokenizer" -o "/mnt/llm/Model_Test/output/checkpoints" --epochs 5 --batch-size 1 --lr 6e-4 --vocab-size 8000 --layers 4 --heads 8 --dim 256

Train: Resume

PowerShell

llmbuilder train resume -c "D:\LLM\Model_Test\output\checkpoints\checkpoint_epoch_2.pt" -d "D:\LLM\Model_Test\output\tokenized_data.pt"

Bash/Zsh

llmbuilder train resume -c "/mnt/llm/Model_Test/output/checkpoints/checkpoint_epoch_2.pt" -d "/mnt/llm/Model_Test/output/tokenized_data.pt"

Generate: One-shot

PowerShell

llmbuilder generate text -m "D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt" -t "D:\LLM\Model_Test\output\tokenizer" -p "Cybersecurity is important because" --max-tokens 120 --temperature 0.8 --top-k 50 --top-p 0.9

Bash/Zsh

llmbuilder generate text -m "/mnt/llm/Model_Test/output/checkpoints/latest_checkpoint.pt" -t "/mnt/llm/Model_Test/output/tokenizer" -p "Cybersecurity is important because" --max-tokens 120 --temperature 0.8 --top-k 50 --top-p 0.9

Generate: Interactive

PowerShell

llmbuilder generate text -m "D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt" -t "D:\LLM\Model_Test\output\tokenizer" --interactive

Bash/Zsh

llmbuilder generate text -m "/mnt/llm/Model_Test/output/checkpoints/latest_checkpoint.pt" -t "/mnt/llm/Model_Test/output/tokenizer" --interactive

Model: Create

PowerShell

llmbuilder model create -o "D:\models\new.pt" --vocab-size 8000 --layers 4 --heads 8 --dim 256

Bash/Zsh

llmbuilder model create -o "/models/new.pt" --vocab-size 8000 --layers 4 --heads 8 --dim 256

Model: Info

PowerShell

llmbuilder model info "D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt"

Bash/Zsh

llmbuilder model info "/mnt/llm/Model_Test/output/checkpoints/latest_checkpoint.pt"

Model: Evaluate

PowerShell

llmbuilder model evaluate "D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt" -d "D:\LLM\Model_Test\output\tokenized_data.pt" --batch-size 8

Bash/Zsh

llmbuilder model evaluate "/mnt/llm/Model_Test/output/checkpoints/latest_checkpoint.pt" -d "/mnt/llm/Model_Test/output/tokenized_data.pt" --batch-size 8

Export: GGUF (Q8_0)

PowerShell

llmbuilder export gguf "D:\LLM\Model_Test\output\checkpoints\latest_checkpoint.pt" -o "D:\LLM\Model_Test\output\models\latest_q8.gguf"

Bash/Zsh

llmbuilder export gguf "/mnt/llm/Model_Test/output/checkpoints/latest_checkpoint.pt" -o "/mnt/llm/Model_Test/output/models/latest_q8.gguf"

Export: GGUF (Q4_0, validate)

PowerShell

llmbuilder export gguf "D:\models\my_model.pt" -o "D:\models\my_model_q4.gguf" --quantization Q4_0 --validate -v

Bash/Zsh

llmbuilder export gguf "/models/my_model.pt" -o "/models/my_model_q4.gguf" --quantization Q4_0 --validate -v

Screenshots

Add images under docs/images/ and reference them here or in the README.

  • Welcome screen: ![Welcome](docs/images/cli-welcome.png)
  • Data load progress: ![Data Load](docs/images/cli-data-load.png)
  • Training progress bars: ![Training](docs/images/cli-training.png)
  • Validation loop: ![Validation](docs/images/cli-validation.png)
  • Generate (interactive): ![Generate Interactive](docs/images/cli-generate-interactive.png)
  • Export GGUF: ![Export GGUF](docs/images/cli-export-gguf.png)

Tip (Windows): Alt+PrintScreen captures the active window; save PNG to docs/images/.


Maintained by Qub△se. For more, see the repository wiki and llmbuilder info.