Frequently Asked Questions¶
Common questions and answers about LLMBuilder usage, troubleshooting, and best practices.
🚀 Getting Started¶
Q: What are the minimum system requirements?¶
A: LLMBuilder requires:
- Python 3.8+ (3.9+ recommended)
- 4GB RAM minimum (8GB+ recommended)
- 2GB free disk space for installation and basic models
- Optional: NVIDIA GPU with 4GB+ VRAM for faster training
Q: Should I use CPU or GPU for training?¶
A:
- CPU: Good for learning, small models, and development. Use `preset="cpu_small"`
- GPU: Recommended for production training and larger models. Use `preset="gpu_medium"` or `preset="gpu_large"`
- Mixed: Start with CPU for prototyping, then move to GPU for final training
Q: How long does it take to train a model?¶
A: Training time depends on several factors:
- Small model (10M params): 30 minutes - 2 hours on CPU, 5-15 minutes on GPU
- Medium model (50M params): 2-8 hours on CPU, 30 minutes - 2 hours on GPU
- Large model (200M+ params): Days on CPU, 2-12 hours on GPU
🔧 Configuration¶
Q: Which configuration preset should I use?¶
A: Choose based on your hardware and use case:
| Preset | Use Case | Hardware | Model Size | Training Time |
|---|---|---|---|---|
| `tiny` | Testing, debugging | Any | ~1M params | Minutes |
| `cpu_small` | Learning, development | CPU | ~10M params | Hours |
| `gpu_medium` | Production training | Single GPU | ~50M params | Hours |
| `gpu_large` | High-quality models | High-end GPU | ~200M+ params | Days |
Q: How do I customize model architecture?¶
A: Modify the model configuration:
```python
from llmbuilder.config import ModelConfig

config = ModelConfig(
    vocab_size=16000,     # Match your tokenizer
    num_layers=12,        # More layers = more capacity
    num_heads=12,         # Should divide embedding_dim evenly
    embedding_dim=768,    # Larger = more capacity
    max_seq_length=1024,  # Longer sequences = more memory
    dropout=0.1           # Higher = more regularization
)
```
Q: What vocabulary size should I use?¶
A: Vocabulary size depends on your data and use case:
- 8K-16K: Small datasets, specific domains
- 16K-32K: General purpose, balanced size
- 32K-64K: Large datasets, multilingual models
- 64K+: Very large datasets, maximum coverage
📊 Data and Training¶
Q: How much training data do I need?¶
A: Data requirements vary by model size and quality goals:
- Minimum: 1MB of text (~200K words) for basic functionality
- Recommended: 10MB+ of text (~2M words) for good quality
- Optimal: 100MB+ of text (~20M words) for high quality
- Production: 1GB+ of text (~200M words) for best results
Q: What file formats are supported for training data?¶
A: LLMBuilder supports:
- Text files: `.txt`, `.md` (best quality)
- Documents: `.pdf`, `.docx` (good quality)
- Web content: `.html`, `.htm` (moderate quality)
- Presentations: `.pptx` (basic support)
- Data files: `.csv`, `.json` (with proper formatting)
Q: How do I handle out-of-memory errors?¶
A: Try these solutions in order:
- Reduce batch size: Start with `batch_size=1` and scale up from there
- Enable gradient checkpointing: Trades extra compute for lower memory
- Use gradient accumulation: Simulate larger effective batches (sketched below)
- Reduce sequence length: Shorter sequences use less memory
- Use CPU training: Slower, but not limited by GPU memory
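The gradient-accumulation idea in particular is easy to see in plain PyTorch. The sketch below is generic and independent of LLMBuilder's training loop: gradients from several small batches are summed before a single optimizer step, giving the effect of a larger batch without the memory cost.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)                    # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 8                                # effective batch = 8 micro-batches

for step in range(32):
    x = torch.randn(1, 768)                    # micro-batch of size 1
    loss = model(x).pow(2).mean()              # dummy loss for illustration
    (loss / accum_steps).backward()            # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one optimizer step per 8 micro-batches
        optimizer.zero_grad()
```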
Q: My model isn't learning (loss not decreasing). What's wrong?¶
A: Common causes and solutions:
- Learning rate too high: Reduce to 1e-4 or 1e-5
- Learning rate too low: Increase to 3e-4 or 5e-4
- Bad data: Check for corrupted or repetitive text
- Wrong tokenizer: Ensure vocab_size matches tokenizer
- Insufficient warmup: Increase warmup_steps to 1000+
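If you want to see what the warmup advice means concretely, here is a linear warmup over the first 1000 steps in plain PyTorch. This is a generic sketch, not LLMBuilder's scheduler; adjust `warmup_steps` to match your config.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)                    # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min(1.0, (step + 1) / warmup_steps),  # ramp LR from ~0 up to 3e-4
)
# Call scheduler.step() once per training step, after optimizer.step().
```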
Q: How do I know if my model is overfitting?¶
A: Signs of overfitting:
- Training loss decreases but validation loss increases
- Generated text is repetitive or memorized
- Model performs poorly on new data
Solutions:
- Increase dropout rate (0.1 → 0.2)
- Add weight decay (0.01)
- Use early stopping
- Get more training data
- Reduce model size
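Two of those solutions are one-liners in plain PyTorch terms. The sketch below (generic, with a stand-in model and dummy validation losses) shows weight decay on the optimizer and a simple patience-based early-stopping check.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)                    # stand-in for the real model

# weight_decay=0.01 adds L2-style regularization on top of dropout
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Early stopping: stop once validation loss has not improved for `patience` epochs
best, patience, bad = float("inf"), 3, 0
for val_loss in [2.10, 1.90, 1.85, 1.86, 1.88, 1.90]:  # dummy per-epoch values
    if val_loss < best:
        best, bad = val_loss, 0                # improvement: reset the counter
    else:
        bad += 1
        if bad >= patience:
            break                              # validation loss has plateaued
```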
🎯 Text Generation¶
Q: How do I improve generation quality?¶
A: Try these techniques:
- Adjust temperature:
  - Lower (0.3-0.7): More focused, predictable
  - Higher (0.8-1.2): More creative, diverse
- Use nucleus sampling:

```python
config = GenerationConfig(
    temperature=0.8,
    top_p=0.9,  # Nucleus sampling
    top_k=50    # Top-k sampling
)
```

- Add repetition penalty: A value of 1.1-1.3 usually works well
- Better prompts:
  - Be specific and clear
  - Provide context and examples
  - Use consistent formatting
Q: Why is my generated text repetitive?¶
A: Common causes and fixes:
- Insufficient training: Train for more epochs
- Poor sampling: Use top-p/top-k sampling instead of greedy
- Low temperature: Increase temperature to 0.8+
- Add repetition penalty: Set to 1.1-1.3
- Prevent n-gram repetition: Set `no_repeat_ngram_size=3`
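Putting those fixes together, a sampling setup along the following lines is a reasonable starting point. The field names mirror the parameters mentioned above; double-check that they match the GenerationConfig fields in your LLMBuilder version.

```python
config = GenerationConfig(
    temperature=0.9,          # higher temperature reduces deterministic loops
    top_p=0.9,                # nucleus sampling instead of greedy decoding
    top_k=50,
    repetition_penalty=1.2,   # penalize recently generated tokens
    no_repeat_ngram_size=3,   # block exact 3-gram repeats
)
```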
Q: How do I make generation faster?¶
A: Speed optimization techniques:
- Use GPU: Much faster than CPU
- Reduce max_tokens: Generate shorter responses
- Use greedy decoding: Set `do_sample=False`
- Enable model compilation: Set `compile=True` (PyTorch 2.0+)
- Quantize model: Use 8-bit or 16-bit precision
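If you are working with the underlying PyTorch module directly, compilation is a one-line change on PyTorch 2.0+. The sketch below uses a stand-in module; how you get at the trained model object depends on your setup.

```python
import torch
import torch.nn as nn

# Stand-in for the trained model; replace with your actual torch.nn.Module
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))

model = torch.compile(model)   # PyTorch 2.0+: JIT-compiles the forward pass

with torch.no_grad():
    out = model(torch.randn(1, 768))
```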
🔄 Fine-tuning¶
Q: When should I fine-tune vs. train from scratch?¶
A:
- Fine-tune when: You have a pre-trained model and domain-specific data
- Train from scratch when: You have lots of data and need full control
- Fine-tuning advantages: Faster, less data needed, preserves general knowledge
- Training advantages: Full customization, no dependency on base model
Q: What's the difference between LoRA and full fine-tuning?¶
A:
| Aspect | LoRA | Full Fine-tuning |
|---|---|---|
| Memory | Low (~1% of params) | High (all params) |
| Speed | Fast | Slower |
| Quality | Good for most tasks | Best possible |
| Flexibility | Limited adaptation | Full adaptation |
| Use case | Domain adaptation | Major architecture changes |
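The memory row in that table follows directly from how LoRA works: the pretrained weight is frozen and only a low-rank update is trained. Below is a minimal, framework-agnostic PyTorch sketch of a LoRA-style linear layer; it illustrates the idea and is not LLMBuilder's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.requires_grad_(False)                     # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))   # zero init: update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(768, 768, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")       # ~2% for one layer; far less model-wide
```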
Q: How do I prevent catastrophic forgetting during fine-tuning?¶
A: Use these techniques:
- Lower learning rate: 1e-5 to 5e-5
- Fewer epochs: 3-5 epochs usually sufficient
- Regularization: Add weight decay (0.01)
- LoRA: Preserves base model weights
- Mixed training: Include general data with domain data
🚀 Deployment¶
Q: How do I deploy my trained model?¶
A: LLMBuilder supports multiple deployment options:
- GGUF format: For llama.cpp and CPU/edge inference
- ONNX format: For cross-platform, mobile, and cloud deployment
- Quantized PyTorch: For production use within the PyTorch ecosystem
Q: Which export format should I choose?¶
A: Choose based on your deployment target:
- GGUF: CPU inference, llama.cpp compatibility, edge devices
- ONNX: Cross-platform, mobile apps, cloud services
- Quantized PyTorch: PyTorch ecosystem, balanced performance
- HuggingFace: Easy sharing, transformers compatibility
Q: How do I reduce model size for deployment?¶
A: Size reduction techniques:
- Quantization: 8-bit (50% smaller) or 4-bit (75% smaller)
- Pruning: Remove least important weights
- Distillation: Train smaller model to mimic larger one
- Architecture optimization: Use efficient attention mechanisms
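For the quantization route, stock PyTorch dynamic quantization gives a quick size win on CPU without retraining. This is a generic sketch with a stand-in model, not necessarily what LLMBuilder's export path does internally.

```python
import torch
import torch.nn as nn

# Stand-in for the trained model; replace with your actual torch.nn.Module
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))

# Convert Linear layers to 8-bit dynamic quantization (roughly halves their size)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```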
🐛 Troubleshooting¶
Q: I get "CUDA out of memory" errors. What should I do?¶
A: Try these solutions:
- Reduce batch size: Start with batch_size=1
- Enable gradient checkpointing: Trades compute for memory
- Use gradient accumulation: Simulate larger batches
- Reduce sequence length: Shorter sequences use less memory
- Use CPU: Slower but no memory limits
- Clear GPU cache: `torch.cuda.empty_cache()`
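Gradient checkpointing is worth a closer look because it is the biggest memory lever after batch size. In plain PyTorch it looks like the sketch below (generic, with a stand-in block): activations inside the checkpointed block are recomputed during the backward pass instead of being stored.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))  # stand-in block
x = torch.randn(4, 768, requires_grad=True)

# Forward through the block without storing intermediate activations;
# they are recomputed when backward() needs them.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```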
Q: Training is very slow. How can I speed it up?¶
A: Speed optimization:
- Use GPU: 10-100x faster than CPU
- Increase batch size: Better GPU utilization
- Enable mixed precision: `fp16` or `bf16`
- Use multiple GPUs: Distributed training
- Optimize data loading: More workers, pin memory
- Compile model: PyTorch 2.0 compilation
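Mixed precision in particular usually gives the largest single speedup on modern GPUs. The sketch below is plain PyTorch AMP (not LLMBuilder's fp16/bf16 switch) and needs a CUDA device.

```python
import torch
import torch.nn as nn

device = "cuda"                                  # AMP as shown here requires a GPU
model = nn.Linear(768, 768).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(8, 768, device=device)
    with torch.cuda.amp.autocast():              # forward pass runs in reduced precision
        loss = model(x).pow(2).mean()            # dummy loss for illustration
    scaler.scale(loss).backward()                # loss scaling avoids fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```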
Q: My tokenizer produces weird results. What's wrong?¶
A: Common tokenizer issues:
- Wrong vocabulary size: Must match model config
- Insufficient training data: Need diverse text corpus
- Character coverage too low: Increase to 0.9999
- Wrong model type: BPE usually works best
- Missing special tokens: Include `<pad>`, `<unk>`, etc.
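If you want to experiment with these knobs outside of LLMBuilder, the standalone sentencepiece library exposes the same ideas directly. The sketch below is illustrative only; `corpus.txt` is a placeholder path, and whether LLMBuilder uses SentencePiece under the hood depends on the tokenizer backend you chose.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",              # placeholder: your cleaned training text
    model_prefix="tokenizer",
    vocab_size=16000,                # must match the model's vocab_size
    model_type="bpe",                # BPE usually works best
    character_coverage=0.9999,       # high coverage so rare characters are kept
    unk_id=0, bos_id=1, eos_id=2, pad_id=3,  # reserve ids for special tokens
)
```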
Q: Generated text contains strange characters or formatting¶
A: Text cleaning solutions:
- Improve data cleaning: Remove unwanted characters
- Filter by language: Keep only desired languages
- Normalize text: Fix encoding issues
- Add text filters: Remove specific patterns
- Better tokenizer training: Use cleaner training data
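A small normalization pass before tokenizer and model training catches most of these issues. The function below is a generic sketch using only the Python standard library; extend the filters to match your own data.

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # NFKC normalization folds visually identical characters into one encoding
    text = unicodedata.normalize("NFKC", text)
    # Drop control/format characters that survive PDF and HTML extraction
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C" or ch in "\n\t")
    # Collapse runs of spaces and tabs
    return re.sub(r"[ \t]+", " ", text).strip()

print(clean_text("zero\u200bwidth\u00a0and\tweird   spacing"))
```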
💡 Best Practices¶
Q: What are the most important best practices?¶
A: Key recommendations:
- Start small: Begin with tiny models and scale up
- Clean your data: Quality over quantity
- Monitor training: Watch loss curves and generation quality
- Save checkpoints: Protect against failures
- Validate everything: Test configurations before long training
- Document experiments: Keep track of what works
Q: How do I choose hyperparameters?¶
A: Hyperparameter selection guide:
- Learning rate: Start with 3e-4, adjust based on loss curves
- Batch size: Largest that fits in memory
- Model size: Balance quality needs with resources
- Sequence length: Match your use case requirements
- Dropout: 0.1 is usually good, increase if overfitting
Q: How do I evaluate model quality?¶
A: Evaluation methods:
- Perplexity: Lower is better (< 20 is good)
- Generation quality: Manual inspection of outputs
- Task-specific metrics: BLEU, ROUGE for specific tasks
- Human evaluation: Best but most expensive
- Automated metrics: Coherence, fluency scores
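Perplexity in particular falls straight out of the validation loss: for a per-token cross-entropy loss, perplexity is just its exponential. A quick sanity-check calculation:

```python
import math

avg_val_loss = 2.7                     # mean per-token cross-entropy on held-out text
perplexity = math.exp(avg_val_loss)    # exp(2.7) ≈ 14.9, under the ~20 "good" threshold
print(f"perplexity = {perplexity:.1f}")
```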
🆘 Getting Help¶
Q: Where can I get help if I'm stuck?¶
A: Support resources:
- Documentation: Complete guides and examples
- GitHub Issues: Report bugs and request features
- GitHub Discussions: Community Q&A
- Examples: Working code samples
- Stack Overflow: Tag questions with `llmbuilder`
Q: How do I report a bug?¶
A: When reporting bugs, include:
- LLMBuilder version: `llmbuilder --version`
- Python version: `python --version`
- Operating system: Windows/macOS/Linux
- Hardware: CPU/GPU specifications
- Error message: Full traceback
- Minimal example: Code to reproduce the issue
- Configuration: Model and training configs used
Q: How can I contribute to LLMBuilder?¶
A: Ways to contribute:
- Report bugs: Help improve stability
- Request features: Suggest improvements
- Submit PRs: Code contributions welcome
- Improve docs: Fix typos, add examples
- Share examples: Help other users
- Test releases: Try beta versions
Still have questions?
If you can't find the answer here, check our GitHub Discussions or create a new issue. The community is always happy to help!