Text Generation
Text generation is where your trained language model comes to life. LLMBuilder provides flexible text generation with multiple sampling strategies, interactive and streaming modes, and fine-grained configuration options.
🎯 Generation Overview
Text generation turns a prompt into new text through the pipeline below:
graph LR
A[Prompt] --> B[Tokenizer]
B --> C[Model]
C --> D[Sampling Strategy]
D --> E[Generated Tokens]
E --> F[Detokenizer]
F --> G[Generated Text]
D --> D1[Greedy]
D --> D2[Top-k]
D --> D3[Top-p]
D --> D4[Temperature]
style A fill:#e1f5fe
style G fill:#e8f5e8
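Conceptually, the loop behind this pipeline is small: encode the prompt, repeatedly ask the model for next-token logits, sample a token, and append it until done. A minimal sketch in plain PyTorch (the model and tokenizer interfaces here are illustrative assumptions, not LLMBuilder's API):
import torch
def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=0.8):
    # Encode the prompt into token ids with shape [1, seq_len].
    ids = torch.tensor([tokenizer.encode(prompt)])
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids)[:, -1, :]  # logits for the next position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token
        ids = torch.cat([ids, next_id], dim=-1)  # append and continue
    return tokenizer.decode(ids[0].tolist())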
🚀 Quick Start
CLI Generation
# Interactive generation setup
llmbuilder generate text --setup
# Direct generation
llmbuilder generate text \
--model ./model/model.pt \
--tokenizer ./tokenizer \
--prompt "The future of AI is" \
--max-tokens 100 \
--temperature 0.8
# Interactive chat mode
llmbuilder generate text \
--model ./model/model.pt \
--tokenizer ./tokenizer \
--interactive
Python API Generation
import llmbuilder as lb
# Simple generation
text = lb.generate_text(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompt="The future of AI is",
max_new_tokens=100,
temperature=0.8
)
print(text)
# Interactive generation
lb.interactive_cli(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
temperature=0.8
)
⚙️ Generation Parameters
Core Parameters
from llmbuilder.inference import GenerationConfig
config = GenerationConfig(
# Length control
max_new_tokens=100, # Maximum tokens to generate
min_new_tokens=10, # Minimum tokens to generate
max_length=1024, # Total sequence length limit
# Sampling parameters
temperature=0.8, # Creativity (0.1-2.0)
top_k=50, # Top-k sampling
top_p=0.9, # Nucleus sampling
repetition_penalty=1.1, # Prevent repetition
# Special tokens
pad_token_id=0,
eos_token_id=2,
bos_token_id=1,
# Generation strategy
do_sample=True, # Use sampling vs greedy
num_beams=1, # Beam search width
early_stopping=True # Stop at EOS token
)
Advanced Parameters
config = GenerationConfig(
# Advanced sampling
typical_p=0.95, # Typical sampling
eta_cutoff=1e-4, # Eta sampling cutoff
epsilon_cutoff=1e-4, # Epsilon sampling cutoff
# Repetition control
repetition_penalty=1.1,
no_repeat_ngram_size=3, # Prevent n-gram repetition
encoder_repetition_penalty=1.0,
# Length penalties
length_penalty=1.0, # Beam search length penalty
exponential_decay_length_penalty=None,
# Diversity
num_beam_groups=1, # Diverse beam search
diversity_penalty=0.0,
# Stopping criteria
max_time=None, # Maximum generation time
stop_strings=["</s>", "\n\n"], # Custom stop strings
)
🎨 Sampling Strategies
1. Greedy Decoding
Always choose the most likely token:
config = GenerationConfig(
do_sample=False, # Disable sampling
temperature=1.0, # Not used in greedy
top_k=None, # Not used in greedy
top_p=None # Not used in greedy
)
text = lb.generate_text(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompt="Machine learning is",
config=config
)
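Under the hood, greedy decoding reduces each step to an argmax over the next-token logits; a tiny illustration (the logits tensor here is hypothetical):
import torch
logits = torch.tensor([1.2, 3.4, 0.7])  # hypothetical next-token logits
next_id = torch.argmax(logits, dim=-1)  # always the single most likely token (index 1)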
Use cases:
- Deterministic output needed
- Factual question answering
- Code generation
- Translation tasks
2. Temperature Sampling
Control randomness with temperature:
# Conservative (more predictable)
conservative_config = GenerationConfig(
temperature=0.3, # Low temperature
do_sample=True
)
# Balanced
balanced_config = GenerationConfig(
temperature=0.8, # Medium temperature
do_sample=True
)
# Creative (more diverse)
creative_config = GenerationConfig(
temperature=1.5, # High temperature
do_sample=True
)
Temperature effects (see the sketch after this list):
- 0.1-0.3: Very focused, predictable
- 0.5-0.8: Balanced creativity
- 1.0-1.5: More creative, diverse
- 1.5+: Very creative, potentially incoherent
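The mechanism behind these effects is simple: logits are divided by the temperature before the softmax, so values below 1.0 sharpen the distribution and values above 1.0 flatten it. A self-contained sketch:
import torch
def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # T < 1 sharpens the distribution; T > 1 flattens it toward uniform.
    return torch.softmax(logits / temperature, dim=-1)
logits = torch.tensor([2.0, 1.0, 0.0])
for t in (0.3, 0.8, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs.tolist()])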
3. Top-k Sampling
Sample only from the k most likely tokens:
config = GenerationConfig(
do_sample=True,
temperature=0.8,
top_k=40, # Consider top 40 tokens
top_p=None # Disable nucleus sampling
)
Top-k values (illustrated in the sketch after this list):
- 1: Greedy decoding
- 10-20: Conservative sampling
- 40-100: Balanced sampling
- 200+: Very diverse sampling
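Internally, top-k masks everything outside the k highest logits before sampling; a sketch of the standard filter:
import torch
def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    # Set every logit below the k-th largest to -inf so it can never be sampled.
    kth_value = torch.topk(logits, k).values[..., -1, None]
    return logits.masked_fill(logits < kth_value, float("-inf"))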
4. Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability mass exceeds p:
config = GenerationConfig(
do_sample=True,
temperature=0.8,
top_k=None, # Disable top-k
top_p=0.9 # Use top 90% probability mass
)
Top-p values (see the sketch after this list):
- 0.1-0.3: Very focused
- 0.5-0.7: Balanced
- 0.8-0.95: Diverse
- 0.95+: Very diverse
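The standard nucleus filter sorts tokens by probability and discards everything past the cumulative-mass threshold; a sketch:
import torch
def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cumulative > p
    # Shift right so the token that crosses the threshold is still kept.
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False
    # Map the mask back to the original (unsorted) vocabulary order.
    mask = remove.scatter(-1, sorted_indices, remove)
    return logits.masked_fill(mask, float("-inf"))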
5. Combined Sampling
Combine multiple strategies:
config = GenerationConfig(
do_sample=True,
temperature=0.8, # Add randomness
top_k=50, # Limit to top 50 tokens
top_p=0.9, # Within 90% probability mass
repetition_penalty=1.1 # Reduce repetition
)
🎯 Generation Modes
1. Single Generation
Generate one response to a prompt:
response = lb.generate_text(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompt="Explain quantum computing in simple terms:",
max_new_tokens=200,
temperature=0.7
)
2. Batch Generation
Generate multiple responses:
from llmbuilder.inference import batch_generate
prompts = [
"The benefits of renewable energy are",
"Artificial intelligence will help us",
"The future of space exploration includes"
]
responses = batch_generate(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompts=prompts,
max_new_tokens=100,
temperature=0.8,
batch_size=8
)
for prompt, response in zip(prompts, responses):
print(f"Prompt: {prompt}")
print(f"Response: {response}\n")
3. Interactive Generation
Real-time conversation mode:
from llmbuilder.inference import InteractiveGenerator
generator = InteractiveGenerator(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
config=GenerationConfig(
temperature=0.8,
top_k=50,
max_new_tokens=150
)
)
# Start interactive session
generator.start_session()
# Or use in code
while True:
prompt = input("You: ")
if prompt.lower() == 'quit':
break
response = generator.generate(prompt)
print(f"AI: {response}")
4. Streaming Generation
Generate text token by token:
from llmbuilder.inference import stream_generate
for token in stream_generate(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompt="The history of artificial intelligence",
max_new_tokens=200,
temperature=0.8
):
print(token, end='', flush=True)
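Stop strings can also be enforced client-side while streaming, cutting the stream off as soon as a stop marker appears; a sketch built on the same stream_generate call (the stop strings are just examples):
from llmbuilder.inference import stream_generate
stop_strings = ("</s>", "\n\n")
buffer = ""
for token in stream_generate(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt="The history of artificial intelligence",
    max_new_tokens=200,
    temperature=0.8
):
    buffer += token
    print(token, end='', flush=True)
    if any(stop in buffer for stop in stop_strings):
        break  # stop as soon as any stop string appears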
🔧 Advanced Generation Features
1. Prompt Engineering
Optimize prompts for better results:
# System prompt + user prompt
system_prompt = "You are a helpful AI assistant that provides accurate and concise answers."
user_prompt = "Explain machine learning in simple terms."
full_prompt = f"System: {system_prompt}\nUser: {user_prompt}\nAssistant:"
response = lb.generate_text(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompt=full_prompt,
max_new_tokens=200
)
2. Few-shot Learning
Provide examples in the prompt:
few_shot_prompt = """
Translate English to French:
English: Hello, how are you?
French: Bonjour, comment allez-vous?
English: What is your name?
French: Comment vous appelez-vous?
English: I love programming.
French:"""
response = lb.generate_text(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompt=few_shot_prompt,
max_new_tokens=50,
temperature=0.3 # Lower temperature for translation
)
3. Constrained Generation
Generate text with constraints:
from llmbuilder.inference import ConstrainedGenerator
# Generate text that must contain certain words
generator = ConstrainedGenerator(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
required_words=["machine learning", "neural networks", "data"],
forbidden_words=["impossible", "never"],
max_length=200
)
response = generator.generate("Explain AI technology:")
4. Format-Specific Generation
Generate structured output:
# JSON generation
json_prompt = """Generate a JSON object describing a person:
{
"name": "John Smith",
"age": 30,
"occupation": "Software Engineer",
"skills": ["Python", "JavaScript", "Machine Learning"]
}
Generate a similar JSON for a data scientist:
{"""
response = lb.generate_text(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompt=json_prompt,
max_new_tokens=150,
temperature=0.3,
stop_strings=["}"]
)
# Add closing brace
complete_json = response + "}"
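Because generation is stochastic, it is worth validating the result before using it; a minimal check with the standard library (whether you must also prepend the opening "{" depends on whether the prompt is echoed back in the response):
import json
try:
    data = json.loads(complete_json)
except json.JSONDecodeError:
    data = None  # invalid JSON: retry with a lower temperature or repair the string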
📊 Generation Quality Control
1. Output Filtering
Filter generated content:
from llmbuilder.inference import OutputFilter
filter_config = {
"min_length": 20, # Minimum response length
"max_repetition": 0.3, # Maximum repetition ratio
"profanity_filter": True, # Filter inappropriate content
"coherence_threshold": 0.7, # Minimum coherence score
"factuality_check": True # Basic fact checking
}
filtered_response = OutputFilter.filter(response, filter_config)
2. Quality Metrics
Evaluate generation quality:
from llmbuilder.inference import evaluate_generation
metrics = evaluate_generation(
generated_text=response,
reference_text=None, # Optional reference
prompt=prompt
)
print(f"Coherence: {metrics.coherence:.3f}")
print(f"Fluency: {metrics.fluency:.3f}")
print(f"Relevance: {metrics.relevance:.3f}")
print(f"Diversity: {metrics.diversity:.3f}")
print(f"Repetition: {metrics.repetition:.3f}")
3. A/B Testing
Compare different generation settings:
from llmbuilder.inference import compare_generations
configs = [
GenerationConfig(temperature=0.7, top_k=40),
GenerationConfig(temperature=0.8, top_p=0.9),
GenerationConfig(temperature=0.9, top_k=100, top_p=0.95)
]
results = compare_generations(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
prompt="Explain the benefits of renewable energy:",
configs=configs,
num_samples=10
)
for i, result in enumerate(results):
print(f"Config {i+1}: Quality={result.avg_quality:.3f}, Diversity={result.avg_diversity:.3f}")
🎮 Interactive Features
1. Chat Interface
Create a chat-like experience:
from llmbuilder.inference import ChatInterface
chat = ChatInterface(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
system_prompt="You are a helpful AI assistant.",
config=GenerationConfig(temperature=0.8, max_new_tokens=200)
)
# Start chat session
chat.start()
# Or use programmatically
conversation = []
while True:
user_input = input("You: ")
if user_input.lower() == 'quit':
break
response = chat.respond(user_input, conversation)
conversation.append({"user": user_input, "assistant": response})
print(f"AI: {response}")
2. Creative Writing Assistant
Specialized interface for creative writing:
from llmbuilder.inference import CreativeWriter
writer = CreativeWriter(
model_path="./model/model.pt",
tokenizer_path="./tokenizer",
style="creative",
config=GenerationConfig(temperature=1.0, top_p=0.95)
)
# Story continuation
story_start = "It was a dark and stormy night when Sarah discovered the mysterious letter..."
continuation = writer.continue_story(story_start, length=300)
# Character development
character = writer.develop_character("a brilliant but eccentric scientist")
# Dialogue generation
dialogue = writer.generate_dialogue("two friends discussing their dreams", turns=6)
🚨 Troubleshooting
Common Issues
Repetitive Output
# Solution: Adjust repetition penalty and sampling
config = GenerationConfig(
repetition_penalty=1.2, # Higher penalty
no_repeat_ngram_size=3, # Prevent 3-gram repetition
temperature=0.9, # Higher temperature
top_p=0.9 # Use nucleus sampling
)
Incoherent Output
# Solution: Lower temperature and use top-k
config = GenerationConfig(
temperature=0.6, # Lower temperature
top_k=40, # Limit choices
top_p=0.8, # Conservative nucleus
max_new_tokens=100 # Shorter responses
)
Too Conservative Output
# Solution: Increase temperature and sampling diversity
config = GenerationConfig(
temperature=1.0, # Higher temperature
top_k=100, # More choices
top_p=0.95, # Broader nucleus
repetition_penalty=1.1 # Slight repetition penalty
)
Slow Generation
# Solution: Optimize for speed
config = GenerationConfig(
max_new_tokens=50, # Shorter responses
do_sample=False, # Use greedy decoding
use_cache=True, # Enable KV cache
batch_size=1 # Single sample
)
# Use GPU if available (requires torch)
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
📚 Best Practices
1. Parameter Selection
- Start with temperature=0.8, top_k=50, top_p=0.9
- Adjust based on your specific use case
- Lower temperature for factual content
- Higher temperature for creative content
2. Prompt Engineering
- Be specific and clear in your prompts
- Use examples for complex tasks
- Include context and constraints
- Test different prompt formats
3. Quality Control
- Always validate generated content
- Use appropriate filtering for your use case
- Monitor for bias and inappropriate content
- Test with diverse inputs
4. Performance Optimization
- Use appropriate batch sizes
- Enable GPU acceleration when available
- Cache models for repeated use (see the sketch after this list)
- Consider quantization for deployment
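One simple way to cache models across calls is to load once and reuse; a sketch using the InteractiveGenerator shown earlier (the lru_cache wrapper is a suggestion, not a built-in):
from functools import lru_cache
from llmbuilder.inference import InteractiveGenerator
@lru_cache(maxsize=2)
def get_generator(model_path: str, tokenizer_path: str) -> InteractiveGenerator:
    # The expensive load runs once per (model, tokenizer) pair and is then reused.
    return InteractiveGenerator(model_path=model_path, tokenizer_path=tokenizer_path)
generator = get_generator("./model/model.pt", "./tokenizer")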
Generation Tips
- Experiment with different parameter combinations to find what works best
- Use lower temperatures for factual tasks and higher for creative tasks
- Always validate generated content before using in production
- Consider the trade-off between quality and speed for your use case
- Keep prompts clear and specific for better results