
Text Generation

Text generation is where your trained language model comes to life. LLMBuilder provides powerful and flexible text generation capabilities with various sampling strategies, interactive modes, and customization options.

🎯 Generation Overview

Text generation turns a prompt into new text by running your trained model autoregressively, sampling one token at a time:

graph LR
    A[Prompt] --> B[Tokenizer]
    B --> C[Model]
    C --> D[Sampling Strategy]
    D --> E[Generated Tokens]
    E --> F[Detokenizer]
    F --> G[Generated Text]

    D --> D1[Greedy]
    D --> D2[Top-k]
    D --> D3[Top-p]
    D --> D4[Temperature]

    style A fill:#e1f5fe
    style G fill:#e8f5e8
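
Conceptually, this is an autoregressive loop: encode the prompt, get next-token logits from the model, pick a token according to the sampling strategy, append it, and repeat until a stop condition. The snippet below is a minimal sketch of that loop, not LLMBuilder's implementation; the model and tokenizer interfaces (encode/decode, logits of shape batch x sequence x vocabulary) are assumptions.

import torch

def sketch_generate(model, tokenizer, prompt, max_new_tokens=50, temperature=0.8):
    # Illustrative loop only -- in practice use lb.generate_text / GenerationConfig
    ids = torch.tensor([tokenizer.encode(prompt)])            # Prompt -> token IDs
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids)[:, -1, :]                     # Logits for the next position (assumed output shape)
        probs = torch.softmax(logits / temperature, dim=-1)   # Sampling strategy (temperature shown here)
        next_id = torch.multinomial(probs, num_samples=1)     # Draw one token
        ids = torch.cat([ids, next_id], dim=1)                # Append and feed back in
    return tokenizer.decode(ids[0].tolist())                  # Token IDs -> text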

🚀 Quick Start

CLI Generation

# Interactive generation setup
llmbuilder generate text --setup

# Direct generation
llmbuilder generate text \
  --model ./model/model.pt \
  --tokenizer ./tokenizer \
  --prompt "The future of AI is" \
  --max-tokens 100 \
  --temperature 0.8

# Interactive chat mode
llmbuilder generate text \
  --model ./model/model.pt \
  --tokenizer ./tokenizer \
  --interactive

Python API Generation

import llmbuilder as lb

# Simple generation
text = lb.generate_text(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt="The future of AI is",
    max_new_tokens=100,
    temperature=0.8
)
print(text)

# Interactive generation
lb.interactive_cli(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    temperature=0.8
)

⚙️ Generation Parameters

Core Parameters

from llmbuilder.inference import GenerationConfig

config = GenerationConfig(
    # Length control
    max_new_tokens=100,         # Maximum tokens to generate
    min_new_tokens=10,          # Minimum tokens to generate
    max_length=1024,            # Total sequence length limit

    # Sampling parameters
    temperature=0.8,            # Creativity (0.1-2.0)
    top_k=50,                   # Top-k sampling
    top_p=0.9,                  # Nucleus sampling
    repetition_penalty=1.1,     # Prevent repetition

    # Special tokens
    pad_token_id=0,
    eos_token_id=2,
    bos_token_id=1,

    # Generation strategy
    do_sample=True,             # Use sampling vs greedy
    num_beams=1,                # Beam search width
    early_stopping=True         # Stop at EOS token
)

Advanced Parameters

config = GenerationConfig(
    # Advanced sampling
    typical_p=0.95,             # Typical sampling
    eta_cutoff=1e-4,            # Eta sampling cutoff
    epsilon_cutoff=1e-4,        # Epsilon sampling cutoff

    # Repetition control
    repetition_penalty=1.1,
    no_repeat_ngram_size=3,     # Prevent n-gram repetition
    encoder_repetition_penalty=1.0,

    # Length penalties
    length_penalty=1.0,         # Beam search length penalty
    exponential_decay_length_penalty=None,

    # Diversity
    num_beam_groups=1,          # Diverse beam search
    diversity_penalty=0.0,

    # Stopping criteria
    max_time=None,              # Maximum generation time
    stop_strings=["</s>", "\n\n"],  # Custom stop strings
)

🎨 Sampling Strategies

1. Greedy Decoding

Always choose the most likely token:

config = GenerationConfig(
    do_sample=False,            # Disable sampling
    temperature=1.0,            # Not used in greedy
    top_k=None,                 # Not used in greedy
    top_p=None                  # Not used in greedy
)

text = lb.generate_text(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt="Machine learning is",
    config=config
)

Use cases:

  • Deterministic output needed
  • Factual question answering
  • Code generation

2. Temperature Sampling

Control randomness with temperature:

# Conservative (more predictable)
conservative_config = GenerationConfig(
    temperature=0.3,            # Low temperature
    do_sample=True
)

# Balanced
balanced_config = GenerationConfig(
    temperature=0.8,            # Medium temperature
    do_sample=True
)

# Creative (more diverse)
creative_config = GenerationConfig(
    temperature=1.5,            # High temperature
    do_sample=True
)

Temperature effects:

  • 0.1-0.3: Very focused, predictable
  • 0.5-0.8: Balanced creativity
  • 1.0-1.5: More creative, diverse
  • Above 1.5: Very creative, potentially incoherent
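
What temperature actually does is divide the logits before the softmax: low values sharpen the distribution toward the most likely token, high values flatten it. A self-contained illustration with toy logits (not LLMBuilder internals):

import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])        # Toy next-token scores

for t in (0.3, 0.8, 1.5):
    probs = torch.softmax(logits / t, dim=-1)       # Low T sharpens, high T flattens
    print(f"T={t}: {[round(p, 3) for p in probs.tolist()]}")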

3. Top-k Sampling

Sample only from the k most likely tokens:

config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_k=40,                   # Consider top 40 tokens
    top_p=None                  # Disable nucleus sampling
)

Top-k values:

  • 1: Greedy decoding
  • 10-20: Conservative sampling
  • 40-100: Balanced sampling
  • 200+: Very diverse sampling
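
For reference, top-k filtering masks everything outside the k largest logits before sampling; a minimal sketch, shown as an illustration rather than the library's internals:

import torch

def top_k_filter(logits, k=40):
    # Keep the k largest logits; push the rest to -inf so softmax ignores them
    kth_value = torch.topk(logits, k).values[..., -1, None]
    return logits.masked_fill(logits < kth_value, float("-inf"))

logits = torch.randn(32000)                          # Stand-in vocabulary-sized logits
probs = torch.softmax(top_k_filter(logits, k=40), dim=-1)
next_id = torch.multinomial(probs, num_samples=1)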

4. Top-p (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability reaches p:

config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_k=None,                 # Disable top-k
    top_p=0.9                   # Use top 90% probability mass
)

Top-p values:

  • 0.1-0.3: Very focused
  • 0.5-0.7: Balanced
  • 0.8-0.95: Diverse
  • 0.95+: Very diverse
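
Top-p works on the sorted cumulative distribution: keep the smallest prefix of tokens whose probabilities sum past p and mask the rest. A sketch under the same caveats as above:

import torch

def top_p_filter(logits, p=0.9):
    # Sort by probability, keep the smallest prefix whose cumulative mass exceeds p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cum_probs > p
    remove[..., 1:] = remove[..., :-1].clone()       # Shift so the first token past p is kept
    remove[..., 0] = False
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    return sorted_logits, sorted_idx

logits = torch.randn(32000)
filtered, idx = top_p_filter(logits, p=0.9)
next_id = idx[torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)]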

5. Combined Sampling

Combine multiple strategies:

config = GenerationConfig(
    do_sample=True,
    temperature=0.8,            # Add randomness
    top_k=50,                   # Limit to top 50 tokens
    top_p=0.9,                  # Within 90% probability mass
    repetition_penalty=1.1      # Reduce repetition
)
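
When these are combined, a common ordering is: apply the repetition penalty to the logits, divide by temperature, then apply top-k and top-p filtering before sampling. The repetition penalty itself is typically implemented as in this sketch (the standard CTRL-style rule; LLMBuilder's exact behavior may differ):

import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    # generated_ids: list of token IDs produced so far.
    # Already-seen tokens get their scores pushed down: positive logits are
    # divided by the penalty, negative ones multiplied by it.
    for token_id in set(generated_ids):
        score = logits[token_id].item()
        logits[token_id] = score / penalty if score > 0 else score * penalty
    return logits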

🎯 Generation Modes

1. Single Generation

Generate one response to a prompt:

response = lb.generate_text(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt="Explain quantum computing in simple terms:",
    max_new_tokens=200,
    temperature=0.7
)

2. Batch Generation

Generate multiple responses:

from llmbuilder.inference import batch_generate

prompts = [
    "The benefits of renewable energy are",
    "Artificial intelligence will help us",
    "The future of space exploration includes"
]

responses = batch_generate(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompts=prompts,
    max_new_tokens=100,
    temperature=0.8,
    batch_size=8
)

for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

3. Interactive Generation

Real-time conversation mode:

from llmbuilder.inference import InteractiveGenerator

generator = InteractiveGenerator(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    config=GenerationConfig(
        temperature=0.8,
        top_k=50,
        max_new_tokens=150
    )
)

# Start interactive session
generator.start_session()

# Or use in code
while True:
    prompt = input("You: ")
    if prompt.lower() == 'quit':
        break

    response = generator.generate(prompt)
    print(f"AI: {response}")

4. Streaming Generation

Generate text token by token:

from llmbuilder.inference import stream_generate

for token in stream_generate(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt="The history of artificial intelligence",
    max_new_tokens=200,
    temperature=0.8
):
    print(token, end='', flush=True)

🔧 Advanced Generation Features

1. Prompt Engineering

Optimize prompts for better results:

# System prompt + user prompt
system_prompt = "You are a helpful AI assistant that provides accurate and concise answers."
user_prompt = "Explain machine learning in simple terms."

full_prompt = f"System: {system_prompt}\nUser: {user_prompt}\nAssistant:"

response = lb.generate_text(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt=full_prompt,
    max_new_tokens=200
)

2. Few-shot Learning

Provide examples in the prompt:

few_shot_prompt = """
Translate English to French:

English: Hello, how are you?
French: Bonjour, comment allez-vous?

English: What is your name?
French: Comment vous appelez-vous?

English: I love programming.
French:"""

response = lb.generate_text(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt=few_shot_prompt,
    max_new_tokens=50,
    temperature=0.3  # Lower temperature for translation
)

3. Constrained Generation

Generate text with constraints:

from llmbuilder.inference import ConstrainedGenerator

# Generate text that must contain certain words
generator = ConstrainedGenerator(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    required_words=["machine learning", "neural networks", "data"],
    forbidden_words=["impossible", "never"],
    max_length=200
)

response = generator.generate("Explain AI technology:")

4. Format-Specific Generation

Generate structured output:

# JSON generation
json_prompt = """Generate a JSON object describing a person:
{
  "name": "John Smith",
  "age": 30,
  "occupation": "Software Engineer",
  "skills": ["Python", "JavaScript", "Machine Learning"]
}

Generate a similar JSON for a data scientist:
{"""

response = lb.generate_text(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt=json_prompt,
    max_new_tokens=150,
    temperature=0.3,
    stop_strings=["}"]
)

# Re-attach the opening brace from the prompt and the closing brace removed by the stop string
complete_json = "{" + response + "}"

📊 Generation Quality Control

1. Output Filtering

Filter generated content:

from llmbuilder.inference import OutputFilter

filter_config = {
    "min_length": 20,           # Minimum response length
    "max_repetition": 0.3,      # Maximum repetition ratio
    "profanity_filter": True,   # Filter inappropriate content
    "coherence_threshold": 0.7, # Minimum coherence score
    "factuality_check": True    # Basic fact checking
}

filtered_response = OutputFilter.filter(response, filter_config)

2. Quality Metrics

Evaluate generation quality:

from llmbuilder.inference import evaluate_generation

metrics = evaluate_generation(
    generated_text=response,
    reference_text=None,        # Optional reference
    prompt=prompt
)

print(f"Coherence: {metrics.coherence:.3f}")
print(f"Fluency: {metrics.fluency:.3f}")
print(f"Relevance: {metrics.relevance:.3f}")
print(f"Diversity: {metrics.diversity:.3f}")
print(f"Repetition: {metrics.repetition:.3f}")

3. A/B Testing

Compare different generation settings:

from llmbuilder.inference import compare_generations

configs = [
    GenerationConfig(temperature=0.7, top_k=40),
    GenerationConfig(temperature=0.8, top_p=0.9),
    GenerationConfig(temperature=0.9, top_k=100, top_p=0.95)
]

results = compare_generations(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    prompt="Explain the benefits of renewable energy:",
    configs=configs,
    num_samples=10
)

for i, result in enumerate(results):
    print(f"Config {i+1}: Quality={result.avg_quality:.3f}, Diversity={result.avg_diversity:.3f}")

🎮 Interactive Features

1. Chat Interface

Create a chat-like experience:

from llmbuilder.inference import ChatInterface

chat = ChatInterface(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    system_prompt="You are a helpful AI assistant.",
    config=GenerationConfig(temperature=0.8, max_new_tokens=200)
)

# Start chat session
chat.start()

# Or use programmatically
conversation = []
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break

    response = chat.respond(user_input, conversation)
    conversation.append({"user": user_input, "assistant": response})
    print(f"AI: {response}")

2. Creative Writing Assistant

Specialized interface for creative writing:

from llmbuilder.inference import CreativeWriter

writer = CreativeWriter(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    style="creative",
    config=GenerationConfig(temperature=1.0, top_p=0.95)
)

# Story continuation
story_start = "It was a dark and stormy night when Sarah discovered the mysterious letter..."
continuation = writer.continue_story(story_start, length=300)

# Character development
character = writer.develop_character("a brilliant but eccentric scientist")

# Dialogue generation
dialogue = writer.generate_dialogue("two friends discussing their dreams", turns=6)

🚨 Troubleshooting

Common Issues

Repetitive Output

# Solution: Adjust repetition penalty and sampling
config = GenerationConfig(
    repetition_penalty=1.2,     # Higher penalty
    no_repeat_ngram_size=3,     # Prevent 3-gram repetition
    temperature=0.9,            # Higher temperature
    top_p=0.9                   # Use nucleus sampling
)

Incoherent Output

# Solution: Lower temperature and use top-k
config = GenerationConfig(
    temperature=0.6,            # Lower temperature
    top_k=40,                   # Limit choices
    top_p=0.8,                  # Conservative nucleus
    max_new_tokens=100          # Shorter responses
)

Too Conservative Output

# Solution: Increase temperature and sampling diversity
config = GenerationConfig(
    temperature=1.0,            # Higher temperature
    top_k=100,                  # More choices
    top_p=0.95,                 # Broader nucleus
    repetition_penalty=1.1      # Slight repetition penalty
)

Slow Generation

# Solution: Optimize for speed
config = GenerationConfig(
    max_new_tokens=50,          # Shorter responses
    do_sample=False,            # Use greedy decoding
    use_cache=True,             # Enable KV cache
    batch_size=1                # Single sample
)

# Use GPU if available (torch import needed for the device check)
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

📚 Best Practices

1. Parameter Selection

  • Start with temperature=0.8, top_k=50, top_p=0.9
  • Adjust based on your specific use case
  • Lower temperature for factual content
  • Higher temperature for creative content
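
Expressed as a config, that starting point looks like this (values taken from the list above):

from llmbuilder.inference import GenerationConfig

starting_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,            # Lower for factual tasks, raise for creative ones
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1      # Mild repetition control, as in the examples above
)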

2. Prompt Engineering

  • Be specific and clear in your prompts
  • Use examples for complex tasks
  • Include context and constraints
  • Test different prompt formats

3. Quality Control

  • Always validate generated content
  • Use appropriate filtering for your use case
  • Monitor for bias and inappropriate content
  • Test with diverse inputs

4. Performance Optimization

  • Use appropriate batch sizes
  • Enable GPU acceleration when available
  • Cache models for repeated use
  • Consider quantization for deployment
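
One way to cache a model for repeated use is to construct a generator once and reuse it across prompts, as in the interactive example earlier, rather than calling lb.generate_text in a loop (which is passed file paths and so presumably reloads the model and tokenizer on each call):

from llmbuilder.inference import InteractiveGenerator, GenerationConfig

# Load the model and tokenizer once, then reuse the generator for every prompt
generator = InteractiveGenerator(
    model_path="./model/model.pt",
    tokenizer_path="./tokenizer",
    config=GenerationConfig(temperature=0.8, max_new_tokens=100)
)

for prompt in ["Summarize renewable energy trends:", "List three uses of AI in medicine:"]:
    print(generator.generate(prompt))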

Generation Tips

  • Experiment with different parameter combinations to find what works best
  • Use lower temperatures for factual tasks and higher for creative tasks
  • Always validate generated content before using in production
  • Consider the trade-off between quality and speed for your use case
  • Keep prompts clear and specific for better results