
January 23, 2024

FunSearch: Harnessing LLM Creativity Through Evaluation Loops

Introduction

Large Language Models (LLMs) like GPT-4, LLaMA, and Claude are known to occasionally produce “hallucinations” — outputs that are creative but factually incorrect or inconsistent. While problematic in factual question-answering contexts, this inherent creativity can be systematically leveraged to solve complex computational problems.

FunSearch (Function Search) represents a paradigm shift in how we utilize LLMs. Rather than treating hallucinations as a flaw, this approach harnesses the model’s creative outputs in a controlled, iterative process to discover novel solutions to challenging problems. The core insight: pair an LLM’s generative capabilities with a rigorous evaluation mechanism that can objectively verify the quality of generated solutions.

At its core, FunSearch implements an “evolutionary computation” framework:

  1. Generation: The LLM proposes candidate solutions based on a prompt
  2. Evaluation: An external system rigorously tests these solutions against objective criteria
  3. Feedback: Results from evaluation inform the next generation attempt
  4. Refinement: The process iterates until a satisfactory solution emerges

This approach has proven effective for diverse challenges ranging from algorithm design and mathematical proofs to complex optimization problems where traditional approaches fall short.

1. Core Concept and Workflow

The FunSearch Process

  1. Problem Definition
    • A clear problem statement is formulated, typically as a programming task, mathematical proof, or optimization challenge
    • Evaluation criteria are explicitly defined (correctness, efficiency, performance metrics)
  2. LLM Generation Phase
    • The LLM produces multiple candidate solutions based on the problem description
    • The generation can be guided by examples, constraints, or prior attempts
  3. Rigorous Evaluation
    • A separate system (executor, test suite, symbolic validator) objectively assesses each candidate
    • For code: compilation success, test case performance, complexity analysis
    • For mathematical proofs: logical consistency, completeness
  4. Structured Feedback
    • The evaluator produces specific, actionable feedback on each candidate
    • Error messages, performance metrics, logical gaps are identified
  5. Guided Regeneration
    • The original prompt is augmented with evaluation feedback
    • The LLM produces refined solutions addressing previous shortcomings
  6. Iteration and Convergence
    • The loop continues until one of the following occurs:
      • A solution meeting all criteria is found
      • The maximum iteration limit is reached
      • Progress plateaus across several consecutive iterations
    • A minimal code sketch of this loop follows the list
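
Stripped to its essentials, the loop is a few lines of driver code. The sketch below is schematic rather than tied to any particular API: generate and evaluate are hypothetical callables standing in for an LLM call and a test harness.

from typing import Callable, Optional, Tuple

def funsearch_loop(
    generate: Callable[[str], str],               # hypothetical: prompt -> candidate solution
    evaluate: Callable[[str], Tuple[bool, str]],  # hypothetical: candidate -> (passed, feedback)
    problem: str,
    max_iterations: int = 5,
) -> Optional[str]:
    """Iterate generation and evaluation until a candidate passes or the budget runs out."""
    prompt = problem
    for _ in range(max_iterations):
        candidate = generate(prompt)              # Step 2: LLM generation
        passed, feedback = evaluate(candidate)    # Step 3: rigorous evaluation
        if passed:                                # Step 6: convergence
            return candidate
        # Steps 4-5: fold the feedback into the next prompt
        prompt = f"{problem}\n\nPrevious attempt failed:\n{feedback}\nPlease revise."
    return None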

High-Level Flow Diagram

Flowchart Diagram

2. Implementation Deep Dive: Practical FunSearch

The following Python implementation demonstrates a complete FunSearch system for code generation tasks. This example goes beyond conceptual explanation to provide a working framework you can adapt for your own projects:

import subprocess
import sys
import time
from typing import Dict, Tuple, List, Any

# In a real implementation, you would use an actual LLM API.
# Example with the OpenAI Python SDK (v1.x client; install with `pip install openai`)
from openai import OpenAI

class FunSearch:
    """
    A complete implementation of the FunSearch approach for code generation
    tasks, using an evaluation-based feedback loop.
    """
    
    def __init__(
        self, 
        api_key: str, 
        model: str = "gpt-4", 
        max_iterations: int = 5,
        temperature: float = 0.7,
        timeout_seconds: int = 10
    ):
        """
        Initialize the FunSearch system.
        
        Args:
            api_key: API key for the LLM service
            model: Model identifier to use
            max_iterations: Maximum number of refinement attempts
            temperature: Creativity parameter for generation (0.0-1.0)
            timeout_seconds: Maximum execution time for candidate solutions
        """
        self.client = OpenAI(api_key=api_key)
        self.model = model
        self.max_iterations = max_iterations
        self.temperature = temperature
        self.timeout_seconds = timeout_seconds
        self.history = []
        
    def generate_solution(self, system_prompt: str, user_prompt: str) -> str:
        """
        Generate a candidate solution using the configured LLM.
        
        Args:
            system_prompt: Instructions for the model's role and task
            user_prompt: The specific problem description and requirements
            
        Returns:
            Generated code as a string
        """
        messages = [{"role": "system", "content": system_prompt}]

        # Replay prior attempts so the model sees its earlier solutions
        # and the feedback each one received, in chronological order
        for entry in self.history:
            messages.append({"role": "assistant", "content": entry["solution"]})
            messages.append({"role": "user", "content": entry["feedback"]})

        # The current (possibly feedback-augmented) problem statement comes last,
        # so the model responds to it rather than to an old feedback message
        messages.append({"role": "user", "content": user_prompt})
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=self.temperature,
            max_tokens=2000
        )
        
        return response.choices[0].message.content
    
    def evaluate_python_solution(
        self, 
        code: str, 
        test_cases: List[Dict[str, Any]]
    ) -> Tuple[bool, str, Dict]:
        """
        Evaluate a Python solution against multiple test cases.
        
        Args:
            code: The Python code to evaluate
            test_cases: List of test cases with inputs and expected outputs
            
        Returns:
            Tuple containing:
            - Success flag (True/False)
            - Feedback message
            - Detailed results including performance metrics
        """
        # Save the code to a temporary file
        with open("candidate_solution.py", "w") as f:
            f.write(code)
        
        results = {
            "passed": 0,
            "failed": 0,
            "errors": [],
            "execution_times": []
        }
        
        # Run each test case
        for i, test in enumerate(test_cases):
            test_input = test.get("input", "")
            expected_output = test.get("expected", "")
            
            try:
                # Time the execution
                start_time = time.time()
                
                process = subprocess.Popen(
                    [sys.executable, "candidate_solution.py"],
                    stdin=subprocess.PIPE,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.PIPE,
                    text=True
                )
                
                # Provide input and capture output
                stdout, stderr = process.communicate(
                    input=str(test_input), 
                    timeout=self.timeout_seconds
                )
                
                execution_time = time.time() - start_time
                results["execution_times"].append(execution_time)
                
                # Check for errors
                if stderr:
                    results["failed"] += 1
                    results["errors"].append({
                        "test_case": i,
                        "type": "Runtime Error",
                        "message": stderr
                    })
                    continue
                    
                # Compare with expected output
                if stdout.strip() == str(expected_output).strip():
                    results["passed"] += 1
                else:
                    results["failed"] += 1
                    results["errors"].append({
                        "test_case": i,
                        "type": "Output Mismatch",
                        "expected": expected_output,
                        "actual": stdout.strip()
                    })
                    
            except subprocess.TimeoutExpired:
                process.kill()
                process.communicate()  # Reap the killed process so it does not linger
                results["failed"] += 1
                results["errors"].append({
                    "test_case": i,
                    "type": "Timeout",
                    "message": f"Execution exceeded {self.timeout_seconds} seconds"
                })
        
        # Determine overall success
        success = results["failed"] == 0 and results["passed"] == len(test_cases)
        
        # Generate human-readable feedback
        if success:
            feedback = f"Success! All {results['passed']} test cases passed."
            if results["execution_times"]:
                avg_time = sum(results["execution_times"]) / len(results["execution_times"])
                feedback += f" Average execution time: {avg_time:.4f} seconds."
        else:
            feedback = f"Failed {results['failed']} out of {len(test_cases)} test cases.\n"
            for error in results["errors"]:
                feedback += f"- Test case {error['test_case']}: {error['type']}\n"
                if error['type'] == "Output Mismatch":
                    feedback += f"  Expected: {error['expected']}\n"
                    feedback += f"  Actual: {error['actual']}\n"
                elif 'message' in error:
                    feedback += f"  {error['message']}\n"
        
        return success, feedback, results
    
    def search(
        self, 
        problem_description: str, 
        test_cases: List[Dict[str, Any]]
    ) -> Dict:
        """
        Execute the full FunSearch process to find a solution.
        
        Args:
            problem_description: Detailed description of the problem to solve
            test_cases: List of test cases for evaluation
            
        Returns:
            Dictionary with the best solution and full history
        """
        system_prompt = (
            "You are an expert Python programmer. Your task is to write correct, "
            "efficient, and clean Python code that solves the given problem. "
            "Focus on correctness first, then efficiency."
        )
        
        best_solution = None
        best_results = None
        
        print(f"Starting FunSearch with {self.max_iterations} maximum iterations")
        
        for iteration in range(self.max_iterations):
            print(f"\n--- Iteration {iteration+1}/{self.max_iterations} ---")
            
            # Generate a solution
            code = self.generate_solution(system_prompt, problem_description)
            print(f"\nGenerated solution:\n{'-'*20}\n{code}\n{'-'*20}")
            
            # Evaluate the solution
            success, feedback, results = self.evaluate_python_solution(code, test_cases)
            print(f"\nEvaluation feedback: {feedback}")
            
            # Record this attempt
            attempt = {
                "iteration": iteration + 1,
                "solution": code,
                "success": success,
                "feedback": feedback,
                "results": results
            }
            self.history.append(attempt)
            
            # Check if we found a successful solution
            if success:
                print("\n✅ Found successful solution!")
                best_solution = code
                best_results = results
                break
            
            # Update the problem description with feedback for the next iteration
            problem_description = (
                f"{problem_description}\n\n"
                f"My previous attempt had issues:\n{feedback}\n\n"
                "Please fix these issues and try again."
            )
        
        return {
            "success": best_solution is not None,
            "best_solution": best_solution,
            "best_results": best_results,
            "iterations_used": len(self.history),
            "history": self.history
        }


# Example usage:
if __name__ == "__main__":
    # Replace with your actual API key
    API_KEY = "your_openai_api_key_here"
    
    funsearch = FunSearch(
        api_key=API_KEY,
        model="gpt-4",
        max_iterations=3,
        temperature=0.7
    )
    
    problem = """
    Write a Python program that reads a number N from stdin and calculates its factorial.
    
    Requirements:
    1. The program should handle positive integers up to 20
    2. For inputs that aren't positive integers, print "Invalid input"
    3. The output should only contain the resulting factorial value, nothing else
    4. Do not use any external libraries beyond Python's built-ins
    """
    
    test_cases = [
        {"input": "5", "expected": "120"},
        {"input": "0", "expected": "1"},
        {"input": "20", "expected": "2432902008176640000"},
        {"input": "-3", "expected": "Invalid input"},
        {"input": "abc", "expected": "Invalid input"}
    ]
    
    result = funsearch.search(problem, test_cases)
    
    # Report final results
    if result["success"]:
        print("\n=== SUCCESSFUL SOLUTION FOUND ===")
        print(f"Required {result['iterations_used']} iterations")
        print("\nFinal solution:")
        print(result["best_solution"])
    else:
        print("\n=== NO SUCCESSFUL SOLUTION FOUND ===")
        print("Best attempt:")
        best_attempt = max(result["history"], key=lambda x: x["results"]["passed"])
        print(f"Passed {best_attempt['results']['passed']} of {len(test_cases)} test cases")

This implementation includes several advanced features:

  • Complete tracking of solution history
  • Detailed performance metrics
  • Structured test case evaluation
  • Intelligent feedback generation
  • Conversation-style prompt refinement

Visualization of the Iteration Process

Sequence Diagram

3. Real-World Applications and Breakthroughs

FunSearch has demonstrated impressive capabilities beyond academic exercises. Here are some notable applications:

Bin Packing Algorithms

Researchers at DeepMind used FunSearch to discover new algorithms for the bin packing problem—a classic NP-hard optimization challenge with applications in logistics, manufacturing, and resource allocation. The system discovered a novel heuristic that outperformed human-designed algorithms on certain problem instances.
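
To make this concrete, the published bin packing work kept a fixed evaluation scaffold and let the LLM repeatedly rewrite only a small priority function that scores candidate bins for each incoming item. The sketch below is illustrative, with a best-fit baseline standing in for an evolved heuristic, not the discovered function itself:

from typing import List

def priority(item: float, bin_capacities: List[float]) -> List[float]:
    """Candidate heuristic: score each open bin for the incoming item.
    In the FunSearch setup this is the piece the LLM rewrites each iteration,
    while the scaffold below stays fixed. (Illustrative baseline: best-fit.)"""
    return [-(cap - item) if cap >= item else float("-inf") for cap in bin_capacities]

def online_bin_packing(items: List[float], bin_size: float = 1.0) -> int:
    """Fixed evaluation scaffold: place items one at a time using `priority`
    and report how many bins were opened (lower is better)."""
    bins: List[float] = []                         # remaining capacity of each open bin
    for item in items:
        scores = priority(item, bins)
        best = max(range(len(bins)), key=lambda i: scores[i], default=None)
        if best is None or scores[best] == float("-inf"):
            bins.append(bin_size - item)           # no bin fits: open a new one
        else:
            bins[best] -= item                     # place in the highest-scoring bin
    return len(bins)

print(online_bin_packing([0.4, 0.7, 0.2, 0.5, 0.3, 0.6]))   # -> 3 bins with the best-fit baseline

Because only the priority function changes between iterations, candidates are cheap to evaluate and easy to compare on a shared benchmark.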

Mathematical Conjectures

FunSearch has been applied to explore mathematical conjectures by generating potential counterexamples or supporting constructions. In the original DeepMind study, it discovered new constructions for the cap set problem in extremal combinatorics, improving on the best previously known results where traditional approaches had stalled.

Circuit Design Optimization

Engineers have applied FunSearch principles to optimize electronic circuit designs, where the evaluator simulates circuit performance against specifications. This has resulted in novel arrangements that human designers overlooked.

Drug Discovery

In pharmaceutical research, modified FunSearch approaches help generate molecular structures with specific properties, with evaluation performed through chemical simulation software.

4. Resource Optimization Strategies

Running FunSearch systems efficiently requires careful resource management:

Model Quantization and Optimization

  • Int8/Int4 Quantization: Reduce model precision without significantly impacting generation quality

  • Pruning: Remove unnecessary weights to create leaner models for iteration

  • Knowledge Distillation: Train smaller models that approximate larger ones

  • Example Implementation:

    # 8-bit loading via bitsandbytes (requires the bitsandbytes and accelerate packages)
    from transformers import AutoModelForCausalLM
    
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",   # HF-format checkpoint
        load_in_8bit=True,            # Enable 8-bit quantization
        device_map="auto"             # Automatically manage device placement
    )

Batch Processing and Parallelization

  • Generate multiple candidate solutions in parallel
  • Evaluate candidates concurrently using multiprocessing (see the sketch below)
  • Cache evaluation results to avoid redundant computation
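
Candidate evaluation parallelizes naturally with concurrent.futures. In this sketch, evaluate is a stand-in for whatever test harness you use; it must be a picklable top-level function for process pools:

from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Tuple

def evaluate_batch(
    candidates: List[str],
    evaluate: Callable[[str], Tuple[bool, str]],   # hypothetical test harness: code -> (passed, feedback)
    max_workers: int = 4,
) -> List[Tuple[bool, str]]:
    """Evaluate several candidate programs concurrently, one worker process each,
    so a slow or crashing candidate does not hold up the rest of the batch."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, candidates))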

Evaluation Optimization

  • Implement early stopping for clearly failing candidates (a sketch follows this list)
  • Use progressive evaluation (test simple cases first)
  • Profile and optimize evaluation code for maximum throughput
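
A minimal sketch of progressive evaluation with early stopping, assuming a hypothetical run_test helper that executes a single test case and that test_cases are ordered cheapest-first:

from typing import Callable, Dict, List, Tuple

def progressive_evaluate(
    candidate: str,
    test_cases: List[Dict],                       # assumed ordered cheapest-first
    run_test: Callable[[str, Dict], bool],        # hypothetical single-test runner
    max_failures: int = 1,
) -> Tuple[int, int]:
    """Run cheap tests first and abandon the candidate as soon as it is clearly failing,
    reserving the expensive tests for candidates that survive."""
    passed = failed = 0
    for test in test_cases:
        if run_test(candidate, test):
            passed += 1
        else:
            failed += 1
            if failed >= max_failures:
                break                             # early stopping
    return passed, failed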

5. Advanced Techniques and Extensions

Multi-Agent FunSearch

Recent research has explored using multiple specialized LLMs in the loop:

  • Generator: Creates diverse solution candidates
  • Critic: Analyzes generated solutions for potential issues before evaluation
  • Refiner: Specializes in fixing identified problems
  • Evaluator: Provides objective assessments

This approach mimics a collaborative human team, where different expertise is brought to bear on complex problems.
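
One way to wire such a team together is a simple pipeline over a shared call_llm helper (hypothetical here), with each role defined purely by its system prompt:

from typing import Callable

def multi_agent_round(
    problem: str,
    call_llm: Callable[[str, str], str],   # hypothetical: (system prompt, user content) -> reply
) -> str:
    """One generator -> critic -> refiner pass. Each role is the same underlying model
    with a different system prompt; the external evaluator still has the final word."""
    draft = call_llm("You are a creative programmer. Propose a solution.", problem)
    critique = call_llm("You are a strict code reviewer. List likely bugs and weaknesses.", draft)
    refined = call_llm(
        "You are a careful programmer. Revise the draft to address the review.",
        f"Problem:\n{problem}\n\nDraft:\n{draft}\n\nReview:\n{critique}",
    )
    return refined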

Self-Improvement Mechanisms

Advanced FunSearch systems can incorporate:

  • Meta-learning: The system learns which types of solutions tend to work better over time
  • Curriculum learning: Starting with simpler problem instances and progressively tackling harder versions
  • Memory banks: Storing successful patterns from previous runs to inform new generations (see the sketch below)
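
A memory bank can be as simple as keeping the top-scoring solutions per problem tag and splicing them into later prompts as few-shot examples. The class below is a toy sketch of that idea:

from typing import Dict, List, Tuple

class SolutionMemory:
    """Toy memory bank: keep the top-scoring solutions per problem tag and
    surface them as few-shot examples in later prompts."""

    def __init__(self, per_tag: int = 3):
        self.per_tag = per_tag
        self.store: Dict[str, List[Tuple[float, str]]] = {}   # tag -> [(score, solution), ...]

    def add(self, tag: str, solution: str, score: float) -> None:
        entries = self.store.setdefault(tag, [])
        entries.append((score, solution))
        entries.sort(key=lambda e: e[0], reverse=True)         # best first
        del entries[self.per_tag:]                             # keep only the strongest examples

    def as_prompt_examples(self, tag: str) -> str:
        return "\n\n".join(solution for _, solution in self.store.get(tag, []))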

Hybrid Symbolic-Neural Approaches

Combining FunSearch with symbolic reasoning:

  • Use LLMs for creative generation
  • Employ symbolic solvers (SAT/SMT solvers, theorem provers) for rigorous verification (see the sketch below)
  • Create a complementary system that leverages the strengths of both approaches
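
As a small illustration of the symbolic side, the Z3 SMT solver can check an LLM-proposed identity by searching for a counterexample; the claim below is hard-coded for the example:

# Requires the z3-solver package: pip install z3-solver
from z3 import Ints, Not, Solver, sat, unsat

x, y = Ints("x y")

# Identity "proposed by the generator" (hard-coded here for illustration)
claim = (x + y) ** 2 == x ** 2 + 2 * x * y + y ** 2

solver = Solver()
solver.add(Not(claim))                 # ask Z3 to find a counterexample
result = solver.check()
if result == unsat:
    print("Verified: no counterexample exists")
elif result == sat:
    print("Counterexample:", solver.model())
else:
    print("Undecided: nonlinear integer arithmetic can come back 'unknown'")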

6. Limitations and Ethical Considerations

Computational Costs

  • Repeated LLM generations can be expensive, especially with larger models
  • Evaluations for complex problems may require significant compute resources
  • Consider environmental impact of extensive computation

Solution Verifiability

  • For some domains, exhaustive testing may be impossible
  • Formal verification might be necessary for critical applications
  • Always validate FunSearch solutions carefully before deployment

Intellectual Property Questions

  • Who owns solutions discovered by FunSearch?
  • Are these solutions derivative works or novel inventions?
  • Consider attribution and licensing implications

Conclusion and Future Directions

FunSearch reframes computational problem-solving by turning an LLM limitation into a strength. By coupling creative generation with rigorous evaluation, we can build systems capable of discovering novel solutions to complex problems.

The future of this field looks promising, with several exciting directions:

  1. Integration with specialized domain tools (physics simulators, formal verifiers, etc.)
  2. Self-evolving evaluation criteria where the system learns to judge solutions better over time
  3. Cross-domain transfers where insights from one problem space inform solutions in another
  4. Human-AI collaborative loops where human experts work alongside FunSearch systems

As LLMs continue to advance in capabilities while becoming more accessible and efficient, FunSearch approaches will likely become standard components in computational research, creative problem-solving, and automated innovation pipelines.

By implementing the patterns described in this article, you can begin leveraging this powerful paradigm for your own challenging computational problems today.
