October 19, 2023

Graph of Thoughts: Solving Complex Problems with Large Language Models

Executive Summary

Graph of Thoughts (GoT) is an advanced prompting paradigm that organizes LLM “thoughts” as vertices in a directed graph, with edges representing dependencies between thoughts. This approach substantially outperforms traditional methods like Chain-of-Thought (CoT) prompting when solving complex, multi-step problems. GoT enables non-linear reasoning through merging parallel ideas, revisiting previous states, and refining existing solutions iteratively—capabilities impossible in simpler linear or tree-based paradigms. This article presents a comprehensive technical overview of GoT implementation, complete with code examples and practical applications.

1. Introduction and Motivation

Large Language Models (LLMs) have shown remarkable performance on tasks from coding to summarization. Yet, as tasks grow more elaborate—involving multiple subtasks or requiring re-combination of partial solutions—simpler reasoning paradigms struggle with fundamental limitations:

  • Chain-of-Thought prompting (CoT; Wei et al., 2022) organizes a linear "thought process," but cannot revisit or merge separate lines of reasoning.
  • Self-Consistency (Wang et al., 2023) samples multiple reasoning paths, but only selects the most common answer without merging insights.
  • Tree-of-Thoughts (Yao et al., 2023) allows branching exploration, but remains restricted to tree structures, without the ability to freely merge, loop, or revisit.

These limitations matter particularly for tasks that require:

  • Combining partial solutions from different approaches
  • Iterative refinement with feedback loops
  • Revisiting earlier reasoning steps to correct errors
  • Managing complex interdependencies between subtasks

To address these limitations, Besta et al. (2023) propose the Graph of Thoughts (GoT) paradigm (arXiv:2308.09687). GoT organizes the LLM's intermediate outputs (i.e., "thoughts") in a directed graph:

  • Vertices represent partial solutions (ideas, hypotheses, computation states).
  • Edges represent how each new thought derives from, builds upon, or merges older ones.

By adopting this flexible graph abstraction, you can combine partial solutions, re-check older states, or refine existing ideas more robustly. Empirical results show improved accuracy, less redundancy, and lower cost (token usage) compared to chain- or tree-based methods.

[Figure: Chain of Thought (CoT)]

2. High-Level Architecture

The GoT framework consists of four primary modules working in concert to maintain and expand a dynamic graph of intermediate reasoning states:

  1. Prompter: Prepares structured prompts for the LLM, including relevant subgraph contexts or specific nodes to expand. It transforms graph state into natural language that guides the LLM’s next reasoning step.
  2. Parser: Converts the raw LLM responses into structured “thought states” and updates the underlying graph. It extracts key information and maintains the graph’s integrity.
  3. Scoring & Validation: Assesses correctness, plausibility, or quality of each newly generated thought. This component filters out errors and prioritizes promising directions.
  4. Controller: Orchestrates the entire process by deciding which nodes to expand next, whether to merge partial solutions, or when to stop. It implements the reasoning strategy.

The system maintains two critical data structures:

  • Graph Reasoning State (GRS): Contains all nodes (thoughts), edges (dependencies), and relevant metadata such as scores, timestamps, and validity markers.
  • Graph of Operations (GoO): A user-defined plan that instructs the Controller on what transformations to apply at each step (e.g., “generate 3 expansions for each active node,” “merge top 2 candidates,” “refine node #4”).

This modular architecture allows for flexible implementation across various problem domains while maintaining a consistent reasoning approach.
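
To make this concrete, a minimal GRS instance might look like the following in Python. The field names mirror the implementation in Section 3; the contents are purely illustrative:

grs = {
    "nodes": [
        {"id": 0, "content": "Split the list and sort each half", "score_overall": 0.7, "valid": True},
        {"id": 1, "content": "Merge the two sorted halves", "score_overall": 0.8, "valid": True},
    ],
    "edges": [
        (0, 1, {"operation": "generate"})  # thought 1 was derived from thought 0
    ],
    "metadata": {"created_at": 1697702400.0}
}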

[Figure: Graph of Thoughts architecture]

3. Key Modules (with Code Snippets)

Below is a detailed implementation of the core GoT components in Python. These modules can be adapted to your specific use case.

3.1 Prompter

The Prompter builds text prompts for the LLM, integrating the current subgraph or active nodes. It transforms graph state into natural language instructions that the LLM can understand.

class Prompter:
    def __init__(self, templates):
        """
        Initialize with a dictionary of prompt templates for different operations.
        
        :param templates: Dict mapping operation types to template strings.
                         Each template can contain placeholders for dynamic data.
                         For example: {"generate": "Expand on these ideas: {node_list}",
                                      "merge": "Combine these concepts: {node_list}"}
        """
        self.templates = templates
    
    def create_prompt(self, operation_type, active_nodes, context=None):
        """
        Create a prompt for a specific operation type.
        
        :param operation_type: String indicating the operation (e.g., "generate", "merge")
        :param active_nodes: List of node dictionaries to include in the prompt
        :param context: Optional additional context for the prompt
        :return: A formatted prompt string ready for the LLM
        """
        if operation_type not in self.templates:
            raise ValueError(f"No template defined for operation: {operation_type}")
            
        template = self.templates[operation_type]
        
        # Format nodes for inclusion in the prompt
        node_text = ""
        for idx, node in enumerate(active_nodes):
            node_text += f"Node {idx}: {node['content']}\n"
            # Optionally include metadata that might help the LLM
            if 'score' in node and operation_type == 'refine':
                node_text += f"  Score: {node['score']}\n"
        
        # Add any additional context
        context_text = f"\nContext: {context}\n" if context else ""
        
        # Format the final prompt
        prompt = template.format(
            node_list=node_text,
            node_count=len(active_nodes),
            context=context_text
        )
        
        return prompt

Example template definitions that you might use:

prompt_templates = {
    "generate": "Based on the following thoughts, generate 3 new approaches or extensions:\n{node_list}\nProvide your responses as separate, numbered points.",
    
    "merge": "Combine the following {node_count} ideas into a single coherent solution:\n{node_list}\nEnsure your response integrates the strengths of each approach while addressing any contradictions.",
    
    "refine": "The following thought needs improvement:{node_list}\nRefine this idea to address any weaknesses, increase precision, or expand its applicability.",
    
    "validate": "Critically evaluate the following thought:{node_list}\nIdentify any logical errors, missing information, or assumptions. Rate its validity from 0-10 and explain your rating."
}
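
Usage is straightforward; for example (the node contents below are invented for illustration):

prompter = Prompter(prompt_templates)
prompt = prompter.create_prompt(
    "merge",
    [{"content": "Use a hash map for O(1) lookups"},
     {"content": "Sort the input first to enable binary search"}]
)
print(prompt)  # "Combine the following 2 ideas into a single coherent solution: ..."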

3.2 Parser

The Parser extracts structured information from LLM responses, updates the graph, and maintains data consistency:

import json
import re
import time


class Parser:
    def __init__(self, operation_parsers=None):
        """
        Initialize with specialized parsers for different operations.
        
        :param operation_parsers: Dict mapping operation types to specialized parsing functions
        """
        self.default_parser = self._basic_parse
        self.operation_parsers = operation_parsers or {}
    
    def parse_llm_output(self, llm_output, operation_type):
        """
        Parse LLM output based on operation type.
        
        :param llm_output: Raw text response from the LLM
        :param operation_type: Type of operation that generated this output
        :return: List of thought dictionaries extracted from the output
        """
        # Use specialized parser if available, otherwise default
        parser = self.operation_parsers.get(operation_type, self.default_parser)
        return parser(llm_output)
    
    def _basic_parse(self, llm_output):
        """
        Basic parsing that treats each paragraph as a separate thought.
        
        :param llm_output: Raw text from LLM
        :return: List of thought dictionaries
        """
        paragraphs = [p.strip() for p in llm_output.split("\n\n") if p.strip()]
        thoughts = [{"content": p, "created_at": time.time()} for p in paragraphs]
        return thoughts
    
    def parse_json_format(self, llm_output):
        """
        Parse JSON-formatted LLM output (useful when LLM is instructed to return JSON).
        
        :param llm_output: JSON string from LLM
        :return: List of thought dictionaries
        """
        try:
            # Extract JSON from potentially mixed output
            json_match = re.search(r'```json\n(.*?)\n```', llm_output, re.DOTALL)
            if json_match:
                json_str = json_match.group(1)
            else:
                # Try to find any JSON-like structure
                json_str = re.search(r'\{.*\}', llm_output, re.DOTALL).group(0)
                
            parsed = json.loads(json_str)
            
            # Handle both list and single object responses
            if isinstance(parsed, list):
                thoughts = parsed
            else:
                thoughts = [parsed]
                
            # Ensure created_at timestamp
            for thought in thoughts:
                if 'created_at' not in thought:
                    thought['created_at'] = time.time()
                    
            return thoughts
            
        except (json.JSONDecodeError, AttributeError) as e:
            print(f"JSON parsing failed: {e}. Falling back to basic parsing.")
            return self._basic_parse(llm_output)

For optimal results, you can instruct the LLM to return structured outputs (like JSON) to make parsing more reliable. This is especially important for complex operations like merges.
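
For instance, a generation template might request JSON directly, and parse_json_format can then be registered as that operation's parser. This pairing is an illustrative convention, not a requirement of the framework:

json_templates = {
    "generate": (
        "Based on the following thoughts, generate 3 new approaches:\n{node_list}\n"
        "Return ONLY a JSON list of objects, each with a \"content\" field, "
        "inside a ```json code block."
    )
}

parser = Parser()
parser.operation_parsers["generate"] = parser.parse_json_format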

3.3 Scoring & Validation

The scoring and validation module evaluates thoughts for quality, correctness, and relevance:

class ScoringValidator:
    def __init__(self, scoring_functions=None, domain_validators=None, llm=None, validation_prompt=None):
        """
        Initialize with custom scoring functions and optional LLM for validation.
        
        :param scoring_functions: Dict mapping score types to scoring functions
        :param domain_validators: Dict of domain-specific validation functions
        :param llm: Optional LLM client for validation scoring
        :param validation_prompt: Template for validation prompts if using LLM
        """
        # Basic scoring functions that work across domains
        self.scoring_functions = scoring_functions or {
            "length": lambda thought: len(thought["content"].split()),
            "complexity": lambda thought: len(set(thought["content"].split())) / max(1, len(thought["content"].split()))
        }
        
        # Domain-specific validators
        self.domain_validators = domain_validators or {}
        self.llm = llm
        self.validation_prompt = validation_prompt
    
    def score_and_validate(self, thought, metrics=None, domain=None):
        """
        Score a thought using specified metrics and validate it.
        
        :param thought: Thought dictionary to score
        :param metrics: List of metric names to calculate (if None, use all)
        :param domain: Optional domain for specialized validation
        :return: Updated thought with scores and validation result
        """
        metrics = metrics or list(self.scoring_functions.keys())
        
        # Apply each requested scoring function
        for metric in metrics:
            if metric in self.scoring_functions:
                thought[f"score_{metric}"] = self.scoring_functions[metric](thought)
        
        # Compute overall score (can be customized)
        if "score_overall" not in thought and metrics:
            thought["score_overall"] = sum(thought.get(f"score_{m}", 0) for m in metrics) / len(metrics)
        
        # Apply domain-specific validation if domain is specified
        if domain and domain in self.domain_validators:
            thought["valid"], validation_details = self.domain_validators[domain](thought)
            thought["validation_details"] = validation_details
        # Validate using LLM if available
        elif self.llm and self.validation_prompt:
            thought["valid"], thought["validation_details"] = self._validate_with_llm(thought)
        else:
            thought["valid"] = self._basic_validation(thought)
            
        return thought
    
    def _basic_validation(self, thought):
        """Simple validation based on content length and basic checks."""
        content = thought["content"]
        # Basic validation rules - customize for your use case
        min_length = 10  # words
        max_length = 1000  # words
        word_count = len(content.split())
        
        return word_count >= min_length and word_count <= max_length
    
    def _validate_with_llm(self, thought):
        """Use LLM to validate the thought."""
        prompt = self.validation_prompt.format(thought=thought["content"])
        response = self.llm(prompt)
        
        # Look for validation indicators in response
        valid_indicators = ["valid", "correct", "acceptable", "true"]
        invalid_indicators = ["invalid", "incorrect", "flawed", "false"]
        
        # Simple heuristic for validation
        lower_response = response.lower()
        valid_count = sum(1 for word in valid_indicators if word in lower_response)
        invalid_count = sum(1 for word in invalid_indicators if word in lower_response)
        
        return valid_count > invalid_count, response


# Example specialized validators for different domains

def create_math_validator():
    """Create a validator for mathematical reasoning tasks."""
    def validate_math_thought(thought):
        content = thought["content"]
        validation_details = {}
        
        # Check for numerical results
        validation_details["has_numbers"] = any(char.isdigit() for char in content)
        
        # Check for mathematical symbols
        math_symbols = ['+', '-', '*', '/', '=', '>', '<', '∫', '∑', '√']
        validation_details["has_math_symbols"] = any(sym in content for sym in math_symbols)
        
        # Check for common mathematical errors
        validation_details["division_by_zero"] = "division by zero" in content.lower() or "/0" in content
        
        # Overall validity check
        valid = (validation_details["has_numbers"] and 
                 validation_details["has_math_symbols"] and 
                 not validation_details["division_by_zero"])
        
        return valid, validation_details
    
    return validate_math_thought

def create_code_validator(language="python"):
    """Create a validator for code generation tasks."""
    def validate_code_thought(thought):
        content = thought["content"]
        validation_details = {}
        
        if language == "python":
            # Check for syntax errors
            try:
                import ast
                ast.parse(content)
                validation_details["syntax_valid"] = True
            except SyntaxError as e:
                validation_details["syntax_valid"] = False
                validation_details["syntax_error"] = str(e)
            
            # Check for common code smells
            validation_details["has_imports"] = "import " in content
            validation_details["potential_infinite_loop"] = "while True" in content and "break" not in content
            
            # Overall validity
            valid = validation_details.get("syntax_valid", False) and not validation_details.get("potential_infinite_loop", False)
        else:
            # Basic validation for other languages
            valid = True  # Default to assuming valid
            validation_details["note"] = f"Detailed validation for {language} not implemented"
        
        return valid, validation_details
    
    return validate_code_thought

# Example usage of domain-specific validators
domain_validators = {
    "math": create_math_validator(),
    "python_code": create_code_validator("python"),
    "javascript_code": create_code_validator("javascript")
}

# Use in the scoring validator
scorer = ScoringValidator(domain_validators=domain_validators)
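
A quick usage check (the thought content is invented for illustration):

thought = {"content": "We compute 12 / 4 = 3, so each of the four groups receives exactly 3 items"}
scored = scorer.score_and_validate(thought, domain="math")
print(scored["valid"], scored["validation_details"])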

For domain-specific tasks, you’ll want to add specialized scoring functions. For example:

# Example specialized scoring for a mathematical reasoning task.
# check_numerical_result, score_step_clarity, and count_errors are
# domain-specific helpers you would implement for your task.
math_scoring = {
    "numerical_correctness": lambda thought: check_numerical_result(thought["content"]),
    "step_clarity": lambda thought: score_step_clarity(thought["content"]),
    "error_count": lambda thought: -count_errors(thought["content"])  # Negative to penalize errors
}

# Example specialized scoring for code generation.
# check_code_syntax, run_test_cases, and estimate_complexity are likewise
# placeholders for your own implementations.
code_scoring = {
    "syntax_validity": lambda thought: check_code_syntax(thought["content"]),
    "test_coverage": lambda thought: run_test_cases(thought["content"]),
    "efficiency": lambda thought: -estimate_complexity(thought["content"])  # Negative for penalization
}

3.4 Controller & Graph of Operations

The Controller orchestrates the reasoning process according to a predefined or dynamic Graph of Operations:

import time


class Controller:
    def __init__(self, graph_of_operations, scoring_validator, llm_api, prompter, parser):
        """
        Initialize the controller with components and operation plan.
        
        :param graph_of_operations: List of operation dictionaries defining the reasoning strategy
        :param scoring_validator: ScoringValidator instance
        :param llm_api: Function to call the LLM
        :param prompter: Prompter instance
        :param parser: Parser instance
        """
        self.go_ops = graph_of_operations
        self.scoring_validator = scoring_validator
        self.llm_api = llm_api
        self.prompter = prompter
        self.parser = parser
        
        # Initialize graph data structure
        self.graph = {
            "nodes": [],  # List of thought dictionaries
            "edges": [],  # List of (from_id, to_id, metadata) tuples
            "metadata": {"created_at": time.time()}
        }
        self.active_nodes = []  # Currently active nodes
        self.node_counter = 0   # For generating unique IDs
    
    def initialize_graph(self, initial_state):
        """
        Set up the graph with initial thought(s).
        
        :param initial_state: String or list of strings as initial thoughts
        """
        if isinstance(initial_state, str):
            initial_states = [initial_state]
        else:
            initial_states = initial_state
            
        for state in initial_states:
            node_id = self._add_node({"content": state})
            self.active_nodes.append(node_id)
    
    def run(self):
        """
        Execute the graph of operations plan.
        
        :return: Final solution node
        """
        for op_idx, op in enumerate(self.go_ops):
            print(f"Executing operation {op_idx+1}/{len(self.go_ops)}: {op['type']}")
            
            if op["type"] == "generate":
                self._execute_generate(op)
            elif op["type"] == "merge":
                self._execute_merge(op)
            elif op["type"] == "refine":
                self._execute_refine(op)
            elif op["type"] == "select":
                self._execute_select(op)
            # Additional operation types can be added here
            
            # Optional: dynamic adjustment of the plan based on current state
            if op.get("adaptive", False) and op_idx < len(self.go_ops) - 1:
                self._adjust_plan(op_idx + 1)
        
        return self._get_final_solution()
    
    def _add_node(self, thought):
        """Add a new node to the graph and return its ID."""
        node_id = self.node_counter
        self.node_counter += 1
        
        # Add unique ID and score if not present
        thought["id"] = node_id
        if "score_overall" not in thought:
            thought = self.scoring_validator.score_and_validate(thought)
            
        self.graph["nodes"].append(thought)
        return node_id
    
    def _add_edge(self, from_id, to_id, metadata=None):
        """Add a new edge to the graph."""
        edge = (from_id, to_id, metadata or {})
        self.graph["edges"].append(edge)
    
    def _execute_generate(self, op):
        """Generate new thoughts from active nodes."""
        k = op.get("k", 3)  # Number of thoughts to generate per node
        metrics = op.get("metrics", None)  # Scoring metrics to use
        
        new_active_nodes = []
        
        for node_id in self.active_nodes:
            node = self._get_node_by_id(node_id)
            
            # Create prompt for generation
            prompt = self.prompter.create_prompt(
                "generate", 
                [node], 
                context=op.get("context")
            )
            
            # Get LLM response
            llm_output = self.llm_api(prompt)
            
            # Parse into new thoughts
            new_thoughts = self.parser.parse_llm_output(llm_output, "generate")
            
            # Limit to k thoughts if needed
            for thought in new_thoughts[:k]:
                # Score and add to graph
                thought = self.scoring_validator.score_and_validate(thought, metrics)
                new_node_id = self._add_node(thought)
                self._add_edge(node_id, new_node_id, {"operation": "generate"})
                new_active_nodes.append(new_node_id)
        
        # Update active nodes
        if op.get("update_active", True):
            self.active_nodes = new_active_nodes
    
    def _execute_merge(self, op):
        """Merge multiple nodes into new thought(s)."""
        input_selector = op.get("input_selector", "active")
        repeat = op.get("repeat", 1)
        
        if input_selector == "active":
            nodes_to_merge = [self._get_node_by_id(nid) for nid in self.active_nodes]
        elif input_selector == "top_k":
            k = op.get("k", 3)
            nodes_to_merge = self._get_top_k_nodes(k)
        else:
            # Custom selection logic could be added here
            nodes_to_merge = [self._get_node_by_id(nid) for nid in op.get("node_ids", [])]
        
        new_active_nodes = []
        
        for _ in range(repeat):
            # Create merge prompt
            prompt = self.prompter.create_prompt(
                "merge", 
                nodes_to_merge, 
                context=op.get("context")
            )
            
            # Get LLM response
            llm_output = self.llm_api(prompt)
            
            # Parse merged thought(s)
            merged_thoughts = self.parser.parse_llm_output(llm_output, "merge")
            
            for thought in merged_thoughts:
                # Score and add to graph
                thought = self.scoring_validator.score_and_validate(thought)
                new_node_id = self._add_node(thought)
                
                # Add edges from all input nodes to the merged node
                for node in nodes_to_merge:
                    self._add_edge(node["id"], new_node_id, {"operation": "merge"})
                
                new_active_nodes.append(new_node_id)
        
        # Update active nodes if specified
        if op.get("update_active", True):
            self.active_nodes = new_active_nodes
    
    def _execute_refine(self, op):
        """Refine existing nodes."""
        input_selector = op.get("input_selector", "active")
        metrics = op.get("metrics", None)
        
        if input_selector == "active":
            nodes_to_refine = [self._get_node_by_id(nid) for nid in self.active_nodes]
        elif input_selector == "top_k":
            k = op.get("k", 1)
            nodes_to_refine = self._get_top_k_nodes(k)
        else:
            nodes_to_refine = [self._get_node_by_id(nid) for nid in op.get("node_ids", [])]
        
        new_active_nodes = []
        
        for node in nodes_to_refine:
            # Create refinement prompt
            prompt = self.prompter.create_prompt(
                "refine", 
                [node], 
                context=op.get("context")
            )
            
            # Get LLM response
            llm_output = self.llm_api(prompt)
            
            # Parse refined thought
            refined_thoughts = self.parser.parse_llm_output(llm_output, "refine")
            
            for thought in refined_thoughts:
                # Score and add to graph
                thought = self.scoring_validator.score_and_validate(thought, metrics)
                new_node_id = self._add_node(thought)
                self._add_edge(node["id"], new_node_id, {"operation": "refine"})
                new_active_nodes.append(new_node_id)
        
        # Update active nodes if specified
        if op.get("update_active", True):
            self.active_nodes = new_active_nodes
    
    def _execute_select(self, op):
        """Select a subset of nodes based on criteria."""
        criteria = op.get("criteria", "top_k")
        update_active = op.get("update_active", True)
        
        if criteria == "top_k":
            # Select top-k nodes by a specified metric
            k = op.get("k", 1)
            metric = op.get("metric", "score_overall")
            selected_nodes = self._get_top_k_nodes(k, metric)
            selected_node_ids = [node["id"] for node in selected_nodes]
            
        elif criteria == "valid_only":
            # Select only nodes marked as valid
            selected_node_ids = [
                node["id"] for node in self.graph["nodes"] 
                if node.get("valid", False) and node["id"] in self.active_nodes
            ]
            
        elif criteria == "threshold":
            # Select nodes above a threshold value for a metric
            metric = op.get("metric", "score_overall")
            threshold = op.get("threshold", 0.5)
            selected_node_ids = [
                node["id"] for node in self.graph["nodes"]
                if node.get(metric, 0) >= threshold and node["id"] in self.active_nodes
            ]
            
        elif criteria == "latest":
            # Select the most recently added nodes
            k = op.get("k", 1)
            sorted_nodes = sorted(
                [n for n in self.graph["nodes"] if n["id"] in self.active_nodes],
                key=lambda n: n.get("created_at", 0),
                reverse=True
            )
            selected_node_ids = [node["id"] for node in sorted_nodes[:k]]
            
        elif criteria == "custom":
            # Use a custom selection function if provided
            custom_selector = op.get("selector_function")
            if custom_selector and callable(custom_selector):
                selected_node_ids = custom_selector(self.graph, self.active_nodes)
            else:
                selected_node_ids = self.active_nodes
                
        else:
            # Default to keeping current active nodes
            selected_node_ids = self.active_nodes
        
        # Ensure we have at least one selected node
        if not selected_node_ids:
            print("Warning: No nodes selected. Keeping all active nodes.")
            selected_node_ids = self.active_nodes
        
        # Update active nodes if specified
        if update_active:
            self.active_nodes = selected_node_ids
            
        return selected_node_ids
    
    def _get_node_by_id(self, node_id):
        """Get a node by its ID."""
        for node in self.graph["nodes"]:
            if node["id"] == node_id:
                return node
        return None
    
    def _get_top_k_nodes(self, k, metric="score_overall"):
        """Get the top k nodes by the specified metric."""
        sorted_nodes = sorted(
            self.graph["nodes"], 
            key=lambda n: n.get(metric, 0), 
            reverse=True
        )
        return sorted_nodes[:k]
    
    def _get_final_solution(self):
        """Select the final solution based on scoring."""
        valid_nodes = [n for n in self.graph["nodes"] if n.get("valid", False)]
        
        if not valid_nodes:
            print("Warning: No valid nodes found. Returning best available node.")
            return max(self.graph["nodes"], key=lambda n: n.get("score_overall", 0))
        
        return max(valid_nodes, key=lambda n: n.get("score_overall", 0))

A Graph of Operations configures the reasoning strategy. Here’s an example for a complex problem-solving task:

graph_of_operations = [
    # Initial exploration: generate 3 different approaches for each starting point
    {"type": "generate", "k": 3, "metrics": ["complexity", "relevance"]},
    
    # Select top 5 ideas based on overall score
    {"type": "select", "criteria": "top_k", "k": 5, "metric": "score_overall"},
    
    # Generate 2 refinements for each selected idea
    {"type": "generate", "k": 2, "context": "Focus on addressing weaknesses in the approach"},
    
    # Merge the best ideas into 2 candidate solutions
    {"type": "merge", "input_selector": "top_k", "k": 6, "repeat": 2},
    
    # Final refinement of the merged solutions
    {"type": "refine", "context": "Polish the solution for completeness and accuracy"}
]
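
Putting the pieces together, a minimal end-to-end run might look like the sketch below. Here mock_llm stands in for a real model client, and its canned response is a placeholder; in practice you would call your LLM API and return its text:

def mock_llm(prompt):
    # Stand-in for a real LLM call; a real implementation would send
    # `prompt` to the model and return the generated text.
    return ("Approach A: cache intermediate results across steps so repeated subproblems are not recomputed.\n\n"
            "Approach B: batch related subtasks into one prompt so shared context is only sent once.\n\n"
            "Approach C: prune low-scoring branches early so the graph stays small and cheap to expand.")

prompter = Prompter(prompt_templates)  # templates from Section 3.1
parser = Parser()                      # default paragraph-based parsing
scorer = ScoringValidator()            # default length/complexity scoring

controller = Controller(graph_of_operations, scorer, mock_llm, prompter, parser)
controller.initialize_graph("How can we reduce token usage in a multi-step LLM pipeline?")
best = controller.run()
print(best["content"], best.get("score_overall"))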

4. Core Transformations

Graph of Thoughts offers three fundamental transformations that distinguish it from simpler reasoning paradigms:

4.1 Generation

Generation produces multiple new thoughts from existing nodes. While superficially similar to branching in tree-based approaches, GoT allows generation from any node in the graph, not just leaf nodes. This enables:

  • Parallel exploration of multiple reasoning paths
  • Restarting exploration from promising intermediate states
  • Multiple perspectives on a single thought

[Figure: generation flowchart]

The generation operation might be implemented as:

def generate(thought, llm, k=3):
    """Generate k new thoughts from an existing thought."""
    prompt = f"Based on this idea: '{thought['content']}', generate {k} different extensions or alternatives."
    response = llm(prompt)
    # parse_into_separate_thoughts is a placeholder for your parsing logic,
    # e.g., Parser._basic_parse from Section 3.2.
    new_thoughts = parse_into_separate_thoughts(response)
    return new_thoughts[:k]

4.2 Refinement

Refinement represents the ability to loop back and improve an existing thought. This creates cycles in the reasoning graph, allowing iterative improvement that’s impossible in tree structures. Refinement is crucial for:

  • Error correction when problems are identified
  • Progressive enhancement of promising solutions
  • Specialized improvement targeting specific aspects of a solution

[Figure: refinement flowchart]

Refinement might focus on different aspects depending on the domain:

  • In code generation: improving efficiency, readability, or correctness
  • In writing: enhancing clarity, structure, or persuasiveness
  • In reasoning: addressing logical gaps or missing evidence

Example implementation:

def refine(thought, llm, focus=None):
    """Refine an existing thought, optionally with a specific focus."""
    focus_text = f" Focus specifically on improving {focus}." if focus else ""
    prompt = f"Improve the following thought:\n{thought['content']}\n{focus_text}"
    response = llm(prompt)
    return {"content": response, "derived_from": thought["id"]}

4.3 Aggregation

Aggregation (or merging) combines multiple thoughts into a new synthesis. This is perhaps the most powerful transformation in GoT, as it enables:

  • Consolidation of complementary insights
  • Resolution of contradictions between different approaches
  • Comprehensive solutions that incorporate multiple perspectives

[Figure: aggregation flowchart]

Tree structures fundamentally cannot represent merges from separate branches, making GoT uniquely suited for problems requiring solution combination.

def merge(thoughts, llm):
    """Combine multiple thoughts into a new synthesis."""
    thought_texts = [t["content"] for t in thoughts]
    formatted_thoughts = "\n".join([f"{i+1}. {t}" for i, t in enumerate(thought_texts)])
    
    prompt = f"Combine these ideas into a single coherent solution:\n{formatted_thoughts}\n\nSynthesized solution:"
    response = llm(prompt)
    
    return {
        "content": response,
        "derived_from": [t["id"] for t in thoughts]
    }

These core transformations can be combined in flexible ways to address complex problems, giving GoT its power and versatility.


5. Case Study: Document Merger

Let’s explore a real-world application of Graph of Thoughts: merging Non-Disclosure Agreements (NDAs) into a consolidated document.

5.1 Problem Statement

You have multiple NDAs with overlapping but distinct clauses. The goal is to produce a single comprehensive document that:

  • Retains all essential clauses
  • Eliminates redundancy
  • Maintains coherence and readability
  • Preserves legal enforceability

This task is challenging because it requires understanding legal content, identifying similarities and differences, and producing structured output with precise language.

5.2 GoT Implementation

First, we define the Graph of Operations for NDA merging:

nda_merger_operations = [
    # 1. Extract key clauses from each document
    {"type": "generate", "k": 1, "context": "Extract and list all key clauses from this document"},
    
    # 2. Group similar clauses across documents
    {"type": "merge", "context": "Group similar clauses from different documents, highlighting differences"},
    
    # 3. Draft initial merged document (2 versions)
    {"type": "generate", "k": 2, "context": "Create a draft merged NDA incorporating all essential clauses"},
    
    # 4. Evaluate drafts for coverage and redundancy
    {"type": "refine", "context": "Improve coverage of key clauses while reducing redundancy"},
    
    # 5. Final polish for coherence and legal precision
    {"type": "refine", "context": "Ensure the document flows logically and maintains precise legal language"}
]

[Figure: NDA merger workflow flowchart]

5.3 Implementation Details

For the NDA merging task, we need specialized components:

Custom Prompter Templates

nda_templates = {
    "generate": "Analyze the following NDA document and extract all key clauses as a bulleted list:\n{node_list}\n\nExtracted clauses:",
    
    "merge": "Compare and group similar clauses across these documents:\n{node_list}\n\nFor each group, highlight differences in terminology, scope, or requirements.",
    
    "refine": "Improve the following draft merged NDA:\n{node_list}\n\nFocus on: {context}"
}

Specialized Scoring Functions

# score_clause_coverage, score_redundancy, score_document_coherence, and
# score_legal_terminology are task-specific helpers you would implement;
# original_clauses is the set of clauses extracted in step 1.
nda_scoring = {
    "coverage": lambda thought: score_clause_coverage(thought["content"], original_clauses),
    "redundancy": lambda thought: -score_redundancy(thought["content"]),  # Negative to penalize
    "coherence": lambda thought: score_document_coherence(thought["content"]),
    "legal_precision": lambda thought: score_legal_terminology(thought["content"])
}

Sample Execution Flow

  1. Extract clauses from each original NDA
  2. Group similar clauses across documents
  3. Draft initial merged document (multiple versions)
  4. Score drafts for coverage, redundancy, etc.
  5. Refine best draft for completeness
  6. Final polish for coherence and legal precision
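
A hedged sketch of how these pieces could be wired together, reusing the Section 3 classes (llm_api, nda_text_1, and nda_text_2 are placeholders you would supply):

nda_prompter = Prompter(nda_templates)
nda_scorer = ScoringValidator()  # swap in the nda_scoring functions once implemented

controller = Controller(nda_merger_operations, nda_scorer, llm_api,  # llm_api: your model client
                        nda_prompter, Parser())
controller.initialize_graph([nda_text_1, nda_text_2])  # one starting node per document
merged_nda = controller.run()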

5.4 Results

Besta et al. report that this GoT approach produces higher-quality merged documents than:

  • Direct one-shot merging (asking the LLM to merge the documents in a single step)
  • Chain-of-Thought approaches (linear refinement without the grouping step)
  • Tree-of-Thoughts approaches (exploring multiple drafts, but without clause extraction and grouping)

The key advantage is that GoT explicitly represents and manipulates the relationships between clauses across documents before attempting to merge them, leading to more comprehensive coverage with less redundancy.


6. Benefits, Caveats, and Practical Tips

6.1 Benefits

Enhanced Reasoning Capabilities

  • Non-linear exploration: Freedom to revisit and refine earlier thoughts
  • Parallel computation: Maintain multiple approaches simultaneously
  • Solution synthesis: Combine strengths from different reasoning paths

Empirical Advantages

  • Higher accuracy: Particularly on complex, multi-step problems
  • Better resource efficiency: 15-35% token savings compared to other methods (Besta et al., 2023)
  • More robust solutions: Less sensitivity to prompt phrasing

Architectural Flexibility

  • Problem-specific customization: Tailor operations to domain requirements
  • Adaptive reasoning: Dynamically adjust the operation graph based on intermediate results
  • Verifiability: Graph structure makes reasoning steps explicit and traceable

6.2 Caveats

Implementation Complexity

  • State management: Tracking graph state requires more complex infrastructure
  • Operation design: Creating effective operation sequences requires expertise
  • Prompt engineering: More sophisticated prompts needed for graph operations

Resource Considerations

  • Multiple LLM calls: More individual calls (though potentially fewer tokens overall)
  • Latency: Multi-step reasoning increases end-to-end time
  • Scoring overhead: Evaluating node quality may require additional computation

LLM Limitations

  • Context window constraints: Large graphs may exceed context limits
  • Instruction following: LLMs may struggle with complex graph operations
  • Consistency: Maintaining coherence across multiple reasoning steps

6.3 Practical Tips

Starting Simple

  • Begin with small graphs: Start with 3-5 operations before scaling up
  • Constrained branching: Use k=2 or k=3 for generation to avoid combinatorial explosion
  • Clear operation semantics: Define precise expectations for each graph operation

Optimizing LLM Interaction

  • Structured output formats: Request JSON responses for easier parsing
  • Operation-specific prompting: Tailor prompts to each operation type
  • Example-based prompting: Include examples of desired outputs in prompts

Graph Management

  • Pruning strategies: Discard low-quality nodes early to manage graph size (see the sketch after this list)
  • Focused merging: Combine only the most promising thought branches
  • Incremental refinement: Use multiple smaller refinements instead of one large step
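
As an example of a pruning strategy, the "custom" selection criterion from Section 3.4 can host a score threshold; the 0.4 cutoff below is an arbitrary illustration:

def prune_low_scores(graph, active_nodes, min_score=0.4):
    """Keep only active nodes whose overall score clears the threshold."""
    kept = [
        node["id"] for node in graph["nodes"]
        if node["id"] in active_nodes and node.get("score_overall", 0) >= min_score
    ]
    return kept or active_nodes  # never empty the active frontier

prune_op = {
    "type": "select",
    "criteria": "custom",
    "selector_function": prune_low_scores  # called as selector_function(graph, active_nodes)
}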

Adapting to Your Task

  • Domain-specific scoring: Develop metrics relevant to your problem domain
  • Custom operations: Create specialized transformations for your use case
  • Hybrid approaches: Combine GoT with other reasoning techniques when appropriate

7. Implementation Patterns

Here are some common implementation patterns for different problem domains:

7.1 Mathematical Problem Solving

For complex mathematical tasks, use a pattern that emphasizes validation and multiple solution paths. (Note that the "validate" operation type is not handled by the Controller in Section 3.4 as written; it would be added alongside generate, merge, refine, and select.)

math_operations = [
    {"type": "generate", "k": 5, "context": "Generate different approaches to this problem"},
    {"type": "refine", "context": "Work out each approach step-by-step"},
    {"type": "validate", "context": "Check each solution for correctness"},
    {"type": "select", "criteria": "valid_only"},
    {"type": "merge", "context": "Synthesize a comprehensive solution with multiple methods"}
]

7.2 Code Generation

For software development tasks, incorporate testing and refactoring:

code_operations = [
    {"type": "generate", "k": 3, "context": "Generate alternative implementations"},
    {"type": "validate", "context": "Check code against test cases"},
    {"type": "refine", "context": "Optimize for efficiency and readability"},
    {"type": "merge", "context": "Combine the best elements of each implementation"},
    {"type": "refine", "context": "Final polish and documentation"}
]

7.3 Content Creation

For writing tasks, focus on iterative improvement:

writing_operations = [
    {"type": "generate", "k": 3, "context": "Generate outline and key points"},
    {"type": "select", "criteria": "top_k", "k": 1},
    {"type": "generate", "k": 1, "context": "Expand into a first draft"},
    {"type": "refine", "context": "Improve clarity and structure"},
    {"type": "refine", "context": "Enhance style and engagement"},
    {"type": "refine", "context": "Final proofreading and polish"}
]

7.4 Comparison of Implementation Patterns

The table below compares key characteristics of these implementation patterns:

| Domain | Key Operations | Branching Factor | Validation Emphasis | Merge Approach | Main Advantage |
|--------|----------------|------------------|---------------------|----------------|----------------|
| Mathematical problem solving | Generate → Refine → Validate → Merge | High (k=5) | Strong focus on correctness | Late merge, after validation | Multiple solution methods |
| Code generation | Generate → Validate → Refine → Merge → Refine | Medium (k=3) | Testing-based validation | Combines best implementation elements | Balance of efficiency and correctness |
| Content creation | Generate → Select → Generate → multiple Refinements | Low to medium (k=3 → k=1) | Subjective quality metrics | Early selection instead of merging | Iterative improvement cycle |
| Document merger (NDA case study) | Extract → Group → Draft → Refine → Polish | Low (k=1-2) | Coverage and redundancy metrics | Hierarchical merging (clauses, then documents) | Structured composition |

The key insight is that different problem domains benefit from different graph structures:

  • Mathematical problems need parallel exploration and careful validation
  • Code generation requires testing cycles and selective combination
  • Content creation focuses on incremental refinement of a chosen direction
  • Document merging uses hierarchical composition from elements to full documents

Each pattern leverages GoT’s core transformations in different proportions, with some emphasizing generation (mathematical problem solving), others refinement (content creation), and others merging (document merger).

8. References

  1. Besta, M., et al. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv:2308.09687.
  2. Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
  3. Wang, X., et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
  4. Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
  5. Liu et al. (2023). LLM Powered Autonomous Agents.

9. Conclusion

Graph of Thoughts represents a significant advancement in how we prompt and structure reasoning with large language models. By breaking free from the constraints of linear or tree-based thinking, GoT enables a more flexible, iterative, and combinatorial approach to problem-solving—one that more closely resembles how humans tackle complex problems.

The key innovations of GoT are:

  1. Flexible reasoning structures: The graph abstraction allows for cycles, merges, and non-linear exploration.
  2. Explicit intermediate states: All “thoughts” are preserved and available for reference or refinement.
  3. Customizable operations: The framework adapts to different problem domains through specialized operations.

As LLMs continue to advance, frameworks like GoT will become increasingly important for harnessing their capabilities on complex tasks. Future research directions include:

  • Automated operation planning: Dynamically constructing the Graph of Operations based on the problem
  • Multi-agent GoT: Using multiple specialized agents within the same graph framework
  • Human-in-the-loop integration: Combining human guidance with LLM reasoning in a shared graph

By implementing GoT in your own applications, you can unlock more sophisticated reasoning capabilities, produce higher quality solutions, and potentially reduce overall token usage through more efficient problem-solving strategies.

The graph-based future of LLM reasoning is just beginning, and we invite you to explore its potential in your own projects.
