Abstract
This piece dissects OPRO (Optimization by PROmpting), a framework that repurposes large language models (LLMs) not just as generators or reasoners, but as black-box optimizers. Instead of leaning on traditional algorithms, OPRO guides an LLM to iteratively refine solutions using nothing more than natural language prompts describing the task and solution history. We’ll unpack the mechanics, examine its performance across different domains, and show how it’s implemented.

1. Introduction
Optimization is fundamental. It’s the bedrock of finding better ways – whether refining parameters in linear regression, charting the traveling salesman problem, or wrestling with the newer beast of prompt engineering. The classical toolkit involves formal algorithms, calculus, gradients, or painstakingly tuned heuristics – machinery requiring precise specifications and often deep domain expertise. OPRO (“Optimization by PROmpting”) throws a curveball. It proposes using Large Language Models (LLMs) themselves as the optimization engine. Forget viewing LLMs merely as function approximators or clumsy reasoners for a moment. OPRO’s core premise is starkly different:

“Given an objective function we can evaluate, let the LLM iteratively propose better solutions. Guide it purely through a textual ‘meta-prompt’ that tracks the history of attempts and scores.”

In essence, OPRO reframes optimization as a prompting challenge. The LLM tries to one-up its previous suggestions, guided only by a description of the game and its past moves.
Why Does OPRO Warrant Attention?
- Sidesteps Formal Solvers: No need to implement complex optimization libraries. You need an objective function and a way to log attempts. The LLM is the solver.
- Unmatched Flexibility (in Theory): LLMs digest natural language constraints, examples, and stylistic goals (“make it concise,” “ensure it sounds ethical”) far more readily than rigid mathematical formulations. This opens doors for fuzzier, more creative optimization criteria.
- Surprisingly Competitive Performance: Let’s be clear, OPRO won’t dethrone specialized solvers for large, well-defined numerical problems. But for moderate tasks, especially those lacking established methods (like optimizing prompts themselves), it delivers non-trivial results.
- Democratizes the Field (Maybe): It potentially lowers the barrier to entry, allowing those without heavy math or coding backgrounds to engage with optimization concepts through language.
2. The OPRO Framework: Under the Hood
OPRO runs on a feedback loop that’s almost deceptively simple, yet potent.

Core Components
- Objective Function Evaluator: The black box that assigns a score (or loss, cost, etc.) to any proposed solution. If optimizing a math prompt, this function runs the prompt on a validation set and returns the accuracy. Simple.
- Meta-Prompt: The crucial textual instruction set steering the LLM. It typically bundles:
- A clear statement of the goal and the expected format of a solution.
- A running history of past proposed solutions and their scores.
- An explicit command to generate new solutions intended to beat the best scores seen so far.
- Optional seasoning: constraints, examples of good/bad solutions, domain hints.
- LLM as the Optimizer: The language model acts as the search engine. In each cycle:
- It reads the current meta-prompt (with updated history).
- It generates one or several new candidate solutions based on the prompt’s instructions.
- These candidates get fed to the Objective Function Evaluator.
- The newly scored solutions (at least the good ones) are added back into the meta-prompt for the next round.
- Stopping Criteria: The loop terminates when a preset iteration budget is exhausted, or when the LLM seems unable to find better solutions (diminishing returns).
The Optimization Loop
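In code, one pass through this loop is only a few steps. The sketch below is a minimal, generic version, not the official implementation: `evaluate` (the objective function) and `llm_generate` (the call to the optimizer LLM plus parsing of its reply) are assumed, user-supplied callables.

```python
# Minimal OPRO-style loop (sketch). `evaluate` and `llm_generate` are assumed,
# user-supplied callables - not part of any official OPRO API.
def opro_loop(evaluate, llm_generate, initial_solutions, steps=20, keep=15):
    # Score the starting solutions once up front: list of (solution, score) pairs.
    history = [(s, evaluate(s)) for s in initial_solutions]
    for _ in range(steps):
        # Meta-prompt = goal statement + scored history + "do better" instruction.
        history.sort(key=lambda pair: pair[1], reverse=True)  # best first
        meta_prompt = (
            "Goal: maximize the score.\n"
            "Previously tried solutions and their scores:\n"
            + "\n".join(f"{sol} -> {score:.3f}" for sol, score in history)
            + "\nPropose a new solution that beats the best score above."
        )
        # The LLM acts as the optimizer: it proposes candidates, we score them.
        for candidate in llm_generate(meta_prompt):
            history.append((candidate, evaluate(candidate)))
        # Prune the history so the next meta-prompt fits in the context window.
        history = sorted(history, key=lambda p: p[1], reverse=True)[:keep]
    return history[0]  # best (solution, score) found

```

A stopping check for stagnation (e.g., no improvement for k iterations) slots naturally into the same loop.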

Key Innovation
The subtle power of OPRO lies in how the LLM implicitly navigates the exploration vs. exploitation dilemma through natural language cues. Adjusting the meta-prompt with phrases like “Suggest solutions significantly different from previous attempts” versus “Propose minor refinements to the current best solution” allows high-level control without needing formal parameters like learning rates or momentum terms. The sampling temperature, as usual, acts as a knob: moderate values (around 1.0) encourage a mix of reusing successful patterns and trying novel variations.

3. Case Studies: OPRO in the Wild
Let’s see how OPRO fares when applied to concrete optimization tasks, moving from basic numerical fitting to combinatorial search.

3.1 Linear Regression: A Foundational Example
Linear regression is almost trivial numerically, but it serves as a clean sandbox for observing OPRO. The goal: find parameters (w, b) minimizing the sum of squared errors (SSE) for given data points (x_i, y_i).

```python
import random

def sse(w, b, X, Y):
    """Compute Sum of Squared Errors for a dataset."""
    return sum((y - (w*x + b))**2 for x, y in zip(X, Y))

# Generate synthetic data with some noise
X = list(range(1, 51))
w_true, b_true = 3, 10
Y = [w_true*x + b_true + random.gauss(0, 5) for x in X]

# Initialize with random solutions
solutions = []
for _ in range(5):
    w_init = random.uniform(0, 5)
    b_init = random.uniform(0, 15)
    error = sse(w_init, b_init, X, Y)
    solutions.append((w_init, b_init, error))

def build_meta_prompt(solutions):
    # Sort solutions by their SSE (ascending = better)
    solutions_sorted = sorted(solutions, key=lambda s: s[2])
    text = "# Linear Regression Optimization\n\n"
    text += "We want to find the parameters w and b for the model y = w*x + b that best fit the provided data by minimizing the Sum of Squared Errors (SSE).\n\n"
    # Display best solutions first to potentially leverage LLM recency bias
    text += "## Previous Solutions (ordered from best SSE to worst):\n\n"
    for w_, b_, val in solutions_sorted:
        text += f"* w={w_:.4f}, b={b_:.4f} → SSE={val:.2f}\n"
    text += "\n## Your Task:\n"
    text += "Based on the solutions above, propose NEW values for (w, b) that are likely to yield an EVEN LOWER SSE than the current best.\n"
    text += "Analyze the trend: how do changes in w and b affect the SSE?\n"
    text += "Output ONLY the proposed parameters in the format: (w=X.XXX, b=Y.YYY)\n"
    return text
# Placeholder function to simulate the LLM call
def ask_llm_for_solutions(prompt, n=4, temperature=1.0):
    """
    Placeholder to simulate querying an LLM.
    In a real implementation, this would call an LLM API (e.g., OpenAI, Anthropic, Google)
    and parse the LLM's textual response back into (w, b) tuples.
    """
    print("--- Sending Meta-Prompt to LLM ---")
    # print(prompt)  # Uncomment to see the prompt structure
    print("--- LLM Generating Solutions (Simulated) ---")
    # Simulate LLM behavior: generate plausible next steps from the best current solution
    if not solutions:  # Handle the initial call if needed
        return [(random.uniform(0, 5), random.uniform(0, 15)) for _ in range(n)]
    best_w, best_b, best_sse = min(solutions, key=lambda s: s[2])
    # Simple simulation: generate variations around the best solution.
    # A real LLM would use the full history and the natural language instructions.
    generated_solutions = []
    for _ in range(n):
        # Perturb slightly - a real LLM would be smarter
        w_new = best_w + random.gauss(0, 0.5 * (temperature**0.5))
        b_new = best_b + random.gauss(0, 1.0 * (temperature**0.5))
        generated_solutions.append((w_new, b_new))
        # Sometimes simulate a wild guess (exploration)
        if random.random() < 0.2 * temperature:
            generated_solutions.append((random.uniform(0, 5), random.uniform(0, 15)))
    print(f"Simulated LLM response: {generated_solutions[:n]}")  # Show generated candidates
    return generated_solutions[:n]  # Return the requested number
# --- Optimization Loop ---
num_steps = 10
max_solutions_in_prompt = 15  # Limit context window usage

for step in range(num_steps):
    print(f"\n--- Iteration {step+1}/{num_steps} ---")
    # Build the meta-prompt using the current solutions
    prompt = build_meta_prompt(solutions)
    # Query the LLM for new guesses (placeholder function here)
    new_candidates = ask_llm_for_solutions(prompt, n=4, temperature=1.0)
    if not new_candidates:  # Handle the case where the LLM fails to generate
        print("LLM failed to generate candidates, stopping.")
        break
    # Evaluate each new candidate
    evaluated_candidates = []
    for (w_guess, b_guess) in new_candidates:
        # Basic validation (optional but good practice)
        if isinstance(w_guess, (int, float)) and isinstance(b_guess, (int, float)):
            val = sse(w_guess, b_guess, X, Y)
            evaluated_candidates.append((w_guess, b_guess, val))
        else:
            print(f"Warning: Skipping invalid candidate format: ({w_guess}, {b_guess})")
    # Add newly evaluated solutions to our list
    solutions.extend(evaluated_candidates)
    # Keep only the top N solutions overall to manage context length for the next prompt
    solutions = sorted(solutions, key=lambda s: s[2])[:max_solutions_in_prompt]
    # Report the best solution found so far
    current_best = solutions[0]
    print(f"Best solution after iteration {step+1}: w={current_best[0]:.4f}, b={current_best[1]:.4f}, SSE={current_best[2]:.2f}")

# Final Report
best_w, best_b, best_sse = solutions[0]
print("\n--- Optimization Complete ---")
print(f"Best solution found: w={best_w:.4f}, b={best_b:.4f}, SSE={best_sse:.2f}")
print(f"True values were: w={w_true}, b={b_true}")
# SSE for the true parameters, for comparison (non-zero because of the noise)
true_sse = sse(w_true, b_true, X, Y)
print(f"SSE for true parameters (with noise): {true_sse:.2f}")
```
By tracking how the SSE responds to changes in w and b across iterations (as listed in the meta-prompt), the LLM can make educated guesses about directions likely to yield improvement, converging towards reasonable parameter values.
3.2 Traveling Salesman Problem (TSP)
TSP is the canonical NP-hard combinatorial challenge: find the shortest tour visiting each city once before returning home.
- Objective: A function calculating the total Euclidean distance of a proposed city permutation (tour).
- Meta-prompt: Provide city coordinates, list previously attempted routes, and their calculated distances.
- Solution Generation: Instruct the LLM to propose new city permutations likely to be shorter.
A meta-prompt for this setup might look like the following (distances computed from the coordinates shown):

```
# Traveling Salesman Problem Optimization

## Cities & Coordinates:
A: (0, 0)
B: (1, 3)
C: (4, 2)
D: (3, 5)
E: (7, 1)

## Previously Evaluated Routes (Best to Worst):
Route: A → B → D → C → E → A, Total Distance: 19.4
Route: A → B → C → D → E → A, Total Distance: 22.2
Route: A → C → B → D → E → A, Total Distance: 23.2

## Your Task:
Propose a NEW route (a permutation of A, B, C, D, E, starting and ending at A) that results in a SHORTER total distance than the current best (19.4).
Analyze the previous routes: which city transitions seem particularly long? Can you rearrange the sequence to avoid them while visiting all cities?
Output ONLY the route sequence, e.g., A -> X -> Y -> Z -> W -> A.
```
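On the evaluation side, the objective function is just the tour length. A minimal sketch, assuming the five cities above and routes expressed as lists of labels (`cities` and `tour_length` are illustrative names, not from the OPRO codebase):

```python
import math

# Coordinates matching the meta-prompt above.
cities = {"A": (0, 0), "B": (1, 3), "C": (4, 2), "D": (3, 5), "E": (7, 1)}

def tour_length(route, cities):
    """Total Euclidean distance of a closed tour, e.g. ["A", "B", "D", "C", "E", "A"]."""
    return sum(math.dist(cities[a], cities[b]) for a, b in zip(route, route[1:]))

# The LLM's proposed route is parsed into a list and scored like any other candidate.
print(round(tour_length(["A", "B", "D", "E", "C", "A"], cities), 1))  # ≈ 19.3, beating 19.4
```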
4. Prompt Optimization: OPRO’s Killer Application
Where OPRO truly carves its niche, arguably its killer app, is prompt optimization. Why? Because here the search space is language. We’re optimizing the very incantations used to coax behavior from another LLM. Traditional numerical methods are simply out of their depth when the ‘parameters’ are words, sentences, and stylistic nuances.

The Self-Improvement Loop
It becomes meta: use an LLM (guided by OPRO) to find better instructions for another LLM (or even itself) on some task.
- Task Definition: Pick a target task (e.g., solving grade-school math problems, writing Python code snippets) with a corresponding dataset (input-output pairs).
- Objective Function: Evaluate a candidate prompt (the “instruction”) by running it against the dataset with a target LLM and measuring performance (e.g., accuracy, code correctness score).
- OPRO Loop:
- Maintain a list of tried prompts and their measured performance.
- Feed this history into the meta-prompt, possibly including examples of task inputs/outputs.
- Instruct the OPRO LLM to generate new prompts anticipated to yield better performance on the task.

Practical Implementation
A sketch of the code for optimizing math prompts:

```python
import random

# Assume 'training_data' is a list of (question, correct_answer) tuples.
# Assume 'model' is an interface to the target LLM being optimized for.
# Assume 'opro_llm' is an interface to the LLM running the OPRO optimization.

def compute_instruction_accuracy(instruction, training_data, model):
    """Calculate the accuracy of a given instruction prompt on a dataset using the target model."""
    correct = 0
    if not training_data:
        return 0.0  # Handle the empty-data case
    for question, correct_answer in training_data:
        # Combine the instruction with the question for the target model
        prompt = f"{instruction}\n\nQuestion: {question}\nAnswer:"
        # Query the target LLM
        try:
            # Placeholder for the actual model call:
            # model_answer = model.generate(prompt)
            model_answer = f"Simulated answer based on '{instruction[:20]}...'"
            # Placeholder for answer checking; replace with a real evaluation
            # (e.g., comparing numerical results or string matching).
            # For the demo, pretend longer step-by-step prompts do better.
            if len(instruction) > 15 and "step" in instruction.lower():
                is_correct = random.random() < 0.7
            else:
                is_correct = random.random() < 0.5
            if is_correct:  # In real use: is_answer_correct(model_answer, correct_answer)
                correct += 1
        except Exception as e:
            print(f"Error processing question with instruction '{instruction}': {e}")
            # Decide how to handle errors, e.g., count as incorrect or skip
    return 100.0 * correct / len(training_data)

# Start with baseline instructions
initial_instructions = [
    "Solve the following math problem.",
    "Think step by step to solve this problem.",
    "Let's carefully break down and solve this math problem."
]

# Evaluate the initial instructions (using a mocked target model and mock data here)
solutions = []
mock_model = lambda p: "Simulated answer"  # Simple mock model interface
mock_training_data = [(f"What is {i}+{i+1}?", str(2*i + 1)) for i in range(10)]  # Mock data
for instr in initial_instructions:
    # In real use, replace mock_model and mock_training_data
    accuracy = compute_instruction_accuracy(instr, mock_training_data, mock_model)
    solutions.append((instr, accuracy))
def build_meta_prompt_for_prompt_opt(solutions, example_problems):
    # Sort by performance (highest accuracy first)
    sorted_sols = sorted(solutions, key=lambda s: s[1], reverse=True)
    prompt = "# Prompt Optimization for Math Problem Solving\n\n"
    prompt += "The goal is to find the best instruction prompt to give to an LLM to maximize its accuracy on grade-school math problems.\n\n"
    prompt += "## Previous Instructions Tried and Their Resulting Accuracy:\n\n"
    for instr, acc in sorted_sols:
        prompt += f"Instructions: \"{instr}\"\nAccuracy Achieved: {acc:.1f}%\n---\n"
    prompt += "## Example Problems from the Evaluation Dataset:\n\n"
    for question, answer in example_problems:
        prompt += f"Question: {question}\nCorrect Answer: {answer}\n\n"
    prompt += "## Your Task:\n"
    prompt += "Analyze the instructions above. Generate a NEW instruction prompt that is likely to achieve HIGHER accuracy than the current best.\n"
    prompt += "Consider what makes a good instruction for mathematical reasoning: clarity, encouraging step-by-step thinking, asking for checks, etc.\n"
    prompt += "The new instruction should be distinct from the ones already tried.\n"
    prompt += "Output ONLY the new instruction text."
    return prompt

def generate_new_prompts_with_llm(prompt, opro_llm_interface, n=5, temperature=1.0):
    """Generate new candidate instructions using the OPRO LLM."""
    print("--- Sending Meta-Prompt to OPRO LLM ---")
    # print(prompt)  # Uncomment to see the prompt
    print("--- OPRO LLM Generating New Instructions (Simulated) ---")
    # Placeholder for the actual OPRO LLM call:
    # response = opro_llm_interface.generate(prompt, max_tokens=50*n, temperature=temperature, n=n)
    # then parse 'response' to extract n instruction strings.
    # Here we simulate the LLM generating related but new prompts.
    base_prompts = [s[0] for s in solutions]
    new_prompts = []
    for _ in range(n):
        # Pick a previous prompt and modify it slightly (crude simulation)
        if solutions and random.random() < 0.8:
            chosen_prompt = random.choice(base_prompts)
            variation = random.choice([" Explain your reasoning.", " Show your work.", " Double check calculations.", " Be precise."])
            new_prompt = chosen_prompt + variation
            # Ensure it's actually new
            if new_prompt not in base_prompts and new_prompt not in new_prompts:
                new_prompts.append(new_prompt)
            else:  # Fallback if the modification is a duplicate
                new_prompts.append(f"New Prompt Attempt {random.randint(100, 999)}: Focus on logical steps.")
        else:  # Simulate more random exploration
            new_prompts.append(f"Creative Prompt {random.randint(100, 999)}: Use chain-of-thought reasoning.")
    print(f"Simulated OPRO LLM response: {new_prompts}")
    return new_prompts
# --- Prompt Optimization Loop ---
num_iterations = 5  # Kept small for brevity
max_prompts_in_history = 10
# Mock OPRO LLM; keyword names must match how it is called below
mock_opro_llm = lambda p, n=3, temperature=1.0: generate_new_prompts_with_llm(p, None, n, temperature)

for iteration in range(num_iterations):
    print(f"\n--- Prompt Optimization Iteration {iteration+1}/{num_iterations} ---")
    # Select a few example problems for the meta-prompt context
    examples = random.sample(mock_training_data, min(3, len(mock_training_data)))
    # Build the meta-prompt
    meta_prompt = build_meta_prompt_for_prompt_opt(solutions, examples)
    # Generate new candidate instructions using the OPRO LLM (mocked here)
    new_instructions = mock_opro_llm(meta_prompt, n=3, temperature=1.0)  # Generate 3 new prompts
    if not new_instructions:
        print("OPRO LLM failed to generate new instructions.")
        continue
    # Evaluate these new instructions using the target model (mocked here)
    evaluated_new_solutions = []
    for instr in new_instructions:
        if isinstance(instr, str) and instr.strip():  # Basic validation
            accuracy = compute_instruction_accuracy(instr, mock_training_data, mock_model)
            evaluated_new_solutions.append((instr, accuracy))
            print(f"Evaluated new instruction: \"{instr[:60]}...\" -> Accuracy: {accuracy:.1f}%")
        else:
            print(f"Warning: Skipping invalid instruction format: {instr}")
    # Add the new results to our pool
    solutions.extend(evaluated_new_solutions)
    # Prune the history: keep only the top performers
    solutions = sorted(solutions, key=lambda s: s[1], reverse=True)[:max_prompts_in_history]
    # Report progress
    if solutions:
        best_instr, best_acc = solutions[0]
        print(f"Best accuracy after iteration {iteration+1}: {best_acc:.1f}%, Instruction: \"{best_instr}\"")
    else:
        print("No valid solutions found yet.")

# Final Result
print("\n--- Prompt Optimization Complete ---")
if solutions:
    best_instr, best_acc = solutions[0]
    print(f"Optimized prompt: \"{best_instr}\"")
    print(f"Achieved accuracy (on training data): {best_acc:.1f}%")
else:
    print("Optimization failed to find any valid prompts.")
```
Empirical Results
Published results suggest OPRO yields tangible gains:
- GSM8K (grade-school math): Optimized prompts reportedly outperform strong human-designed baselines by up to 8% accuracy.
- Big-Bench Hard: Claims of up to +50% improvement on some complex reasoning tasks.
- Coding tasks: Demonstrable gains in code generation quality and functional correctness.
5. Practical Considerations and Best Practices
Getting OPRO to work well requires some finesse in setup and execution.

5.1 Meta-Prompt Engineering
The quality of the meta-prompt is paramount. It’s the steering wheel.
- Clarity is king: Define the objective and solution format unambiguously.
- History organization matters: Sorting solutions (e.g., best to worst) can influence the LLM. Recency bias is real.
- Explicit instructions: Directly ask for improvement over the current best. Tell it what you want.
- Constraint reinforcement: Remind the LLM of any hard constraints solutions must meet.
5.2 Balancing Exploration and Exploitation
The eternal optimization trade-off: search wide or dig deep?
- Temperature tuning: Start around 1.0. Nudge higher for more variety, lower for fine-tuning. Context matters.
- Diversity prompts: Explicitly ask for “substantially different solutions” if stuck in a local optimum.
- Batch generation: Generating multiple candidates (4-8) per iteration naturally increases exploration. (A sketch of combining these levers follows below.)
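One way these levers might be wired together: alternate the meta-prompt cue and the temperature per iteration, leaning towards exploration when progress stalls. This is a sketch; `llm_generate` is again an assumed placeholder for the actual LLM call.

```python
EXPLOIT_CUE = "Propose minor refinements to the current best solution."
EXPLORE_CUE = "Suggest solutions significantly different from previous attempts."

def next_candidates(llm_generate, meta_prompt, iteration, stalled_for):
    """Alternate exploitation and exploration; explore harder when progress stalls."""
    if stalled_for >= 3 or iteration % 4 == 0:
        cue, temperature = EXPLORE_CUE, 1.3   # widen the search
    else:
        cue, temperature = EXPLOIT_CUE, 0.7   # refine around the incumbent
    # Batch generation: several candidates per call increases exploration for free.
    return llm_generate(f"{meta_prompt}\n\n{cue}", temperature=temperature, n=6)
```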
5.3 Managing Solution History
The meta-prompt grows with each iteration. This runs into context window limits. Managing this history is a non-trivial bookkeeping exercise.
- Selective pruning: Don’t just keep the absolute best. Keep the top K, plus maybe a few diverse but lower-scoring outliers to avoid premature convergence (see the sketch after this list).
- Sampling the trajectory: Include examples from early, mid, and late stages of the optimization to give the LLM a sense of the journey.
- Summarization (Advanced): For very long runs, potentially use another LLM call to summarize patterns or insights from the full history, feeding that summary into the main OPRO loop.
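A sketch of the “top-K plus diverse outliers” pruning idea, assuming solutions are `(text, score)` pairs; the word-overlap measure and threshold here are crude stand-ins for whatever diversity notion fits your problem:

```python
def prune_history(solutions, top_k=8, extra_diverse=3, max_overlap=0.5):
    """Keep the top_k best solutions plus a few lower-scoring but dissimilar ones."""
    ranked = sorted(solutions, key=lambda s: s[1], reverse=True)
    kept = ranked[:top_k]
    kept_words = [set(text.lower().split()) for text, _ in kept]

    def looks_diverse(text):
        words = set(text.lower().split())
        # Diverse = low word overlap (Jaccard) with everything already kept.
        return all(len(words & kw) / max(len(words | kw), 1) <= max_overlap
                   for kw in kept_words)

    for text, score in ranked[top_k:]:
        if len(kept) >= top_k + extra_diverse:
            break
        if looks_diverse(text):
            kept.append((text, score))
            kept_words.append(set(text.lower().split()))
    return kept
```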
5.4 Avoiding Common Pitfalls
OPRO can go sideways. Watch out for:
- Overfitting the objective: Solutions might game the specific evaluation set. Always validate final candidates on a separate, held-out test set (a quick validation sketch follows this list).
- LLM Hallucinations: The model might propose solutions that are nonsensical or violate implicit constraints. Add validation logic before evaluation.
- Solution Stagnation: If the LLM keeps proposing minor variations of the same idea, explicitly prompt for diversity or increase temperature.
- Plateaus: Progress isn’t monotonic. Expect periods where improvement stalls. Sometimes reframing the task description in the meta-prompt can help.
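For the overfitting pitfall specifically, a final held-out check can reuse the evaluator from Section 4; `heldout_data` is a hypothetical split the optimizer never saw:

```python
def validate_finalists(solutions, heldout_data, model, top_n=3):
    """Re-score the top candidates on held-out data and pick by held-out accuracy."""
    finalists = sorted(solutions, key=lambda s: s[1], reverse=True)[:top_n]
    results = []
    for instruction, train_acc in finalists:
        heldout_acc = compute_instruction_accuracy(instruction, heldout_data, model)
        results.append((instruction, train_acc, heldout_acc))
        print(f"train={train_acc:.1f}%  heldout={heldout_acc:.1f}%  :: {instruction[:60]}")
    # Trust the held-out number, not the training score the optimizer chased.
    return max(results, key=lambda r: r[2])
```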
6. Limitations and Future Directions
OPRO is clever, but it’s not magic. Understanding its boundaries is crucial.

OPRO vs. Traditional Optimization Methods
| Feature | OPRO | Traditional Methods (e.g., Gradient Descent, GA, Simulated Annealing) |
|---|---|---|
| Problem Specification | Natural language description | Formal mathematical model (equations, constraints) |
| Implementation | Prompt engineering, API calls | Algorithm implementation, library usage |
| Gradient Required | No | Often yes (gradient-based); no (derivative-free) |
| Constraint Handling | Flexible (via language) | Rigid (must be mathematically encoded) |
| Search Mechanism | LLM’s implicit reasoning/pattern matching | Explicit algorithmic search strategy |
| Performance (small/medium) | Often surprisingly good | Generally strong |
| Performance (large scale) | Context-window limited, can degrade | Often designed for scale |
| Computational Cost | Can be high (many LLM calls) | Varies, often more efficient per step |
| Explainability | Potentially high (LLM might explain choices) | Algorithm-dependent (gradients vs. black-box heuristics) |
Current Limitations
- Scaling: Struggles with very high-dimensional or large combinatorial problems where history becomes unwieldy. TSP with 1000 cities? Forget it.
- Context Window Bottleneck: The amount of history the LLM can consider is physically limited by its architecture.
- LLM Reliability: Performance hinges entirely on the underlying LLM’s capabilities. A weaker LLM makes a weaker optimizer. Biases or flaws in the LLM propagate.
- Cost: LLM API calls aren’t free. OPRO can be computationally expensive compared to lean numerical methods.
Future Research Directions
The OPRO concept is young. Plausible next steps include:
- Hierarchical OPRO: Break down complex problems; use OPRO to optimize sub-problems, then another OPRO layer to combine solutions.
- Hybrid Systems: Use OPRO for initial exploration or for optimizing parts of a problem (like configuration) while using traditional solvers for numerical heavy lifting.
- Adaptive Meta-Prompting: Have another agent (or the LLM itself) dynamically refine the meta-prompt based on observed progress or lack thereof.
- Cross-Model Synergy: Use a powerful (expensive) LLM via OPRO to discover optimal prompts/strategies for a smaller, cheaper target LLM.

7. Conclusion: The OPRO Paradigm
OPRO offers a genuinely different perspective on optimization. By treating it as an iterative dialogue focused on improvement, mediated entirely through language, it unlocks capabilities previously demanding specialized code and expertise. Its core contribution is demonstrating that LLMs, with the right scaffolding, can act as general-purpose optimizers by learning from textual histories of success and failure. This is particularly potent for prompt optimization, a domain inherently resistant to traditional techniques due to its linguistic nature. As LLMs grow more sophisticated in their reasoning and context handling, OPRO’s power will likely increase. It might not replace classical optimization everywhere, but it represents a significant expansion of the toolkit, particularly for problems where flexibility, natural language constraints, and leveraging the LLM’s own “understanding” are paramount.

Resources
- OPRO GitHub (official implementation): https://github.com/google-deepmind/opro
- Original Paper: “Large Language Models as Optimizers,” Yang et al. (2023)
Bottom Line: OPRO cleverly weaponizes the LLM’s ability to process and generate text, turning it into an iterative improvement engine. By feeding it a history of attempts and asking it to “do better,” we can guide it towards superior solutions, especially in domains like prompt engineering where the solution space itself is language. It’s optimization, recast as conversation.