
September 5, 2023

Generative Agents: Building a Virtual Smallville with AI Citizens

Introduction: The Virtual World of AI Agents

Back in April 2023, a paper dropped from Stanford (Park et al.) that did more than just showcase another LLM trick. It sketched out a different kind of AI: generative agents. They’re computational puppets designed to mimic believable human-ish behavior within an open sandbox. They remember things (sort of), they have goals (kind of), they form relationships (maybe), and they react.

The researchers built Smallville (or “AI Town”)—a tiny digital petri dish populated by 25 of these agents. Each started with a bare-bones backstory – “John Lin runs the pharmacy, nice guy, family man…” – but the interesting part was watching them evolve. They developed routines, interacted, remembered (or misremembered) encounters, and built a rudimentary social fabric.

This isn’t The Sims with fancier dialogue trees. The agents in Smallville aren’t following rigid scripts. They exhibit emergent behavior. They wake up, figure out a plan for the day (often deviating), chat using natural language, and try to adapt when things go off-script. All of this orchestrated by a large language model pulling the strings, trying to maintain the illusion of coherence. Less programmed actors, more… digital beings stumbling through simulated lives.

(source: paper)

The Smallville Experiment: A Living Laboratory

The Digital Environment

Smallville itself is basic pixel art, a stage set for the AI drama:

  • Places to Be: Houses, the requisite cafe and bar, a park, shops, a school. The usual stage dressings.
  • Things to Touch: Objects agents can supposedly interact with – stoves, beds, desks. Not just background props.
  • Pixel People: Simple 2D sprites shuffling around the map.
  • Language Layer: Crucially, the environment isn’t just coordinates. It’s described in text. The stove isn’t just object #42; it’s “<Isabella’s stove>” allowing the LLM to “understand” and potentially update its state via prompts.
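
To make that concrete, here is a minimal Python sketch (my own toy representation, not the paper's actual data model) of an environment object whose state lives in natural language, so both the user and the LLM can read and rewrite it:

class WorldObject:
    """Toy environment object whose state is plain text the LLM can read and rewrite."""
    def __init__(self, name: str, state: str):
        self.name = name    # e.g. "<Isabella's stove>"
        self.state = state  # e.g. "idle"

    def describe(self) -> str:
        # This description is what gets dropped into agent prompts.
        return f"{self.name} is {self.state}"

stove = WorldObject("<Isabella's stove>", "idle")
print(stove.describe())   # -> <Isabella's stove> is idle

# A user intervention (or an agent's action, once parsed) simply rewrites the text:
stove.state = "on fire"
print(stove.describe())   # -> <Isabella's stove> is on fire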

Smallville Environment

User Interaction Modes

You, the observer (or meddler), can poke the simulation:

  1. Chat: Talk to agents directly, pretending to be someone else (a reporter, a curious ghost).
  2. Tweak the World: Directly change the state description (e.g., tell the system “<Isabella’s stove> is now on fire”).
  3. Inner Monologue: Whisper suggestions into an agent’s “mind,” trying to nudge their intentions.

A Social Experiment in Action

The classic anecdote from the paper involves a Valentine’s Day party. One agent is prompted with the idea. What follows isn’t a broadcast message, but a slow diffusion through the agents’ gossip network. The idea spreads organically. Invitations are made, schedules are checked (and sometimes conflict), and eventually, a handful of agents actually show up. It’s messy, imperfect, and surprisingly realistic – a glimpse of simulated social dynamics emerging from simple rules and LLM calls.

Information Propagation


The Architecture: How Generative Agents Work

The magic, or perhaps the clever engineering, isn’t just throwing an LLM at the problem. It’s the cognitive scaffolding built around the LLM to give these agents a semblance of persistent identity and memory over time. LLMs famously have the memory of a goldfish, constrained by their context windows. This architecture tries to fake long-term memory.

It’s a five-stage loop cycling constantly for each agent:

  1. Perceive: What’s happening right now?
  2. Retrieve: What relevant memories do I have about this?
  3. Plan: What was I supposed to be doing? Does this change things?
  4. Act: Say something, do something.
  5. Reflect: Pause and synthesize recent experiences into broader insights.

Generative Agent Cognitive Loop

Let’s dissect this simulated brain.
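
In code, one tick of that loop might look like this. It's a bare-bones sketch: the perceive, retrieve, update_plan, act, should_reflect, and reflect methods are hypothetical placeholders for the components unpacked below, not the paper's actual API.

def agent_tick(agent, world):
    """One pass through the cognitive loop for a single agent (conceptual sketch)."""
    observations = agent.perceive(world)          # 1. Perceive: what's happening right now?
    for obs in observations:
        agent.remember(obs)                       #    log each observation in the memory stream
    memories = agent.retrieve(observations)       # 2. Retrieve: surface relevant memories
    agent.update_plan(observations, memories)     # 3. Plan: keep or revise the current plan
    action = agent.act(observations, memories)    # 4. Act: the LLM generates speech or behavior
    world.apply(action)                           #    the action becomes an event others perceive
    if agent.should_reflect():
        agent.reflect()                           # 5. Reflect: periodically distill insights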

1. Memory Stream: The Foundation of Persistent Identity

“A database to maintain a comprehensive record of an agent’s experience in natural language.”

This is the agent’s diary, a raw, chronological dump of everything it observes, does, or thinks. It’s stored as natural language strings:

"John Lin sees Sam Moore walk by the pharmacy at 9:05am."
"John Lin helps a customer find cold medicine at 9:20am."
"John Lin thinks that business has been slower than usual this week."

It’s comprehensive, sure, but also potentially enormous and unwieldy. An agent can’t possibly feed its entire life story back into the LLM for every decision. That leads to the next challenge.
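
At its simplest, each entry is nothing more than a timestamped string appended to a growing list; a minimal sketch follows (the fuller MemoryItem class in the implementation section below adds importance scores and cached embeddings):

import time

memory_stream = []  # chronological log of everything the agent observes, does, or thinks

def observe(text: str) -> None:
    """Append one natural-language observation to the stream."""
    memory_stream.append({"text": text, "timestamp": time.time()})

observe("John Lin sees Sam Moore walk by the pharmacy at 9:05am.")
observe("John Lin thinks that business has been slower than usual this week.")
print(len(memory_stream))  # -> 2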


2. Memory Retrieval: Finding What Matters

Because the LLM’s context window is finite, the agent needs a way to surface relevant memories without being overwhelmed. This system uses a scoring heuristic, combining three factors:

  • Recency: Stuff that just happened matters more. Simple decay function.
  • Importance: Not all memories are created equal. The LLM itself is asked to rate the significance of each memory event (e.g., “planned a wedding” > “brushed teeth”). This feels a bit circular, but okay.
  • Relevance: How semantically similar is a past memory to the current situation? This uses vector embeddings and cosine similarity – standard stuff.

The scores are combined, the top N memories are retrieved, and only these are fed into the LLM prompt when deciding the next action. It’s a necessary hack to simulate recall within computational limits.

The scoring logic looks something like this:

\text{Score}(m) = \alpha \cdot \text{Recency}(m) + \beta \cdot \text{Importance}(m) + \gamma \cdot \text{Relevance}(m, q)

Where:

  • \text{Recency}(m) uses exponential decay based on time \Delta t.
  • \text{Importance}(m) is the LLM’s [0,1] score.
  • \text{Relevance}(m, q) is the cosine similarity between memory embedding \vec{m} and query embedding \vec{q}.
  • \alpha, \beta, \gamma are tunable weights.

Scoring Factors
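
Plugging in made-up numbers makes the trade-off concrete (the weights and decay rate here are illustrative guesses, not the paper's tuned values):

import numpy as np

alpha, beta, gamma = 0.3, 0.3, 0.4      # illustrative weights
decay = 0.005                            # assumed per-hour recency decay

def score(hours_since_access: float, importance: float, relevance: float) -> float:
    """Combine the three factors exactly as in the formula above."""
    recency = np.exp(-decay * hours_since_access)
    return alpha * recency + beta * importance + gamma * relevance

# A fresh but trivial memory vs. an older, important, highly relevant one:
print(round(score(hours_since_access=1,  importance=0.1, relevance=0.2), 2))   # ~0.41
print(round(score(hours_since_access=48, importance=0.9, relevance=0.8), 2))   # ~0.83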

3. Reflection: Building Higher-Level Understanding

“Agents, when equipped only with raw observational data, struggle to generalize or make inferences.”

Raw experience logs aren’t enough for deeper understanding. The reflection mechanism prompts the LLM periodically: “Look at your recent memories. What are the key takeaways? What patterns do you see?” This generates higher-level insights:

“Klaus Mueller seems really passionate about his research.”
“I seem to enjoy organizing parties.”

These reflections are then stored back into the memory stream, usually tagged with high importance. This is how agents are supposed to develop consistent character traits, opinions, and understandings of relationships over time. It’s an attempt to distill meaning from the noise.

Reflection Process

4. Planning: Maintaining Temporal Coherence

“Plans help agents create daily schedules and maintain coherence over long horizons.”

LLMs are great at improvising the next sentence, terrible at sticking to a plan made hours ago. The planning module tries to impose some temporal discipline. Agents generate rough daily schedules, prompted by the LLM:

1) 7:00–7:30 AM: Wake up, routine.
2) 7:30–8:00 AM: Breakfast.
3) 8:00–9:00 AM: Open the pharmacy.
...

These plans act as guardrails. Agents can break them down into smaller actions and react to immediate events (like the burning stove), potentially triggering a replan. But the existence of the plan helps prevent total incoherence – the agent who planned to go to the 5 PM party is less likely to randomly decide to be somewhere else then.
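
A rough sketch of how such a schedule might be produced and parsed (the prompt wording and the crude line-filtering are my own guesses, not the paper's exact recipe; llm stands for any text-completion callable):

from typing import List

def generate_daily_plan(llm, name: str, agent_summary: str, date_str: str) -> List[str]:
    """Ask the LLM for a coarse daily schedule and keep the numbered lines it returns (sketch)."""
    prompt = (
        f"{agent_summary}\n"
        f"Today is {date_str}. Write {name}'s rough plan for the day as a numbered list of "
        "time slots, one per line, e.g. '1) 7:00-7:30 AM: Wake up, routine.'"
    )
    raw = llm(prompt)
    # Keep only lines that look like numbered schedule entries; discard any extra chatter.
    return [ln.strip() for ln in raw.splitlines() if ln.strip() and ln.strip()[0].isdigit()]

# plan = generate_daily_plan(llm, "John Lin", "John Lin is the pharmacist in Smallville...", "February 13")
# Each entry can then be decomposed recursively into 5-15 minute actions the same way.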

Planning System

5. Action Generation: Responding to the World

Putting it all together: when an agent needs to act, the system feeds the LLM a prompt containing:

  1. The current perception (what’s happening).
  2. The top retrieved memories.
  3. The relevant part of the current plan.

The LLM then generates the action or dialogue line. If Agent 1 says something, that becomes a perception event for Agent 2, triggering Agent 2’s retrieve/plan/act cycle, leading to a response. This back-and-forth creates the conversations and interactions.
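
Roughly, the prompt that drives each action just stitches those three ingredients together (a sketch with invented wording; the paper's actual templates are more elaborate):

from typing import List

def build_action_prompt(name: str, perception: str, memories: List[str], plan_slot: str) -> str:
    """Assemble the context the LLM sees when deciding an agent's next move (sketch)."""
    memory_block = "\n".join(f"- {m}" for m in memories)
    return (
        f"You are {name}.\n"
        f"Relevant memories:\n{memory_block}\n"
        f"Current plan: {plan_slot}\n"
        f"Right now: {perception}\n"
        f"What does {name} do or say next? Answer in one short sentence."
    )

prompt = build_action_prompt(
    name="John Lin",
    perception="Tom Moreno walks into the pharmacy and says hello.",
    memories=["John Lin knows Tom Moreno is a neighbor.",
              "John Lin heard Sam Moore is running for mayor."],
    plan_slot="9:00 AM - 12:00 PM: Serve customers at the pharmacy",
)
# The LLM's completion becomes John Lin's action, which is then logged as a
# perception event for Tom Moreno, whose own cycle produces the reply.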



Emergent Social Dynamics: What Researchers Discovered

Running Smallville with 25 agents revealed some behaviors that weren’t explicitly programmed:

1. Information Spread

Like the Valentine’s party, news of Sam Moore running for mayor spread organically. Starting with only Sam knowing, the info percolated through conversations, reaching about a third of the town within two simulated days. A basic demonstration of information diffusion.

2. Relationship Formation

Agents started forming connections. The paper measured “network density”—pairs who knew each other—increasing significantly. Agents who hung out together or shared interests seemed to form stronger bonds (or at least, their LLM prompts reflected this). Relationships bloomed from simulated proximity and interaction.

Relationship Evolution

3. Social Coordination

The Valentine’s party again. Isabella hosted, invited friends, they spread the word. Five agents actually showed up. Others got invited but were “busy” or “forgot”—failures in planning or memory retrieval. This imperfect coordination felt more plausible than perfect attendance, mimicking the sometimes-flaky reality of social planning.

Importance of the Architecture Components

The researchers ran ablation studies (removing components) and “interviewed” the agents. The results suggested each piece mattered:

  • Full Architecture: Agents gave coherent answers referencing past events and future plans.
  • No Reflection: Agents struggled to form stable opinions or generalizations.
  • No Planning: Agents were more scattered, lacking long-term goal consistency.
  • No Memory: Agents became generic chatbots with no personal history.

The scaffolding, it seems, is doing real work in creating the illusion of a persistent self.


Challenges and Limitations

Let’s not get carried away. This is clever engineering, but it’s far from sentient AI. The cracks show:

Memory and Retrieval Limitations

  • Scalability: The memory stream grows linearly. Retrieval gets harder and slower. This won’t scale to long lifetimes without better summarization or pruning.
  • Retrieval Failures: The heuristic is imperfect. Crucial memories can be missed if they don’t score well on recency/importance/relevance. Agents might forget promises or repeat mistakes. Semantic similarity isn’t always contextual relevance.
  • The Achilles’ Heel: The context window limit of the LLM remains the fundamental bottleneck. All this architecture is elaborate scaffolding around that core limitation.

Reflection Quality Issues

  • Garbage In, Garbage Out: Reflections are only as good as the LLM’s ability to synthesize. Weak reasoning leads to shallow or contradictory insights.
  • Stability: If the environment changes drastically, old reflections might become nonsensical anchors.
  • Lack of True Understanding: Reflections are still pattern-matching, not deep comprehension.

Planning Constraints

  • LLM Reasoning Flaws: LLMs are not brilliant logicians or planners. Schedules can have conflicts or nonsensical steps. Complex, multi-day planning is likely beyond current models.
  • Adaptability: While agents can replan, their ability to adapt intelligently to major disruptions is limited by the LLM’s common sense (or lack thereof).

Behavioral Authenticity

  • The Politeness Problem: LLMs trained on internet data and fine-tuned for helpfulness often make agents too nice, too agreeable. Real human interaction involves conflict, disagreement, irrationality. Smallville seems a bit too utopian. The agents lack… edge.
  • Hallucination: The system can still confabulate details or misremember events, just like any LLM.

Implementation: A Code Glimpse

Here’s a simplified Python sketch showing the conceptual structure of the memory and retrieval part. Don’t expect production-ready code, but it illustrates the flow:

import time
import numpy as np
from typing import List, Tuple
# In a real system, plug in actual embedding and LLM clients here
# (e.g. OpenAI / LangChain wrappers); the classes below are stand-ins.

# Placeholder classes for demonstration
class EmbeddingModel:
    def embed_text(self, text: str) -> np.ndarray:
        # Replace with actual embedding model call
        print(f"Embedding: '{text[:30]}...'") # Debug print
        return np.random.rand(768) # Example dimension

class LLM:
    def __call__(self, prompt: str) -> str:
        # Replace with actual LLM API call
        print(f"LLM Prompt: '{prompt[:100]}...'") # Debug print
        if "Rate the significance" in prompt:
            return "7" # Dummy importance score
        if "Generate 1-3 high-level insights" in prompt:
            return "I seem to spend a lot of time at the pharmacy.\nCustomers ask about Sam Moore often." # Dummy reflection
        return "LLM Response Placeholder"

embedding_model = EmbeddingModel() 
llm = LLM() 

def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors"""
    # Ensure vectors are numpy arrays
    v1 = np.asarray(v1)
    v2 = np.asarray(v2)
    if v1.shape != v2.shape:
        print(f"Warning: Shape mismatch {v1.shape} vs {v2.shape}")
        return 0.0 # Or handle error appropriately

    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0
    return dot_product / (norm_v1 * norm_v2)

class MemoryItem:
    def __init__(self, text: str, timestamp: float, importance: float = 1.0):
        self.text = text
        self.timestamp = timestamp
        # Ensure importance is within a reasonable range, e.g., [0, 1]
        self.importance = max(0.0, min(1.0, importance)) 
        self.last_access = timestamp
        # Cache the embedding to avoid recomputing
        self.embedding = None # Needs to be np.ndarray or similar
    
    def get_embedding(self) -> np.ndarray:
        """Lazy load the embedding when needed"""
        if self.embedding is None:
            # Ensure the embedding call returns a numpy array
            self.embedding = np.array(embedding_model.embed_text(self.text))
        # Ensure it's always returned as a numpy array
        return np.asarray(self.embedding)


class Agent:
    def __init__(self, name: str, initial_memories: List[str]):
        self.name = name
        # A list of MemoryItem
        self.memory_stream: List[MemoryItem] = []
        # Weights for the memory retrieval scoring function
        self.recency_weight = 0.3
        self.importance_weight = 0.3
        self.relevance_weight = 0.4
        # Exponential decay factor for recency
        self.recency_decay_lambda = 0.005 # Per hour decay rate

        # Populate initial memories with current time
        for mem_text in initial_memories:
            # Importance scoring calls the LLM and uses self.name, which is already set above
            importance_score = self.get_importance(mem_text) 
            mem_item = MemoryItem(
                text=mem_text,
                timestamp=time.time(),
                importance=importance_score 
            )
            self.memory_stream.append(mem_item)
    
    def get_importance(self, text: str) -> float:
        """Ask the LLM to rate the importance of a memory"""
        prompt = f"Rate the significance of this memory for {self.name} on a scale 1-10:\nMemory: {text}"
        response = llm(prompt)
        try:
            # Attempt to parse integer, then scale to [0, 1]
            score = float(int(response.strip())) / 10.0
            # Clamp score between 0 and 1
            return max(0.0, min(1.0, score))
        except ValueError:
             # Handle cases where LLM response isn't a number
            print(f"Warning: Could not parse importance score from LLM response: '{response}'")
            return 0.1  # Default low importance if parsing fails
    
    def remember(self, new_observation: str) -> None:
        """Add a new memory to the memory stream"""
        now = time.time()
        importance = self.get_importance(new_observation)
        item = MemoryItem(new_observation, now, importance)
        self.memory_stream.append(item)
    
    def retrieve_relevant_memories(self, query: str, limit: int = 5) -> List[MemoryItem]:
        """Retrieve memories relevant to the current query"""
        # 1) compute embedding of query
        query_emb = embedding_model.embed_text(query)
        
        # 2) score each memory
        memory_scores: List[Tuple[float, MemoryItem]] = []
        now = time.time()
        if not self.memory_stream: # Handle empty memory stream
            return []

        for mem in self.memory_stream:
            # recency factor (exponential decay)
            # Ensure last_access is a float
            hours_elapsed = (now - float(mem.last_access)) / 3600.0
            recency_score = np.exp(-self.recency_decay_lambda * hours_elapsed) 
            
            # importance score (already normalized to [0,1])
            imp_score = mem.importance
            
            # semantic relevance (cosine similarity)
            mem_emb = mem.get_embedding()
            # Ensure query_emb is also a numpy array for cosine_similarity
            relevance_score = cosine_similarity(np.asarray(query_emb), mem_emb)
            
            # Weighted sum of factors
            final_score = (
                self.recency_weight * recency_score +
                self.importance_weight * imp_score +
                self.relevance_weight * relevance_score
            )
            memory_scores.append((final_score, mem))
        
        # sort by final_score desc
        memory_scores.sort(key=lambda x: x[0], reverse=True)
        
        # pick top N memories 
        # Adjust limit if fewer memories exist than requested
        actual_limit = min(limit, len(memory_scores))
        top_memories = [m[1] for m in memory_scores[:actual_limit]]
        
        # Update last_access time for retrieved memories
        for mem in top_memories:
            mem.last_access = now
        
        return top_memories
    
    def reflect(self, recent_count: int = 10) -> None:
        """
        Form higher-level insights from recent memories and add them as new memories.
        """
        if len(self.memory_stream) < recent_count:
            print(f"Agent {self.name}: Not enough memories ({len(self.memory_stream)}) to reflect.")
            return  # Not enough memories to reflect on
            
        # Get the most recent n memories (ensure sorting is correct)
        recent_memories = sorted(
            self.memory_stream, 
            key=lambda x: x.timestamp, 
            reverse=True
        )[:recent_count]
        
        # Combine the texts for the prompt
        memory_texts = [f"- {mem.text}" for mem in recent_memories]
        memory_context = "\n".join(memory_texts)
        
        # Ask LLM to generate reflections
        prompt = f"""
        Based on these recent memories of {self.name}:
        
        {memory_context}
        
        Generate 1-3 high-level insights or reflections that {self.name} might form.
        Each reflection should be a single sentence that captures a pattern, conclusion, or realization. Focus on observations about self or others.
        """
        
        reflections_raw = llm(prompt).strip()
        # Split reflections, handling potential variations in LLM output
        reflections = [r.strip() for r in reflections_raw.split('\n') if r.strip()]

        print(f"Agent {self.name} reflected: {reflections}") # Debug print
        
        # Add reflections as new high-importance memories
        now = time.time()
        for reflection in reflections:
            if reflection:  # Ensure reflection text is not empty
                # Prefix reflection to clarify its nature
                ref_text = f"{self.name} reflected: {reflection.strip('- ')}" 
                # remember() re-scores importance via the LLM; reflections typically rate high
                self.remember(ref_text) # Use remember to handle importance scoring etc.


    def generate_plan(self, day_start: float, day_end: float) -> List[str]:
        """
        Generate a daily plan based on agent's memories and goals.
        Returns a list of planned activities with time slots. (Conceptual)
        """
        print(f"Agent {self.name}: Generating plan...")
        # In a real implementation, this would involve:
        # 1. Retrieving relevant memories about goals, commitments, routines.
        # 2. Prompting the LLM with these memories and constraints.
        # 3. Parsing the LLM output into a structured plan.
        # Implementation omitted for brevity
        return ["7:00 AM: Wake up", "8:00 AM: Open Pharmacy", "5:00 PM: Close Pharmacy"] # Dummy plan


# Example usage
agent = Agent(
    name="John Lin",
    initial_memories=[
        "John Lin is a friendly pharmacist.",
        "John Lin is married to Mei Lin, a professor.",
        "John Lin lives in Smallville and runs the pharmacy."
    ]
)

# Simulate an event
agent.remember("John Lin saw Sam Moore at the park around noon and they chatted briefly about the upcoming town election.")
agent.remember("A customer asked John Lin about allergy medication.")

# Retrieve memories relevant to Sam Moore
top_mem = agent.retrieve_relevant_memories("Sam Moore", limit=3)
print(f"\nRelevant memories about Sam Moore for {agent.name}:")
if top_mem:
    for m in top_mem:
        print(f"- {m.text} (Importance: {m.importance:.2f}, Accessed: {time.strftime('%H:%M:%S', time.localtime(m.last_access))})")
else:
    print("None found.")


# Trigger reflection periodically (e.g., after 10 memories)
# Simulate adding more memories to trigger reflection
for i in range(8):
    agent.remember(f"John Lin completed task {i+1} at the pharmacy.")

if len(agent.memory_stream) >= 10:
     print(f"\nTriggering reflection for {agent.name}...")
     agent.reflect(recent_count=10)

print(f"\nTotal memories for {agent.name}: {len(agent.memory_stream)}")

Applications and Future Directions

So, what’s this good for, beyond academic curiosity?

Simulation for Social System Prototyping

This is the most obvious angle. Instead of guessing how people might react to a new policy, urban design, or social platform feature, you could simulate it first. It’s a kind of digital crystal ball for social science and design:

  • Test environmental tweaks (if the park closes early, does cafe traffic increase?)
  • Model group dynamics (what happens if you introduce agents with conflicting goals?)
  • Prototype game mechanics or community rules before unleashing them on real humans.

Social System Prototyping

Enhanced Agent Cognition

The current architecture is clever, but brittle. Future work needs to shore up the foundations:

  • Better Memory: Hierarchical structures, summarization, maybe even forgetting mechanisms to prevent infinite growth (a toy sketch of the latter follows this list).
  • Meta-Cognition: Agents that can reason about why they believe something or how they reached a conclusion. A long shot, but interesting.
  • Smarter Planning: Handling complex dependencies, long-term goals, and real-world constraints better than current LLMs can.
  • Logical Consistency: Using techniques like chain-of-thought or self-correction to reduce nonsensical actions or contradictory beliefs.
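
As one illustration of what a "forgetting" pass might look like, here's a toy compression routine built on the Agent and MemoryItem classes sketched earlier. It's my own assumption, not a mechanism from the paper: fold stale memories into a single LLM-written summary and drop the originals.

import time

def compress_old_memories(agent, llm, max_age_hours: float = 72.0, min_batch: int = 10) -> None:
    """Toy pruning pass (assumed): summarize memories older than max_age_hours, then discard them."""
    now = time.time()
    stale = [m for m in agent.memory_stream if (now - m.timestamp) / 3600.0 > max_age_hours]
    if len(stale) < min_batch:
        return  # not enough old material to be worth compressing
    bullets = "\n".join(f"- {m.text}" for m in stale)
    summary = llm(f"Summarize these older memories of {agent.name} in 2-3 sentences:\n{bullets}")
    # Drop the originals and keep only the distilled summary as a new memory.
    agent.memory_stream = [m for m in agent.memory_stream if m not in stale]
    agent.remember(f"(summary of older events) {summary.strip()}")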

Grounding and Reality Constraints

Agents making stuff up (hallucinating) is a problem. Tying them more closely to reality could involve:

  • Knowledge Graphs: Constraining beliefs or statements against verified facts.
  • Physics Engines: Basic physical rules could prevent agents from teleporting or walking through walls (unless intended).
  • Fact-Checking Loops: Internal mechanisms to verify generated statements before they become “memories.”

Ethical Considerations and Responsible Development

As these simulations get more believable, the ethical tightrope gets thinner:

  • Parasocial Relationships: Don’t let people get emotionally attached to convincingly fake agents. Clear labeling is crucial.
  • Monitoring and Auditing: Keep logs. Understand what these agents are “learning” and doing to catch problematic emergent behavior (bias amplification, harmful strategies).
  • Human in the Loop: Don’t automate everything. Human oversight in development and deployment, especially for sensitive applications, is non-negotiable.
  • Bias: Ensure agent populations don’t just replicate and amplify existing societal stereotypes baked into the LLM’s training data.

Conclusion: The Future of Digital Simulacra

The Stanford Generative Agents work is a fascinating piece of engineering. It’s a significant step towards creating digital entities that feel persistent. They remember (crudely), they plan (badly), they reflect (superficially), and they interact in ways that generate surprisingly complex social phenomena. It’s the architecture—the interplay of memory, retrieval, reflection, and planning around the core LLM—that creates this illusion of continuity.

Let’s be clear: this isn’t AGI. These agents don’t understand in any deep sense. They are sophisticated mimics running on statistical patterns. The limitations are profound, rooted in the inherent weaknesses of today’s LLMs regarding reasoning, long-term memory, and genuine common sense.

But the trajectory is interesting. It moves beyond simple request-response bots towards systems that maintain state and identity over time. As the underlying models improve, these architectures will likely yield agents with richer internal lives and more complex social interactions.

The potential applications are broad – from smarter NPCs in games to tools for social science research, education, and maybe even therapeutic simulations. But the core achievement here is demonstrating a viable path for structuring LLMs to simulate persistent, socially embedded agents. It’s a compelling glimpse into how we might build more believable ghosts in the digital machine, forcing us to confront, yet again, what intelligence and agency truly mean as we build increasingly complex reflections of ourselves.
