
February 20, 2024

In-Context Learning: The Hidden Superpower of Large Language Models


Introduction

The field of artificial intelligence has witnessed remarkable advancements in recent years, but few capabilities have been as surprising and powerful as In-Context Learning (ICL). This phenomenon—where language models appear to “learn” from nothing more than examples in their prompt—challenges our traditional understanding of machine learning and opens new possibilities for AI applications.

In this deep dive, we’ll explore how modern large language models (LLMs) can adapt to new tasks without changing a single weight, examine the science behind this capability, and discover why it matters for the future of AI. Whether you’re an AI researcher, developer, or simply curious about cutting-edge technology, understanding ICL provides crucial insight into the rapidly evolving landscape of artificial intelligence.

1. Overview and Definition

In-Context Learning (ICL) is the capability of large language models (LLMs) to adapt to a new task simply by reading a prompt. Crucially, the model’s parameters remain unchanged—no additional gradient-based updates or fine-tuning steps occur. Instead, you provide instructions or examples within the prompt (i.e., within a sequence of tokens), and the model processes them at inference time to generate outputs consistent with that guidance.

Two Illustrative “Learning” Curves

When discussing ICL, it helps to contrast two different “learning” processes:

  1. Conventional Training Curve (Parametric Learning)
    • Spans thousands of optimization steps (e.g., 8,000 or more).
    • Involves gradient updates, backpropagation, and changing internal weights.
    • Results in permanent changes to the model’s behavior across all inputs.
  2. In-Context Learning Curve (Non-Parametric Adaptation)
    • Spans tens to hundreds of tokens (e.g., 50–500 tokens) supplied to the model during inference.
    • Involves no changes to the model’s internal weights, yet the model “adapts” to the user-provided examples or instructions.
    • Adaptation is temporary and limited to the current conversation or prompt session.

Visually, the conventional training curve might show a gradual decrease in loss over many training steps, whereas the in-context learning curve focuses on how the model’s predictions improve (its loss decreases) over the tokens in a single prompt.

Types of In-Context Learning

ICL typically manifests in three primary forms:

  • Zero-shot learning: The model performs a task with just instructions, no examples.
  • One-shot learning: The model sees a single example before attempting a task.
  • Few-shot (n-shot) learning: The model observes multiple examples (typically 2-10) before performing a task.

As we progress from zero-shot to few-shot, performance typically improves, though the gains taper off with each additional example: demonstrations consume the context window, and each new example tends to add less information than the last. The prompt sketches below illustrate the three settings.
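
For concreteness, here is what the settings might look like on a simple sentiment task (the wording is illustrative, not from any benchmark):

Zero-shot:
Classify the sentiment (Positive or Negative).
Input: The battery died within an hour.
Output:

Few-shot (2-shot):
Classify the sentiment (Positive or Negative).
Input: I love this product.
Output: Positive
Input: This meal is terrible.
Output: Negative
Input: The battery died within an hour.
Output:

One-shot simply includes a single demonstration pair between the instruction and the query.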

Figure: Types of In-Context Learning

2. Why It’s More Than Simple Conditioning

A natural skepticism is: “Isn’t in-context learning just conditioning on text rather than truly learning?” The distinction is subtle but important:

  • Mere conditioning: The model’s output distribution is guided by the prompt, but it does not necessarily learn a new concept.
  • ICL / Meta-Learning: The model can generalize a newly introduced pattern or relationship on the fly and use it for new inputs within the same prompt context.

For instance, you could provide a large language model with a few examples in a language it has never (or very rarely) seen. Surprisingly, it can still deduce rudimentary syntax or respond in that language—showing an unexpected level of “meta-learning” rather than mere parroting. The phenomenon is especially evident with few-shot prompts, where the model sees just a handful of examples but performs the new task with reasonable competence.

A Concrete Example

Consider this example where we introduce a made-up “flibber” language rule to GPT-4:

In the flibber language, you add "zub" after every word ending with a vowel and "mek" after every word ending with a consonant.

Example 1: "hello" becomes "hellozub"
Example 2: "pizza" becomes "pizzazub"

Now translate: "computer code"

Despite never having seen “flibber” in training, a capable LLM can generally infer the rule and produce “computermek codezub”—a demonstration of genuine on-the-fly rule induction rather than retrieval of memorized patterns.

3. ICL as a Benchmark for Meta-Learning

Because ICL requires no parameter updates, it provides a powerful benchmark for how well an LLM can “learn to learn.” Specifically:

  1. Few-Shot Classification
    Supply a handful of labeled examples, then see if the model correctly labels a new input in the same prompt.
  2. Chain-of-Thought Reasoning
    Show the model a sample of reasoning steps (“thinking aloud”) for a certain problem, then give a new problem. Does it replicate the reasoning style and arrive at a correct solution? (A prompt sketch follows this list.)
  3. Task Instructions or Styles
    Provide a short set of instructions and examples in a new domain (e.g., a rarely used text style) and check if the model correctly imitates or generalizes that style.
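
As a sketch of the chain-of-thought case above, a prompt can pair one worked solution with a new problem (the arithmetic here is illustrative):

Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A library holds 120 books and receives 4 boxes of 25 books each. How many books does it hold now?
A:

A model that has “learned” the style should produce intermediate steps before answering 220.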

Large-scale studies (e.g., GPT-3, GPT-4, PaLM, Claude) find that bigger, more sophisticated models handle these tasks with fewer examples and can produce higher-quality outputs—indicating that meta-learning capabilities improve with model scale.

The Relationship Between ICL and Foundation Model Training

ICL ability isn’t an accident. It emerges from the pretraining procedure of modern LLMs, which often involves:

  • Next-token prediction: The model learns to predict what comes next in a sequence.
  • Massive dataset exposure: It sees countless examples of humans explaining, demonstrating, and following patterns.
  • Implicit meta-learning: The pretraining objective indirectly rewards the ability to quickly adapt to patterns within a text sequence.

This suggests that ICL is an emergent property of models trained to predict text in a way that mirrors how humans process information in context.

4. Sudden or “Phase Transition” Improvements

Another intriguing phenomenon observed in LLMs is the step-like or phase-transition improvement. As a model is trained across many steps:

  • Certain capabilities (e.g., multi-step arithmetic, advanced reasoning, few-shot adaptation) emerge abruptly, rather than slowly climbing in performance.
  • You may see a “cliff” in the difference between losses with fewer vs. more tokens in a prompt—like an abrupt drop in the model’s perplexity once enough context tokens are provided.

This is reminiscent of emergent behavior: water freezing to ice at 0 °C is a classic example of a phase transition. LLMs can exhibit surprising leaps in in-context learning at certain scales or training checkpoints.

Figure: Phase Transitions in LLM Capabilities

Scale Thresholds for Emergence

Research suggests that certain ICL capabilities require minimum threshold model sizes:

  • Basic few-shot learning may emerge around 1-10B parameters
  • Complex reasoning through ICL often requires 100B+ parameters
  • Sophisticated meta-learning abilities continue to improve even beyond 1T parameters

These thresholds aren’t absolute but represent general trends observed across multiple model families and architectures.

5. Measuring and Quantifying In-Context Learning

5.1 Loss vs. Tokens (Prompt Length)

One way to measure in-context learning is to track:

  • Loss@N: the model’s average log-likelihood loss after reading the first N tokens of a prompt.
  • Compare Loss@50 vs. Loss@500 (for example).

If Loss@500 is much lower than Loss@50, it means that by seeing those additional 450 tokens (presumably containing instructions or examples), the model has significantly improved its predictions. A large improvement suggests a strong in-context learning effect.
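
Written out under the standard next-token objective, Loss@N is just the average per-token negative log-likelihood over the first N prompt tokens:

\text{Loss@}N = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})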

Figure: Loss vs. Prompt Length

5.2 Bits or Nats per Token Gains

Sometimes, researchers quantify these gains using bits or nats of information per token. For instance:

  • A 0.4–0.5 bit improvement per token implies that each demonstration token the model reads makes its next-token predictions roughly half a bit more confident. Half a bit corresponds to assigning about 2^0.5 ≈ 1.4 times higher probability to the correct token, a sign of genuine adaptation from the prompt.

5.3 Task-Specific Evaluation Metrics

Beyond perplexity and loss-based metrics, researchers also employ task-specific metrics:

  • Accuracy improvement: How much does accuracy on a task improve with additional examples?
  • Learning rate: How quickly does performance improve with each additional example?
  • Transfer ability: Can the model apply patterns learned from one set of examples to semantically similar but structurally different problems?

These measures help disentangle true in-context learning from simple pattern completion.
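
As a minimal sketch of the first two metrics, one can measure accuracy against the number of demonstrations k. Everything below is illustrative: the model choice, the classify helper, and the toy sentiment data are assumptions, not a standard benchmark.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-large")  # any causal LM would do
lm = AutoModelForCausalLM.from_pretrained("gpt2-large")
lm.eval()

LABELS = [" Positive", " Negative"]  # leading space matters for GPT-2 tokenization

def classify(prompt: str) -> str:
    """Hypothetical helper: pick the label whose first token is most likely next."""
    ids = tok.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = lm(ids).logits[0, -1]
    scores = torch.stack([next_logits[tok.encode(l, add_special_tokens=False)[0]] for l in LABELS])
    return LABELS[int(scores.argmax())].strip()

def accuracy_at_k(demos, test_set, k):
    """Accuracy when the first k demonstrations are prepended to every test prompt."""
    shots = "".join(f"\nInput: {x}\nOutput: {y}\n" for x, y in demos[:k])
    correct = 0
    for x, y in test_set:
        prompt = f"Classify the sentiment (Positive or Negative).\n{shots}\nInput: {x}\nOutput:"
        correct += classify(prompt) == y
    return correct / len(test_set)

demos = [("I love this product.", "Positive"), ("This meal is terrible.", "Negative"),
         ("What a wonderful day!", "Positive"), ("The service was awful.", "Negative")]
tests = [("The camera quality is stunning!", "Positive"), ("It broke on day one.", "Negative")]

for k in [0, 1, 2, 4]:
    # The "learning rate" of ICL: how much accuracy each extra shot buys
    print(f"k={k}: accuracy={accuracy_at_k(demos, tests, k):.2f}")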

6. Implementation Example (Code Snippet)

Below is a conceptual Python snippet that illustrates how one might measure the in-context learning improvement in a simple classification scenario. It uses a locally downloadable model (via Hugging Face transformers) rather than an API-hosted one, so you can run it end to end.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a locally downloadable model rather than an API-hosted one
model_name = "gpt2-large"  # Replace with any causal LM from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def average_nll(context: str, expected_output: str) -> float:
    """
    Average negative log-likelihood (per token) of `expected_output`
    given `context`, computed in a single forward pass.
    """
    context_ids = tokenizer.encode(context, return_tensors="pt")
    target_ids = tokenizer.encode(expected_output, add_special_tokens=False, return_tensors="pt")
    input_ids = torch.cat([context_ids, target_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # The logits at position t predict the token at position t + 1, so the
    # predictions for the target tokens start at the last context position.
    n_context = context_ids.shape[1]
    pred_logits = logits[:, n_context - 1 : -1, :]
    log_probs = torch.log_softmax(pred_logits, dim=-1)
    target_log_probs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)
    return -target_log_probs.mean().item()

def measure_in_context_learning(prompt_examples, new_input, expected_output):
    """
    1. Compute the loss for `expected_output` given only minimal context.
    2. Compute the loss for the same output with `prompt_examples` included.
    3. Return both losses and their difference.
    """
    # 1) Minimal context: task instruction only, no demonstrations
    minimal_context = f"Classify the sentiment (Positive or Negative).\nInput: {new_input}\nOutput:"
    loss_minimal = average_nll(minimal_context, expected_output)

    # 2) Full context: few-shot demonstrations prepended
    full_prompt = prompt_examples + f"\nInput: {new_input}\nOutput:"
    loss_full = average_nll(full_prompt, expected_output)

    return loss_minimal, loss_full, loss_minimal - loss_full

if __name__ == "__main__":
    # Some demonstration examples
    few_shot_examples = """Classify the sentiment (Positive or Negative).
Input: I love this product.
Output: Positive

Input: This meal is terrible.
Output: Negative
"""

    new_input_example = "The camera quality is absolutely stunning!"
    expected_output = " Positive"

    loss_min, loss_full, diff = measure_in_context_learning(
        prompt_examples=few_shot_examples,
        new_input=new_input_example,
        expected_output=expected_output
    )

    print(f"Loss (minimal context): {loss_min:.3f}")
    print(f"Loss (with in-context examples): {loss_full:.3f}")
    print(f"Improvement (loss_min - loss_full): {diff:.3f}")

  • A larger difference (loss_min - loss_full) indicates that including the few-shot examples in the prompt makes the expected output substantially more likely to the model, reflecting in-context learning.

Practical Implementation Considerations

When implementing ICL in production systems, consider these factors:

  • Token budget: ICL examples consume your context window, so balance the number of examples against other content (see the sketch after this list).
  • Example selection: Choose diverse, representative examples that cover edge cases.
  • Example ordering: The order of examples can significantly impact performance; recent examples often have more influence.
  • Prompt structure: Clear, consistent formatting improves the model’s ability to extract patterns.
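
The token-budget point can be handled mechanically. Below is a minimal sketch, assuming a Hugging Face tokenizer like the one loaded in the snippet above; the budget value and helper name are hypothetical:

def fit_examples(examples, base_prompt, tokenizer, budget=1024):
    """Keep as many demonstrations as fit within `budget` tokens, preferring recent ones."""
    kept = []
    for ex in reversed(examples):  # walk from most recent to oldest
        candidate = "\n\n".join([base_prompt, *reversed([*kept, ex])])
        if len(tokenizer.encode(candidate)) > budget:
            break
        kept.append(ex)
    return "\n\n".join([base_prompt, *reversed(kept)])  # restore chronological order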

7. Practical Applications of In-Context Learning

ICL has already transformed how we build AI applications across multiple domains:

7.1 Personalization Without Fine-Tuning

ICL enables personalized experiences without creating custom models:

  • Writing style adaptation: Show examples of a user’s writing style, and the model adapts its outputs accordingly.
  • Preference learning: Include examples of a user’s preferences in prompts to guide responses.
  • Domain customization: Add domain-specific examples to help models operate in specialized fields like medicine, law, or engineering.

7.2 Rapid Prototyping

Develop and test new AI capabilities without model training:

  • Custom classifiers: Create text classifiers for specific needs with just a few examples.
  • Format converters: Transform data between formats by demonstrating the conversion pattern (see the example after this list).
  • Domain-specific tools: Build specialized tools without collecting massive datasets or fine-tuning models.
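
To make the format-converter idea concrete, a prompt can demonstrate the mapping directly (this record-to-JSON example is illustrative):

Convert each record to JSON.

Input: name: Ada Lovelace, born: 1815
Output: {"name": "Ada Lovelace", "born": 1815}

Input: name: Alan Turing, born: 1912
Output: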

7.3 Educational Applications

ICL enables adaptive learning experiences:

  • Custom tutoring: Models can adapt their teaching approach based on examples of successful explanations for a specific student.
  • Learning assessment: Evaluate understanding by asking students to complete patterns established through examples.
  • Curriculum development: Generate educational content following demonstrated pedagogical approaches.

7.4 API Design

Many modern AI APIs are designed around ICL principles:

  • OpenAI’s function calling: Relies on in-context understanding of function specifications.
  • Anthropic’s Claude templates: Leverage ICL to structure complex multi-turn interactions.
  • Multi-modal prompting: Combines text and images to establish patterns the model should follow.

| Domain | Traditional Approach | ICL Approach | Key Benefits |
|---|---|---|---|
| Personalization | Fine-tune model per user | Include user examples in prompt | No model per user, instant updates |
| Content Creation | Train on domain corpus | Show style examples in prompt | Quick adaptation to new styles |
| Classification | Train classifier on labeled data | Provide few examples in prompt | No training pipeline, fast iteration |
| Data Transformation | Build conversion rules | Show input→output examples | Handles edge cases, no explicit rules |
| Education | Static content delivery | Adapt to learning style with examples | Personalized explanations |
| Customer Support | Script responses | Show tone/policy examples | Consistent yet flexible responses |

8. Recent Advances and Future Directions

8.1 Larger Models, Better ICL

Scaling model size usually yields more pronounced in-context learning gains. The GPT-3.5 and GPT-4 families, for example, handle complex tasks with fewer examples in the prompt.

8.2 Prompting Techniques

New prompting strategies—such as chain-of-thought, self-consistency, retrieval-augmented generation (RAG), and instruction prompting—further enhance in-context learning. Larger models often respond positively to these techniques.
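
As a hedged sketch of the self-consistency idea: sample several chain-of-thought completions and majority-vote over their final answers. The generate argument here is an assumed helper that returns one sampled completion; it is not a specific library API.

from collections import Counter

def self_consistent_answer(prompt, generate, n_samples=5):
    """Majority vote over the final answer lines of several sampled completions."""
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt)  # sampling with temperature > 0 assumed
        answers.append(completion.strip().splitlines()[-1])  # treat the last line as the answer
    return Counter(answers).most_common(1)[0][0]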

8.3 Lightweight Fine-Tuning

Though pure ICL requires no parameter updates, there are hybrid methods like LoRA (Low-Rank Adaptation), QLoRA, and Prefix Tuning that fine-tune only a small subset of parameters or “virtual tokens.” This can bolster in-context learning abilities without full retraining.
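
For illustration, here is a minimal LoRA sketch using the peft library; the target module name "c_attn" is specific to GPT-2-style models, and the hyperparameters are placeholders:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2-large")
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["c_attn"])  # GPT-2 attention projection
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable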

8.4 Emergence Research

Ongoing research focuses on explaining why LLMs exhibit these emergent properties and abrupt changes. Hypotheses include:

  • Internal circuits that mimic gradient descent-like processes.
  • Sparse “expert” modules that activate at scale.
  • Hidden states that reorganize once a critical capacity threshold is reached.
  • Attention mechanisms that function as associative memory.

8.5 Context Window Expansion

Recent developments in extending context windows (from 2K tokens to 32K, 128K, and beyond) have significant implications for ICL:

  • More examples can be included in a single prompt.
  • Models can maintain context across longer interactions.
  • Complex patterns requiring more demonstration space become viable.

8.6 Multimodal ICL

The newest frontier is applying ICL principles across modalities:

  • Vision-language models adapting to visual styles from examples.
  • Audio models learning to mimic speech patterns or accents.
  • Code models inferring programming patterns from few examples.

9. Conclusion

In-Context Learning showcases a remarkable capability of modern large language models: they can absorb new information, patterns, or instructions simply by reading additional text—no weight updates needed. This goes beyond naive conditioning to something resembling meta-learning, where the model generalizes from prompt examples to new instances within the same context.

Studying Loss@N tokens, measuring bits/nats per token gained, and experimenting with few-shot prompts all shed light on the extent of a model’s in-context learning prowess. Phase transitions or cliffs in performance further illustrate how these capacities can emerge at scale.

ICL has revolutionized how we interact with AI systems, enabling rapid adaptation to specific needs without the traditional machine learning pipeline of data collection, training, and deployment. It represents a fundamental shift in AI application development—from engineering models to engineering prompts.

As models continue to scale and research progresses, ICL capabilities will likely become even more sophisticated. The boundary between “learning” and “prompting” may continue to blur, potentially creating AI systems that can truly learn new skills from minimal human guidance.


Further Reading

  1. “Language Models are Few-Shot Learners,” Brown et al., 2020 (introduced GPT-3).
  2. “Chain-of-Thought Prompting,” Wei et al., 2022 (demonstrates chain-of-thought reasoning in LLMs).
  3. “Emergent Abilities of Large Language Models,” Wei et al., 2022.
  4. “Model-Agnostic Meta-Learning (MAML),” Finn et al., 2017 (classic meta-learning approach).
  5. “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models,” Akyürek et al., 2023.
  6. “Transformers as Algorithms: Generalization and Stability in In-context Learning,” Li et al., 2023.

By understanding and experimenting with in-context learning, one gains insight into how large language models can function as versatile, on-the-fly problem solvers—just by reading carefully crafted textual prompts.
