In-Context Learning: The Hidden Superpower of Large Language Models
Introduction
The field of artificial intelligence has witnessed remarkable advancements in recent years, but few capabilities have been as surprising and powerful as In-Context Learning (ICL). This phenomenon—where language models appear to “learn” from nothing more than examples in their prompt—challenges our traditional understanding of machine learning and opens new possibilities for AI applications.
In this deep dive, we’ll explore how modern large language models (LLMs) can adapt to new tasks without changing a single weight, examine the science behind this capability, and discover why it matters for the future of AI. Whether you’re an AI researcher, developer, or simply curious about cutting-edge technology, understanding ICL provides crucial insight into the rapidly evolving landscape of artificial intelligence.
1. Overview and Definition
In-Context Learning (ICL) is the capability of large language models (LLMs) to adapt to a new task simply by reading a prompt. Crucially, the model’s parameters remain unchanged—no additional gradient-based updates or fine-tuning steps occur. Instead, you provide instructions or examples within the prompt (i.e., within a sequence of tokens), and the model processes them at inference time to generate outputs consistent with that guidance.
Two Illustrative “Learning” Curves
When discussing ICL, it helps to contrast two different “learning” processes:
- Conventional Training Curve (Parametric Learning)
  - Spans thousands of optimization steps (e.g., 8,000 or more).
  - Involves gradient updates, backpropagation, and changing internal weights.
  - Results in permanent changes to the model's behavior across all inputs.
- In-Context Learning Curve (Non-Parametric Adaptation)
  - Spans tens to hundreds of tokens (e.g., 50–500 tokens) supplied to the model during inference.
  - Involves no changes to the model's internal weights, yet the model "adapts" to the user-provided examples or instructions.
  - Adaptation is temporary and limited to the current conversation or prompt session.
Visually, the conventional training curve might show a gradual decrease in loss over many training steps, whereas the in-context learning curve focuses on how the model’s predictions improve (its loss decreases) over the tokens in a single prompt.
Types of In-Context Learning
ICL typically manifests in three primary forms:
- Zero-shot learning: The model performs a task with just instructions, no examples.
- One-shot learning: The model sees a single example before attempting a task.
- Few-shot (n-shot) learning: The model observes multiple examples (typically 2-10) before performing a task.
As we progress from zero-shot to few-shot, performance typically improves—though the gains diminish with each additional example due to both context window limitations and diminishing returns.
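To make the three forms concrete, here is a minimal sketch of how the same sentiment task looks at each shot count (the task and examples are illustrative, not from any particular benchmark):

```python
zero_shot = "Classify the sentiment as Positive or Negative.\nInput: I adore this phone.\nOutput:"

one_shot = """Classify the sentiment as Positive or Negative.
Input: The soup was cold and bland.
Output: Negative

Input: I adore this phone.
Output:"""

few_shot = """Classify the sentiment as Positive or Negative.
Input: The soup was cold and bland.
Output: Negative

Input: Best purchase I've made all year.
Output: Positive

Input: I adore this phone.
Output:"""
```

Each variant ends at "Output:" so the model's next tokens serve as its answer; only the number of demonstrations changes.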

2. Why It’s More Than Simple Conditioning
A natural objection is: "Isn't in-context learning just conditioning on text rather than truly learning?" The distinction is subtle but important:
- Mere conditioning: The model’s output distribution is guided by the prompt, but it does not necessarily learn a new concept.
- ICL / Meta-Learning: The model can generalize a newly introduced pattern or relationship on the fly and use it for new inputs within the same prompt context.
For instance, you could provide a large language model with a few examples in a language it has never (or very rarely) seen. Surprisingly, it can still deduce rudimentary syntax or respond in that language—showing an unexpected level of “meta-learning” rather than mere parroting. The phenomenon is especially evident with few-shot prompts, where the model sees just a handful of examples but performs the new task with reasonable competence.
A Concrete Example
Consider this example where we introduce a made-up “flibber” language rule to GPT-4:
In the flibber language, you add "zub" after every word ending with a vowel and "mek" after every word ending with a consonant.
Example 1: "hello" becomes "hellozub"
Example 2: "pizza" becomes "pizzazub"
Now translate: "computer code"
Despite never being trained on “flibber,” a capable LLM can generally infer the pattern and produce “computermek codezub”—a demonstration of genuine on-the-fly pattern recognition rather than simple pattern matching.
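Since the rule is fully specified in the prompt, you can check the model's answer against a reference implementation; a small sketch (simple whitespace tokenization, no punctuation handling):

```python
def flibber(text: str) -> str:
    """Apply the made-up flibber rule: append 'zub' to words ending
    in a vowel and 'mek' to words ending in a consonant."""
    vowels = set("aeiou")
    return " ".join(
        word + ("zub" if word[-1].lower() in vowels else "mek")
        for word in text.split()
    )

print(flibber("computer code"))  # computermek codezub
print(flibber("hello"))          # hellozub
```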
3. ICL as a Benchmark for Meta-Learning
Because ICL requires no parameter updates, it provides a powerful benchmark for how well an LLM can “learn to learn.” Specifically:
- Few-Shot Classification: Supply a handful of labeled examples, then see if the model correctly labels a new input in the same prompt.
- Chain-of-Thought Reasoning: Show the model a sample of reasoning steps ("thinking aloud") for a certain problem, then give a new problem. Does it replicate the reasoning style and arrive at a correct solution?
- Task Instructions or Styles: Provide a short set of instructions and examples in a new domain (e.g., a rarely used text style) and check if the model correctly imitates or generalizes that style.
Large-scale studies (e.g., GPT-3, GPT-4, PaLM, Claude) find that bigger, more sophisticated models handle these tasks with fewer examples and can produce higher-quality outputs—indicating that meta-learning capabilities improve with model scale.
The Relationship Between ICL and Foundation Model Training
ICL ability isn’t an accident. It emerges from the pretraining procedure of modern LLMs, which often involves:
- Next-token prediction: The model learns to predict what comes next in a sequence.
- Massive dataset exposure: It sees countless examples of humans explaining, demonstrating, and following patterns.
- Implicit meta-learning: The pretraining objective indirectly rewards the ability to quickly adapt to patterns within a text sequence.
This suggests that ICL is an emergent property of models trained to predict text in a way that mirrors how humans process information in context.
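The next-token objective itself is compact to express; here is a minimal sketch using the Hugging Face transformers library (a small model is used purely for illustration; passing labels=input_ids makes the library compute the shifted cross-entropy loss):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Input: This meal is terrible.\nOutput: Negative"
ids = tok.encode(text, return_tensors="pt")

with torch.no_grad():
    # labels=ids => average cross-entropy of predicting each next token
    loss = lm(ids, labels=ids).loss
print(f"Next-token loss: {loss.item():.3f} nats/token")
```

Minimizing exactly this quantity over trillions of tokens is what, as a side effect, appears to teach the model to pick up patterns inside a prompt.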
4. Sudden or “Phase Transition” Improvements
Another intriguing phenomenon observed in LLMs is the step-like or phase-transition improvement. As a model is trained across many steps:
- Certain capabilities (e.g., multi-step arithmetic, advanced reasoning, few-shot adaptation) emerge abruptly, rather than slowly climbing in performance.
- You may see a “cliff” in the difference between losses with fewer vs. more tokens in a prompt—like an abrupt drop in the model’s perplexity once enough context tokens are provided.
This is reminiscent of emergent behavior: water freezing to ice at 0 °C is a classic example of a phase transition. LLMs can exhibit surprising leaps in in-context learning at certain scales or training checkpoints.

Scale Thresholds for Emergence
Research suggests that certain ICL capabilities require minimum threshold model sizes:
- Basic few-shot learning may emerge around 1-10B parameters
- Complex reasoning through ICL often requires 100B+ parameters
- Sophisticated meta-learning abilities continue to improve even beyond 1T parameters
These thresholds aren’t absolute but represent general trends observed across multiple model families and architectures.
5. Measuring and Quantifying In-Context Learning
5.1 Loss vs. Tokens (Prompt Length)
One way to measure in-context learning is to track Loss@N: the model's average log-likelihood loss after reading N tokens of a prompt.

- Compare Loss@50 vs. Loss@500 (for example).

If Loss@500 is much lower than Loss@50, it means that by seeing those additional 450 tokens (presumably containing instructions or examples), the model has significantly improved its predictions. A large improvement suggests a strong in-context learning effect.
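In symbols, with x_1, …, x_N the prompt tokens and p_θ the model's fixed next-token distribution, one common formalization (exact definitions vary across papers) is:

$$\text{Loss@}N = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)$$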

5.2 Bits or Nats per Token Gains
Sometimes, researchers quantify these gains using bits or nats of information per token. For instance:
- A half-bit improvement per token might imply that for every token the model reads in the demonstration, it's picking a better next token about half a bit more confidently—a sign of genuine adaptation from the prompt.
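Since cross-entropy losses from most frameworks are reported in nats (natural log), converting to bits is a one-line calculation; a minimal sketch:

```python
import math

def nats_to_bits(nats: float) -> float:
    """Convert a loss measured in nats to bits (log base 2)."""
    return nats / math.log(2)

# Example: a 0.35 nat-per-token improvement is about half a bit per token
print(f"{nats_to_bits(0.35):.2f} bits")  # ~0.50
```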
5.3 Task-Specific Evaluation Metrics
Beyond perplexity and loss-based metrics, researchers also employ task-specific metrics:
- Accuracy improvement: How much does accuracy on a task improve with additional examples?
- Learning rate: How quickly does performance improve with each additional example?
- Transfer ability: Can the model apply patterns learned from one set of examples to semantically similar but structurally different problems?
These measures help disentangle true in-context learning from simple pattern completion.
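To make the first two metrics concrete, here is a minimal sketch of measuring accuracy as a function of shot count; classify_fn is a hypothetical callable that wraps whatever model you are testing:

```python
def accuracy_at_k(classify_fn, labeled_examples, test_set, k):
    """Build a k-shot prompt from labeled_examples, then measure accuracy on test_set.

    classify_fn(prompt: str) -> str is assumed to return the model's predicted label.
    labeled_examples and test_set are lists of (text, label) pairs.
    """
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in labeled_examples[:k])
    correct = 0
    for text, label in test_set:
        prompt = f"{shots}\nInput: {text}\nOutput:"
        if classify_fn(prompt).strip() == label:
            correct += 1
    return correct / len(test_set)

# Sweeping k = 0, 1, 2, ... traces out an in-context "learning curve":
# curve = [accuracy_at_k(classify_fn, examples, tests, k) for k in range(6)]
```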
6. Implementation Example (Code Snippet)
Below is a conceptual snippet (in Python) that illustrates how one might measure the in-context learning improvement in a simple classification scenario. It uses a locally downloadable model rather than an API-only model, so that token-level log-probabilities are directly accessible.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a downloadable model instead of an API model
model_name = "gpt2-large"  # Replace with any downloadable model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def measure_in_context_learning(prompt_examples, new_input, expected_output):
    """
    1. Compute log-likelihood (loss) for the new_input with minimal context
    2. Compute log-likelihood (loss) for the same new_input with the
       'prompt_examples' included
    3. Return the difference
    """
    expected_tokens = tokenizer.encode(expected_output, add_special_tokens=False)

    def avg_loss(context_text):
        """Average negative log-likelihood of expected_tokens given context_text."""
        tokens = tokenizer.encode(context_text, return_tensors="pt")
        loss = 0.0
        with torch.no_grad():
            for i, token_id in enumerate(expected_tokens):
                # Condition on the context plus the expected tokens scored so far
                prefix = tokens if i == 0 else torch.cat(
                    [tokens, torch.tensor([expected_tokens[:i]])], dim=1
                )
                logits = model(prefix).logits[:, -1, :]  # logits for the next token
                log_prob = torch.log_softmax(logits, dim=-1)[:, token_id]
                loss -= log_prob.item()
        return loss / len(expected_tokens)  # average loss per token

    # 1) Minimal context: task instruction only, no demonstrations
    minimal_context = (
        f"Classify the sentiment (Positive or Negative).\nInput: {new_input}\nOutput:"
    )
    loss_minimal = avg_loss(minimal_context)

    # 2) With few-shot examples (in-context examples)
    full_prompt = prompt_examples + f"\n\nInput: {new_input}\nOutput:"
    loss_full = avg_loss(full_prompt)

    return loss_minimal, loss_full, (loss_minimal - loss_full)


if __name__ == "__main__":
    # Some demonstration examples
    few_shot_examples = """Classify the sentiment (Positive or Negative).
Input: I love this product.
Output: Positive
Input: This meal is terrible.
Output: Negative
"""

    new_input_example = "The camera quality is absolutely stunning!"
    expected_output = " Positive"

    loss_min, loss_full, diff = measure_in_context_learning(
        prompt_examples=few_shot_examples,
        new_input=new_input_example,
        expected_output=expected_output,
    )
    print(f"Loss (minimal context): {loss_min:.3f}")
    print(f"Loss (with in-context examples): {loss_full:.3f}")
    print(f"Improvement (loss_min - loss_full): {diff:.3f}")
```
A larger difference (`loss_min - loss_full`) indicates that including the few-shot examples in the prompt leads to a stronger improvement in the model's performance, reflecting in-context learning.
Practical Implementation Considerations
When implementing ICL in production systems, consider these factors:
- Token budget: ICL examples consume your context window, so balance the number of examples against other content.
- Example selection: Choose diverse, representative examples that cover edge cases.
- Example ordering: The order of examples can significantly impact performance; recent examples often have more influence.
- Prompt structure: Clear, consistent formatting improves the model’s ability to extract patterns.
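As a sketch of the token-budget point, one might greedily pack as many examples as fit within a fixed budget; `tokenizer` is assumed to be a Hugging Face tokenizer as in the snippet above, and the formatting is illustrative:

```python
def pack_examples(tokenizer, examples, budget_tokens):
    """Greedily include formatted examples until the token budget is exhausted.

    examples is a list of (input_text, label) pairs.
    """
    packed, used = [], 0
    for text, label in examples:
        block = f"Input: {text}\nOutput: {label}\n"
        n = len(tokenizer.encode(block))
        if used + n > budget_tokens:
            break  # stop before exceeding the context budget
        packed.append(block)
        used += n
    return "".join(packed)

# e.g., reserve ~300 tokens for demonstrations, leaving the rest for the query:
# demos = pack_examples(tokenizer, labeled_pairs, budget_tokens=300)
```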
7. Practical Applications of In-Context Learning
ICL has already transformed how we build AI applications across multiple domains:
7.1 Personalization Without Fine-Tuning
ICL enables personalized experiences without creating custom models:
- Writing style adaptation: Show examples of a user's writing style, and the model adapts its outputs accordingly.
- Preference learning: Include examples of a user's preferences in prompts to guide responses.
- Domain customization: Add domain-specific examples to help models operate in specialized fields like medicine, law, or engineering.
7.2 Rapid Prototyping
Develop and test new AI capabilities without model training:
- Custom classifiers: Create text classifiers for specific needs with just a few examples.
- Format converters: Transform data between formats by demonstrating the conversion pattern.
- Domain-specific tools: Build specialized tools without collecting massive datasets or fine-tuning models.
7.3 Educational Applications
ICL enables adaptive learning experiences:
- Custom tutoring: Models can adapt their teaching approach based on examples of successful explanations for a specific student.
- Learning assessment: Evaluate understanding by asking students to complete patterns established through examples.
- Curriculum development: Generate educational content following demonstrated pedagogical approaches.
7.4 API Design
Many modern AI APIs are designed around ICL principles:
- OpenAI's function calling: Relies on in-context understanding of function specifications.
- Anthropic's Claude templates: Leverages ICL to structure complex multi-turn interactions.
- Multi-modal prompting: Combines text and images to establish patterns the model should follow.
| Domain | Traditional Approach | ICL Approach | Key Benefits |
|---|---|---|---|
| Personalization | Fine-tune model per user | Include user examples in prompt | No model per user, instant updates |
| Content Creation | Train on domain corpus | Show style examples in prompt | Quick adaptation to new styles |
| Classification | Train classifier on labeled data | Provide few examples in prompt | No training pipeline, fast iteration |
| Data Transformation | Build conversion rules | Show input→output examples | Handles edge cases, no explicit rules |
| Education | Static content delivery | Adapt to learning style with examples | Personalized explanations |
| Customer Support | Script responses | Show tone/policy examples | Consistent yet flexible responses |
8. Recent Advances and Future Directions
8.1 Larger Models, Better ICL
Scaling model size usually yields more pronounced in-context learning gains. The GPT-3.5 and GPT-4 families, for example, handle complex tasks with fewer in-prompt examples.
8.2 Prompting Techniques
New prompting strategies—such as chain-of-thought, self-consistency, retrieval-augmented generation (RAG), and instruction prompting—further enhance in-context learning. Larger models often respond positively to these techniques.
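For instance, a chain-of-thought few-shot prompt simply includes worked reasoning in the demonstrations; a minimal, illustrative sketch:

```python
cot_prompt = """Q: A cafe sold 23 coffees in the morning and 17 in the afternoon. How many in total?
A: Morning sales are 23. Afternoon sales are 17. 23 + 17 = 40. The answer is 40.

Q: A shelf holds 12 books per row across 4 rows. How many books fit?
A:"""
# The model is expected to imitate the step-by-step style, e.g.:
# "Each row holds 12 books. There are 4 rows. 12 * 4 = 48. The answer is 48."
```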
8.3 Lightweight Fine-Tuning
Though pure ICL requires no parameter updates, there are hybrid methods like LoRA (Low-Rank Adaptation), QLoRA, and Prefix Tuning that fine-tune only a small subset of parameters or “virtual tokens.” This can bolster in-context learning abilities without full retraining.
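As a brief sketch of what this looks like in practice, here is LoRA applied via the Hugging Face peft library (assuming peft is installed; the target module shown is appropriate for GPT-2-style models, and the hyperparameters are illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2-large")
config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # attention projections in GPT-2-style blocks
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, config)
peft_model.print_trainable_parameters()  # typically well under 1% of all weights
```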
8.4 Emergence Research
Ongoing research focuses on explaining why LLMs exhibit these emergent properties and abrupt changes. Hypotheses include:
- Internal circuits that mimic gradient descent-like processes.
- Sparse "expert" modules that activate at scale.
- Hidden states that reorganize once a critical capacity threshold is reached.
- Attention mechanisms that function as associative memory.
8.5 Context Window Expansion
Recent developments in extending context windows (from 2K tokens to 32K, 128K, and beyond) have significant implications for ICL:
- More examples can be included in a single prompt.
- Models can maintain context across longer interactions.
- Complex patterns requiring more demonstration space become viable.
8.6 Multimodal ICL
The newest frontier is applying ICL principles across modalities:
- Vision-language models adapting to visual styles from examples.
- Audio models learning to mimic speech patterns or accents.
- Code models inferring programming patterns from few examples.
9. Conclusion
In-Context Learning showcases a remarkable capability of modern large language models: they can absorb new information, patterns, or instructions simply by reading additional text—no weight updates needed. This goes beyond naive conditioning to something resembling meta-learning, where the model generalizes from prompt examples to new instances within the same context.
Studying Loss@N tokens, measuring bits/nats per token gained, and experimenting with few-shot prompts all shed light on the extent of a model’s in-context learning prowess. Phase transitions or cliffs in performance further illustrate how these capacities can emerge at scale.
ICL has revolutionized how we interact with AI systems, enabling rapid adaptation to specific needs without the traditional machine learning pipeline of data collection, training, and deployment. It represents a fundamental shift in AI application development—from engineering models to engineering prompts.
As models continue to scale and research progresses, ICL capabilities will likely become even more sophisticated. The boundary between “learning” and “prompting” may continue to blur, potentially creating AI systems that can truly learn new skills from minimal human guidance.
Further Reading
- Language Models are Few-Shot Learners — Brown et al., 2020 (Introduced GPT-3).
- Chain-of-Thought Prompting — Wei et al., 2022 (Demonstrates chain-of-thought reasoning in LLMs).
- Emergent Abilities of Large Language Models — Wei et al., 2022.
- Model-Agnostic Meta-Learning (MAML) — Finn et al., 2017 (Classic meta-learning approach).
- What Learning Algorithm Is In-Context Learning? Investigations With Linear Models — Akyürek et al., 2023.
- Transformers as Algorithms: Generalization and Stability in In-context Learning — Li et al., 2023.
By understanding and experimenting with in-context learning, one gains insight into how large language models can function as versatile, on-the-fly problem solvers—just by reading carefully crafted textual prompts.