The Religion of Scale
The scaling-laws era has inculcated a simple catechism: to build better models, you must build bigger models. This belief has fueled an arms race measured in megawatts and billions of dollars, where progress is benchmarked by parameter counts and training runs that consume enough electricity to power a small city. Brute force works, but it is breathtakingly inefficient: it swings the same sledgehammer at every nut, no matter how small.
What if we could achieve similar gains simply by thinking harder, not bigger? What if intelligence is less about the size of the brain and more about how it allocates its energy?
This is the provocation behind Compute-Optimal LLMs (COLLM), a line of work from Google DeepMind and UC Berkeley that challenges the religion of scale. It reframes inference not as a fixed ritual, but as a triage problem: given a fixed compute budget, how do you spend it wisely on a batch of prompts with wildly varying difficulty? The answer is to stop treating every query as equal, and instead allocate FLOPs where they actually matter.
By dynamically routing compute from trivial queries to genuinely hard ones, this approach matches – and occasionally surpasses – the accuracy of models an order of magnitude larger. It’s a compelling argument that the next frontier of AI performance may be algorithmic cleverness, not just another zero on the capital-expenditure line.
1 · The Folly of Uniform Compute
The standard trick for boosting accuracy is best-of-N sampling: throw compute at the problem by generating N parallel responses and picking the best one via some voting mechanism. This is effective, but also profoundly dumb. It’s compute-uniform, meaning a simple query like “What is the capital of France?” burns the same budget as a complex one like “Factor the quartic polynomial over $\mathbb{Q}$.”
This wastes cycles on the easy stuff and starves the hard problems that could actually benefit from more “thought.” COLLM’s insight is simple but powerful: predict which prompts are difficult and spend the extra compute only on those.
It replaces the blunt instrument of best-of-N with a scalpel. Instead of a fixed N for all prompts, it uses a dynamic policy where N is a function of a prompt-specific difficulty score $d$:

$$
N(d) \;=\;
\begin{cases}
1, & d \le \tau \\[2pt]
\min\!\big(N_{\max},\ \lceil \alpha\,(d - \tau)^{\beta} \rceil\big), & d > \tau
\end{cases}
$$

Here, $\tau$ is a difficulty threshold, and $\alpha$ and $\beta$ shape how aggressively the system ramps up compute. These aren’t arbitrary magic numbers; they are learned from a single, cheap pilot run on a validation set.
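In code, that schedule is a small pure function. The sketch below assumes the piecewise form above; the default values for `tau`, `alpha`, `beta`, and `n_max` are illustrative placeholders, not numbers from the paper.

```python
import math

def sample_budget(d: float, tau: float = 0.5, alpha: float = 8.0,
                  beta: float = 2.0, n_max: int = 32) -> int:
    """Map a difficulty score d to a sample count N(d).

    Prompts at or below the threshold get a single greedy decode; harder
    prompts get a polynomially growing budget, capped at n_max.
    """
    if d <= tau:
        return 1
    return min(n_max, math.ceil(alpha * (d - tau) ** beta))
```

At serving time this would be called once per prompt, right after computing the difficulty proxy described in the next section.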
2 · The Anatomy of Compute-Optimal Decoding
2.1 · Sniffing Out Difficulty
The trick is to estimate a prompt’s difficulty without burning the very budget you’re trying to save. It turns out two lightweight heuristics provide a remarkably strong signal:
Name | Cost | Intuition |
---|---|---|
Prefix Entropy | A single forward pass on the first few tokens. | If the model is confused by the start of a prompt (high entropy), it’s a good bet the rest will be a struggle. |
Greedy First-Pass NLL | Run a fast greedy decode and check its log-likelihood. | If the model’s own best guess looks improbable to itself, the prompt is almost certainly difficult. |
The entropy variant is trivially cheap. On a modern TPU, it adds less than a millisecond of latency.
```python
import torch
import torch.nn.functional as F

def prefix_entropy(model, input_ids, k=32):
    """Return a per-prompt difficulty score: mean next-token entropy over the first k positions."""
    with torch.no_grad():
        logits = model(input_ids)[:, :-1, :]   # assumes the model returns raw logits
    logp = F.log_softmax(logits, dim=-1)       # log-probabilities over the vocabulary
    ent = -(logp.exp() * logp).sum(-1)         # per-position entropy
    return ent[:, :k].mean(-1)                 # shape: (batch,)
```
Because both proxies scale linearly with input length, they remain practical even for models with enormous context windows – such as Gemini 1.5 or Claude 3.0.
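For completeness, the NLL proxy can be sketched in the same style. This is an illustrative version, not code from the paper; it keeps the convention that `model(input_ids)` returns raw logits and scores a prompt by how improbable the model finds its own greedy continuation.

```python
import torch
import torch.nn.functional as F

def greedy_first_pass_nll(model, input_ids, max_new_tokens=64):
    """Difficulty proxy: mean negative log-likelihood of a short greedy continuation."""
    generated = input_ids
    nlls = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(generated)[:, -1, :]          # logits for the next token
            logp = F.log_softmax(logits, dim=-1)
            next_tok = logp.argmax(-1, keepdim=True)     # greedy choice
            nlls.append(-logp.gather(-1, next_tok))      # NLL of that choice
            generated = torch.cat([generated, next_tok], dim=-1)
    return torch.cat(nlls, dim=-1).mean(-1)              # shape: (batch,)
```

Note that it costs a full greedy decode, which is why prefix entropy is the cheaper default.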
2.2 · Dynamic Budgeting
Armed with a difficulty score, the policy allocates its budget:
- Easy (low-entropy) prompts? Get a fast, cheap greedy decode ($N = 1$).
- Medium prompts? Get a handful of samples ($1 < N < N_{\max}$).
- Hard prompts? Unleash the full power, sampling up to a specified cap ($N = N_{\max}$).
A Fjords‑style token bucket ensures the batch never exceeds its overall FLOP budget-you can think of it as a per‑request compute quota that refills every wall‑clock second.
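A toy version of that budget guard might look like the following; the refill policy and FLOP accounting are illustrative stand-ins, not the paper's actual scheduler.

```python
import time

class FlopBucket:
    """Token bucket over FLOPs: requests draw from a budget that refills each second."""
    def __init__(self, flops_per_second: float, capacity: float):
        self.rate = flops_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def try_spend(self, flops: float) -> bool:
        """Refill based on elapsed wall-clock time, then spend if affordable."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if flops <= self.tokens:
            self.tokens -= flops
            return True
        return False  # caller should fall back to a cheaper decode (e.g. N = 1)
```

A caller would try to reserve `N * flops_per_sample` (a hypothetical per-sample cost estimate) before launching best-of-N, and fall back to a greedy decode when the bucket is empty.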
2.3 · The Theory Lends Credence
The authors formalize the problem as minimizing expected error under a fixed compute budget.
Using Lagrange duality, they show that the optimal sampling strategy should theoretically grow quadratically with the difficulty score. This provides a satisfying theoretical justification for the empirically derived formula, a rare case of theory catching up to practice.
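To make the shape of that argument concrete, here is a schematic version of the optimization. It is a paraphrase rather than the authors' exact derivation, and the diminishing-returns error model at the end is an assumption chosen to illustrate where quadratic growth can come from.

```latex
% Allocate N_i samples to prompt i (difficulty d_i) under a total compute budget C:
\min_{N_1,\dots,N_B}\; \sum_i \varepsilon(d_i, N_i)
\qquad \text{s.t.} \qquad \sum_i c\, N_i \le C

% Lagrangian and stationarity condition:
\mathcal{L} = \sum_i \varepsilon(d_i, N_i) + \lambda \Big( \sum_i c\, N_i - C \Big),
\qquad
\frac{\partial \varepsilon}{\partial N}\big(d_i, N_i^{\star}\big) = -\lambda c

% Interpretation: sample each prompt until the marginal error reduction per FLOP
% is equal across the batch. Under an assumed error model such as
% \varepsilon(d, N) \approx \exp(-N / d^2), solving the condition gives
% N^{\star} growing on the order of d^2 (up to logarithmic factors),
% which matches the quadratic schedule above.
```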
3 · The Proof Is In The Numbers
3.1 · The Gauntlet
- Models: A range of models including PaLM-2-S (12B), a 7B model, and an 8B Llama.
- Tasks: A mix of reasoning benchmarks (GSM8K, MATH) and a real-world customer support corpus.
- The Constraint: All methods were given the exact same total FLOP budget per batch.
- Baselines: Greedy (N=1), Temperature 0.8 (N=1), Best‑of‑32, and a Mixture‑of‑Experts (MoE‑4×) distilled model.
3.2 · Results
Method | GSM8K | MATH | Mini‑F2 | Chat‑Ops | FLOPs / Q (Avg) |
---|---|---|---|---|---|
Greedy | 52.3 | 31.8 | 74.1 | 87.0 | 1× |
Temp 0.8 | 57.9 | 33.4 | 75.6 | 87.6 | 1× |
Best-of-32 | 71.4 | 45.2 | 79.3 | 88.9 | 32× |
MoE-4× | 69.8 | 43.1 | 78.0 | 88.4 | 4× |
COLLM (ours) | 73.0 | 46.5 | 79.8 | 89.0 | 7.9× |
The takeaway is stark. The adaptive COLLM strategy beats the brute-force best-of-32 approach while using only a quarter of the compute. It even edges out a larger, more complex Mixture-of-Experts (MoE) model on the GSM8K benchmark. Thinking harder, it seems, really does beat thinking bigger.
3.3 · Latency and Throughput
In production, average performance is less important than tail latency. Because easy prompts are processed almost instantly, COLLM slashes the 95th-percentile latency from 960 ms (best-of-32) to 580 ms. Median throughput climbs by 41%. You get better answers, faster, for less.
3.4 · Ablations and Sensitivity Analyses
- Prefix vs. NLL – Entropy alone recovers ~90 % of the full gain. Combining both yields diminishing returns.
- Temperature – High sampling temperatures inject enough diversity that the returns to extra sampling flatten sooner.
- Schedule parameters – The slope parameter $\alpha$ matters more than the exponent $\beta$; too aggressive a slope over‑spends on borderline prompts.
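Those schedule parameters don't need anything fancy to fit. Below is a hedged sketch of the kind of single pilot-run calibration the post alludes to: a small grid search over $\tau$, $\alpha$, $\beta$ on a validation set, reusing the `sample_budget` helper sketched earlier. The scorer, evaluator, and budget arguments are placeholders for whatever a real deployment would use.

```python
import itertools

def calibrate_schedule(val_prompts, difficulty_fn, accuracy_fn, max_total_samples,
                       taus=(0.3, 0.5, 0.7), alphas=(4, 8, 16), betas=(1, 2, 3)):
    """Grid-search (tau, alpha, beta) for the best validation accuracy under a sample budget."""
    scores = [difficulty_fn(p) for p in val_prompts]          # one cheap proxy pass per prompt
    best, best_acc = None, -1.0
    for tau, alpha, beta in itertools.product(taus, alphas, betas):
        budgets = [sample_budget(d, tau, alpha, beta) for d in scores]
        if sum(budgets) > max_total_samples:                  # infeasible under the budget
            continue
        acc = accuracy_fn(val_prompts, budgets)               # run best-of-N(d) and score it
        if acc > best_acc:
            best, best_acc = (tau, alpha, beta), acc
    return best
```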
4 · A Deployment Case Study
This isn’t just a lab curiosity. A Fortune 100 fintech company integrated COLLM into its customer-support summarization system. The results over two weeks were unambiguous:
- Compute cost dropped by 63% compared to their old best-of-8 baseline.
- The rate at which human agents accepted the AI’s summaries increased by 2.7 percentage points.
- Estimated CO₂ emissions fell by 48%, translating to about 3 tonnes of CO₂e saved per month.
The implementation took less than 100 lines of code.
5 · The Blind Spots
The approach isn’t a panacea. A few obvious failure modes remain:
- Fragile Heuristics: If the difficulty estimator is fooled (e.g., by an adversarially crafted prompt), the system will under-allocate compute and return a low-quality answer. The proxy is good, but not infallible.
- Multimodal Blindness: What’s the “token entropy” for an image? The current heuristics are text-specific. Applying this to vision-language models will require new ways to estimate difficulty.
- Static Budgets: The current model assumes a fixed budget. A truly dynamic system would link its compute allocation to real-time server load and upstream traffic forecasts.
- Fairness Risks: A system optimized for efficiency could learn to short-change non-standard dialects or low-resource languages if their prompts appear “high entropy.” Fairness auditing is non-negotiable.
6 · Conclusion
COLLM is a powerful reminder that raw computational power isn’t the only axis of progress. The era of mindless scaling, of treating every problem with the same brute force, is giving way to a more nuanced approach. By re-imagining inference as a game of intelligent resource allocation, we can unlock accuracy gains previously reserved for nation-state-level training runs, but on the hardware we already own.
True intelligence, whether biological or artificial, has always been about the efficient use of limited resources. It seems our machines are finally starting to learn the same lesson. You might not need a bigger model; you might just need a smarter one.
Enjoyed the write‑up? Let’s chat on Twitter or connect on LinkedIn.