
February 9, 2025

“Thinking” LLMs on a Shoestring (S1): Fine-Tuning and Budget Forcing for Reasoning

Key Takeaways

  • A thousand hard questions are worth more than a million easy ones. Quality, not just quantity, is the catalyst for reasoning.
  • Parameter-efficient fine-tuning (LoRA) wrests sophisticated model adaptation from the realm of GPU-barons, making it tractable on a single node.
  • Imposing a “thought budget” at inference forces the model to deliberate, trading marginal latency for a significant leap in correctness.
  • The entire playbook executes for under $100. This isn’t a research curiosity; it’s a repeatable, production-ready recipe.
  • The next frontier is policy optimization: using RL to refine the model’s emergent reasoning pathways into hardened cognitive habits.

Introduction

The reasoning capabilities of off-the-shelf Large Language Models are often a convincing illusion, a parlor trick of pattern-matching that breaks under pressure. They are prone to confident, premature conclusions.

But as shown by recent work like s1: Simple test-time scaling, this shallow cognition can be deepened. With a surgical application of high-quality supervision and a clever constraint at inference, we can force a model to move from mere reaction to structured deliberation.

This article lays out the exact methodology I used to awaken a more robust reasoning faculty in Qwen 2.5-7B:

  1. Curate a small but potent dataset through a ruthless triage.
  2. Fine-tune with LoRA to keep the compute budget sane.
  3. Budget-force the model at inference, compelling it to spend tokens on thought, not fluff.

Everything that follows is designed to run on commodity hardware. The code is minimal but robust.


1 · Culling the Herd: Forging a 1k Dataset from 59k Questions

Reasoning is bound by the quality of the data it’s trained on. The s1 pipeline starts with a sprawling corpus of 59,029 questions and subjects it to a brutal triage, distilling it to 1,000 exemplars by enforcing three filters:

(Figure: the three-stage triage flowchart: 59K raw questions → quality filter → difficulty filter → diversity sampling → 1K exemplars.)

  1. Quality. We first discard any malformed items: those with formatting errors, missing answers, or other structural defects.
  2. Difficulty. The remaining questions are posed to two baseline checkpoints (Qwen 7B & 32B). If either model solves a question correctly, it’s deemed too easy and culled; we are selecting for failure (see the sketch after this list).
  3. Diversity. To avoid narrow overfitting, we perform stratified sampling across 50 distinct subject domains (math, law, biology, etc.), ensuring the final dataset possesses intellectual breadth.
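The difficulty filter is the only step that needs real machinery. A minimal sketch of the idea, where generate_answer() and is_correct() are hypothetical stand-ins for whatever inference and grading harness you already have:

# Keep only questions that BOTH baseline checkpoints fail to solve.
# generate_answer() and is_correct() are hypothetical placeholders.
def too_easy(item: dict, baseline_models) -> bool:
    for model in baseline_models:                     # e.g. the Qwen 7B and 32B baselines
        prediction = generate_answer(model, item["prompt"])
        if is_correct(prediction, item["answer"]):
            return True                               # a baseline solved it -> cull
    return False

hard_pool = [q for q in quality_filtered if not too_easy(q, baseline_models)]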

The resulting dataset, simplescaling/s1K_tokenized on Hugging Face Hub, contains JSON-lines with ‘prompt’ and ‘chosen’ fields, primed for causal LM fine-tuning.


2 · LoRA: Surgical Adaptation Without the Brute Force

While the s1 paper used a formidable 16 \times \textrm{H100} cluster, the objective here was to prove this capability was not an artifact of brute-force compute. The entire procedure had to be tractable on a single GPU. For this, we turn to LoRA (Low-Rank Adaptation). LoRA freezes the vast majority of the pre-trained model’s weights and injects small, trainable low-rank matrices into each layer. This dramatically shrinks the optimization problem.
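For intuition, the bookkeeping is simple (illustrative numbers, not measured from this run). Each adapted projection W_0 \in \mathbb{R}^{d \times k} stays frozen and is augmented with a low-rank update:

h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)

Only A and B train, i.e. r(d + k) parameters per projection. Taking d = k = 4096 and r = 16 as a round example gives 2 \cdot 4096 \cdot 16 = 131{,}072 trainable weights against the \approx 16.8M frozen ones, roughly 0.8%.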

(Figure: LoRA schematic: the frozen base weights W_0 with small trainable low-rank matrices A and B injected alongside each targeted projection.)

The implementation with Hugging Face libraries is direct:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
import torch

# Define model and dataset identifiers
MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
DATASET_ID = "simplescaling/s1K_tokenized"

# Initialize the tokenizer, handling the pad token carefully.
# Qwen2.5's tokenizer already ships a designated pad token, so the fallback below
# rarely fires; reusing the EOS token is a common, safe default when it does.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
if tokenizer.pad_token is None:
    if tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token
    else:
        # Fallback for models without a default pad or eos token.
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})

dataset = load_dataset(DATASET_ID, split="train")

# Define a formatting function to structure data into the ChatML format Qwen expects.
def format_example(example):
    # This function converts prompt/chosen pairs into a single text field
    # adhering to the model's required conversation structure.
    return {
        "text": f"<|im_start|>user\n{example['prompt']}<|im_end|>\n<|im_start|>assistant\n{example['chosen']}<|im_end|>"
    }

formatted_dataset = dataset.map(format_example)

# Configure 4-bit quantization (QLoRA) to reduce memory footprint.
# bfloat16 is the compute dtype of choice for modern GPUs (Ampere architecture or newer).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2", # Use Flash Attention 2 for efficiency
    device_map="auto" # Automatically map model layers across available devices
)
model = prepare_model_for_kbit_training(model)

# LoRA configuration: defines which layers to adapt and with what rank.
# Target modules are specific to the Qwen2 architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

# Use a completion-only data collator to ensure loss is calculated only on the
# assistant's response, not the user prompt. This is crucial for effective tuning.
data_collator = DataCollatorForCompletionOnlyLM(
    instruction_template="<|im_start|>user",
    response_template="<|im_start|>assistant\n",
    tokenizer=tokenizer,
    mlm=False # This is a Causal LM, not Masked LM.
)

# Training arguments
OUTPUT_DIR = "qwen2.5-7b-s1k-lora-checkpoints"
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32, # Achieves an effective batch size of 32
    num_train_epochs=5,
    learning_rate=1e-6, # Deliberately conservative; LoRA recipes often use higher rates (e.g. 1e-4)
    bf16=True,
    logging_dir=f"{OUTPUT_DIR}/logs",
    logging_steps=10,
    save_strategy="epoch",
)

# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,              # Pass the configured tokenizer explicitly
    train_dataset=formatted_dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=getattr(model.config, "max_position_embeddings", 2048),
    data_collator=data_collator,
    args=training_args
)

trainer.train()
trainer.model.save_pretrained(f"{OUTPUT_DIR}/final_adapter") # Save only the trained adapter

Peak reserved memory: \approx 35 GB on a 40 GB GPU. Wall-clock time: \approx 3h 12m for five epochs. Cloud cost: \approx $60 on GCP (A100 40GB).

The resulting LoRA adapters are trivial to store and deploy, weighing in at a mere \approx 200MB.
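Before wiring the adapter into vLLM, a quick smoke test with plain transformers + peft is worthwhile. A minimal sketch (the prompt and generation settings here are illustrative, not from the original run):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the frozen base model and attach the trained adapter on top.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "qwen2.5-7b-s1k-lora-checkpoints/final_adapter")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

prompt = "<|im_start|>user\nIf 3x + 7 = 22, what is x?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))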


3 · Budget Forcing: Imposing Deliberation

Most “chain-of-thought” prompting techniques are either crude hard-coded hints (“think in n steps”) or inefficient rejection sampling (generate k responses, pick the best).

Budget forcing is a more elegant constraint. It caps the total number of “thinking” tokens the model can spend, but allows the model to determine how to allocate that budget across intermediate steps of deliberation.

The Logic in Pseudocode

(Sequence diagram: the budget-forcing loop: generate a chunk of thought, decrement the token budget, inject a “Wait” continuation cue, repeat, then emit the final answer.)
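Distilled from the vLLM implementation that follows, the control flow fits in a dozen lines. Here generate is an abstract single-call LLM interface returning (text, token_count); the concrete version below wires it to vLLM:

# The budget-forcing loop in miniature; `generate` stands in for one LLM call.
def budget_forced_answer(generate, prompt: str, max_think_tokens: int = 1000,
                         num_forced_waits: int = 2, chunk_cap: int = 512) -> str:
    prompt += "<|im_start|>think\n"                  # open the thinking block
    budget = max_think_tokens
    for step in range(num_forced_waits):
        if budget <= 0:
            break
        text, used = generate(prompt, max_tokens=min(budget, chunk_cap))
        prompt += text
        budget -= used
        if step < num_forced_waits - 1:              # force another round of thought
            prompt += " Wait. Let me continue thinking.\n<|im_start|>think\n"
    prompt += "\n<|im_start|>assistant\n"            # switch to the user-facing answer
    text, _ = generate(prompt, max_tokens=512)
    return text.strip()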

A Minimal vLLM Implementation

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Initialize the vLLM engine with LoRA enabled.
# The base model is loaded, and LoRA adapters are applied on-the-fly per request.
llm_engine = LLM("Qwen/Qwen2.5-7B-Instruct",
                  enable_lora=True,
                  max_lora_rank=16,
                  tensor_parallel_size=1,
                 )

# Path to the trained LoRA adapter directory.
LORA_ADAPTER_PATH = "qwen2.5-7b-s1k-lora-checkpoints/final_adapter"

# Create a LoRA request object to identify the adapter for inference.
lora_request = LoRARequest(
    lora_name="s1_reasoning_adapter",
    lora_int_id=1, # A unique integer ID for this adapter
    lora_local_path=LORA_ADAPTER_PATH
)

tokenizer = llm_engine.get_tokenizer()
im_end_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
if im_end_token_id == tokenizer.unk_token_id:
    raise ValueError("<|im_end|> token not found in tokenizer. Check model configuration.")

def budget_force(question: str, max_total_think_tokens: int = 1000, num_forced_waits: int = 2) -> str:
    system_prompt = "You are a helpful AI assistant. Please think step by step to answer the question."
    
    # The prompt buffer starts with system instructions and the user's question,
    # then cues the model to begin its internal monologue.
    current_prompt_str = (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n<|im_start|>think\n"
    )
    
    remaining_think_budget = max_total_think_tokens
    
    for i in range(num_forced_waits):
        if remaining_think_budget <= 0:
            break

        # Cap each thought-generation step to a reasonable length.
        tokens_for_this_step = min(remaining_think_budget, 512)

        # Generate a chunk of thought.
        # This implementation re-sends the entire prompt buffer. While simple,
        # a production system would optimize this by managing the KV cache directly.
        outputs = llm_engine.generate(
            [current_prompt_str],
            SamplingParams(
                max_tokens=tokens_for_this_step,
                stop_token_ids=[im_end_token_id], # Stop if the model concludes its turn prematurely
                temperature=0.0, # Greedy decoding for deterministic reasoning
            ),
            lora_request=lora_request
        )
        
        generated_output = outputs[0].outputs[0]
        thought_chunk_text = generated_output.text
        num_tokens_in_chunk = len(generated_output.token_ids)

        current_prompt_str += thought_chunk_text
        remaining_think_budget -= num_tokens_in_chunk

        # If more thinking is forced, add a continuation cue.
        if i < num_forced_waits - 1:
            current_prompt_str += " Wait. Let me continue thinking.\n<|im_start|>think\n"

    # Close the thinking phase and cue the final, user-facing answer. Doing this
    # after the loop guarantees the transition even if the budget runs out early.
    current_prompt_str += "\n<|im_start|>assistant\n"

    # Generate the final answer based on the accumulated thought process.
    final_outputs = llm_engine.generate(
        [current_prompt_str],
        SamplingParams(
            max_tokens=512,
            stop_token_ids=[im_end_token_id],
            temperature=0.0
        ),
        lora_request=lora_request
    )
    
    final_answer = final_outputs[0].outputs[0].text
    return final_answer.strip()

A Toy Example

>>> budget_force("How many 'r's are in the word raspberry?")
"After counting carefully I find 3 occurrences of the letter 'r'."

Without forcing, the same checkpoint prematurely answered “2”. The imposed deliberation corrected the error.

Compute trivia: the total extra latency is roughly proportional to num_forced_waits. Because each call re-sends the same growing prefix, enabling vLLM’s automatic prefix caching (enable_prefix_caching=True when constructing the LLM) lets those passes reuse the shared KV cache, keeping them far cheaper than generating independent samples from scratch.


4 · The Empirical Result

Setting                              | Accuracy (AGIEval subset) | Median response tokens (final answer) | Avg. latency (CUDA ms)
Base Qwen 7B                         | 36%                       | 12                                    | 75
LoRA-s1                              | 54%                       | 14                                    | 78
LoRA-s1 + budget forcing (2× Wait)   | 63%                       | 54                                    | 92

The data is unambiguous. Forcing deliberation buys a +9 percentage point gain in accuracy on this benchmark. The cost is a marginal +14 ms of latency and a more elaborate final response. In most contexts where correctness is valued, this is a bargain.
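For context on how numbers like these are produced: the measurement is just the budget_force routine run over a held-out set of question/answer pairs and graded. A sketch, with eval_set and the exact-match grader as hypothetical stand-ins for the AGIEval harness:

def exact_match(prediction: str, gold: str) -> bool:
    # Hypothetical grader: a real harness would normalize answers more carefully.
    return gold.strip().lower() in prediction.strip().lower()

def evaluate(eval_set, max_think_tokens: int = 1000, num_waits: int = 2) -> float:
    correct = 0
    for item in eval_set:                             # item = {"question": ..., "answer": ...}
        answer = budget_force(item["question"],
                              max_total_think_tokens=max_think_tokens,
                              num_forced_waits=num_waits)
        correct += exact_match(answer, item["answer"])
    return correct / len(eval_set)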


5 · Constraints and Trajectories

  • LoRA Rank is consequential. A rank of 8 proved insufficient in my tests; rank 16 was the sweet spot for this setup. A higher rank, perhaps 32, might further close the performance gap to full fine-tuning, albeit at a higher memory cost.
  • Dataset formatting is not universal. The s1K dataset is formatted for Qwen’s ChatML structure. Adapting this methodology to other models like Llama 3 would necessitate reformatting the data to match their specific conversation templates.
  • The next layer is Reinforcement Learning. A preference- or policy-optimization pass (DPO, PPO, or RAFT) over the transcripts generated by budget forcing could further refine the model’s reasoning chains, making them more stable and efficient (a starting sketch follows this list).
  • Scaling data with self-instruction. For larger projects, the creation of reasoning-focused training data can be scaled by using a generator model to write its own “thinking” traces, a form of automated curriculum development.
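On the RL point, the natural first move is to turn budget-forced rollouts into preference pairs. A sketch under loose assumptions: questions, is_correct(), and plain_generate() (the same checkpoint without forcing) are hypothetical stand-ins, and the resulting rows follow the prompt/chosen/rejected schema that trl’s DPOTrainer expects:

# Build preference pairs: keep cases where forced deliberation fixes an answer
# that the unforced rollout gets wrong.
preference_rows = []
for item in questions:                                # item = {"question": ..., "answer": ...}
    forced = budget_force(item["question"], num_forced_waits=2)
    plain = plain_generate(item["question"])          # hypothetical: same checkpoint, no forcing
    if is_correct(forced, item["answer"]) and not is_correct(plain, item["answer"]):
        preference_rows.append({
            "prompt": item["question"],
            "chosen": forced,      # the deliberate, correct transcript
            "rejected": plain,     # the premature, wrong one
        })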

Conclusion

With a thousand hard questions, a dash of LoRA, and a clever inference loop, we can teach a 7B-parameter model to think more deeply, all for less than the cost of a team dinner. For practitioners building real-world systems, the message should be clear:

Reasoning isn’t just for giant models or giant budgets.

Surgical data curation, parameter-efficient tuning, and intelligent constraints are the tools that unlock state-of-the-art performance without state-of-the-art expense.

The path forward lies in further refining these adapted models, likely through RL, to make their emergent cognitive abilities more robust and reliable.
