An Engineer’s Guide to Fine-Tuning LLMs – Part 1
Understand where fine-tuning fits in the LLM lifecycle, how it compares to prompting or retrieval — and when it’s the right tool for the job.
You're building a Q&A assistant for your internal analytics platform. You start with a powerful base model like Llama 3 or GPT-4o and implement a RAG (Retrieval-Augmented Generation) pipeline to feed it your table schemas and query examples.
It works, but only up to a point. Soon, the cracks appear:
Inconsistent Formatting: The model ignores your specified output structure, failing to consistently generate clean SQL or JSON.
Brittle Prompts: You're constantly tweaking prompts and few-shot examples just to maintain predictable behaviour for slightly different user inputs.
Poor Steerability: The model fails to adhere to specific constraints, like always using JOIN on the correct foreign key or avoiding deprecated functions.
You're no longer just guiding the model; you're fighting its fundamental tendencies.
This isn't a knowledge problem; RAG is already providing the necessary context. This is a behaviour problem.
Prompting is about giving the model better instructions.
RAG is about giving the model better knowledge.
Fine-tuning is about teaching the model a new skill.
Fine-tuning fundamentally changes the model itself. By updating the model's internal weights on your own curated data, you aren't just telling it how to act—you are reshaping it to be the model your product needs. It internalises your specific data structures, response formats, and desired logic.
In Part 1 of this two-part issue, we'll cover:
The Core Mechanism: Understand what fine-tuning actually changes in a model and why it's a completely different tool than prompting or RAG.
The Strategic Context: See where fine-tuning fits in the LLM lifecycle to understand its unique power as an application-layer tool.
The Decision Framework: Get a clear set of green flags for when to commit to fine-tuning, and the critical red flags that tell you it will waste your time.
Let's dive in.
1. The LLM Value Chain: Pre-training, Alignment, and Specialisation
(Recap from Issues 1 & 2)
To understand why fine-tuning exists, you need to see where it fits in the lifecycle of a language model. Let's stitch the layers together.
1.1 Pre-training: The Foundation Layer
This is where models like GPT-4o, Claude, and DeepSeek-R1 begin. Pre-training is unsupervised learning on massive text corpora by predicting the next token in a sequence. No tasks, no labels, just pure pattern recognition at scale. This gives the model general linguistic competence and vast factual knowledge, but it doesn’t know how to follow instructions safely or effectively.
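If you want to see the objective itself, here is a minimal sketch using the Hugging Face transformers API, with a small open checkpoint as a stand-in rather than any provider's actual pipeline. The loss is plain next-token cross-entropy, nothing more.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative only: any small causal LM works; "gpt2" is a stand-in here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Pre-training is just next-token prediction at enormous scale."
inputs = tokenizer(text, return_tensors="pt")

# For causal LMs, passing labels=input_ids makes the library shift them internally
# and compute cross-entropy between each position and the *next* token.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # the next-token prediction loss that pre-training minimises
```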
1.2 Alignment: Teaching the Model to Behave
A raw, pre-trained model needs to be made useful. The alignment phase teaches it to:
Respond to instructions clearly
Refuse unsafe prompts
Format answers correctly
Align with human expectations
This is typically done via a suite of fine-tuning techniques:
Supervised Fine-Tuning (SFT): Training on instruction-response pairs to teach task formats.
RLHF & DPO: Optimising outputs based on human preference pairs to guide model behaviour.
Technically, fine-tuning means updating a pre-trained model’s weights with new data, which is exactly what happens here. So when you use GPT-4o, you’re already using a model that has been fine-tuned by its provider for general helpfulness, but not for your specific domain or product.
1.3 Application-Layer Fine-Tuning: The Missing Layer
This brings us to the layer you control. As a builder, you don’t control pre-training or the base alignment objectives. But you can fine-tune the model again, on your own domain, tasks, and constraints, to make it work inside your system.
This issue is about that layer.
It's the same mechanism, applied for a different purpose: fine-tuning not to teach the model how to behave in general, but to make it behave your way.
2. What Fine-Tuning Really Is
In the last section, we mapped out the LLM value chain. Now, it's time to get precise about the layer you control. What exactly is fine-tuning, what makes it fundamentally different from prompting or retrieval, and why is it the only method that changes the model itself?
2.1 The Systems View: Context vs. Weights
Every large language model operates on two distinct layers of information:
The Weights: Billions of learned parameters that encode what the model has internalised during training.
The Context: Everything you pass in at runtime—your prompt, few-shot examples, retrieved documents, and tool specs.
Most techniques, like prompting and RAG, operate solely on the context. They inject information at inference time, guiding the model without changing its internal structure.
Fine-tuning is different. It directly updates the weights. It is another round of training on your specific data, using the same architecture and backpropagation, to permanently reshape how the model generalises. You're not just showing it new information; you're changing how it "thinks."
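To see the distinction in code, here is a minimal sketch, using a small open checkpoint as a stand-in and an invented prompt. Prompting and RAG only change the tensors you feed in for a single request; a fine-tuning step changes the parameters themselves.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # small stand-in checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Context-level adaptation (prompting, RAG): only the input changes, and only for this request.
prompt = "Schema: users(id, name, status)\nWrite SQL to list the names of active users.\nSQL:"
inputs = tokenizer(prompt, return_tensors="pt")
completion = model.generate(**inputs, max_new_tokens=32)

# Weight-level adaptation (fine-tuning): the parameters themselves are updated,
# so the change persists for every future request. (Here the input doubles as the
# training target purely to illustrate the gradient step.)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()
optimizer.step()
```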
2.2 What Fine-Tuning Actually Changes
A model's weights do more than store facts; they define how it interprets prompts, routes reasoning, and prioritises outputs under uncertainty. Fine-tuning shifts these core decision boundaries.
It teaches the model to handle variations in phrasing without needing extra prompt instructions.
It makes your desired output structure a native behaviour, not just a format to be followed.
It embeds your domain-specific logic directly into the model, reducing reliance on bulky few-shot examples.
The resulting model doesn't just respond differently—it reasons differently. This is what makes fine-tuning a structural change, not a surface patch.
2.3 Why Provider Fine-Tuning Isn't Enough
The off-the-shelf model you use has already been fine-tuned by its provider for general safety and helpfulness. But that is not the same as optimising for specific, high-stakes business logic, such as:
Emitting outputs that are bound to a strict API contract.
Correctly resolving ambiguous terms unique to your product's schema.
Refusing to access sensitive data in your internal systems, even when asked politely.
These are not general instruction-following problems; they are domain adaptation problems that depend on your data and success criteria.
Prompting can mask these issues, and retrieval can bridge knowledge gaps, but only fine-tuning can encode this specialised behaviour directly into the model's weights.
It has a higher upfront cost, but it's the only method that makes the model truly yours.
3. The Adaptation Spectrum: Prompting → Retrieval → Fine-Tuning
Fine-tuning isn't the first tool you reach for, nor should it be. In the application layer, there's a spectrum of techniques engineers use to adapt a language model's behaviour.
Each one solves a different kind of problem. If you don't understand what each technique does, you risk overengineering the solution or misdiagnosing the issue entirely.
This section lays out what each technique changes, and where it falls short.
3.1 Prompting: Leveraging What the Model Already Knows
Prompting is the cheapest lever and, in many cases, a surprisingly effective one. Even base models that haven't been aligned show signs of instruction-following. Why? Because of how they're pretrained.
When you pretrain on large corpora of internet text, code, and tutorials, the model learns patterns like:
"Q:" followed by "A:"
Function definitions followed by documentation
"How do I..." questions followed by step-by-step instructions
So when you write a clean instruction, even to an unaligned model, you're not asking it to be helpful. You're asking it to complete a statistical pattern it's already seen thousands of times.
This became widely visible with GPT-3, where studies showed that even without gradient updates, zero-shot and few-shot prompts produced relevant answers.
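For example, a few-shot prompt like the sketch below (schema and queries invented for illustration) works precisely because it sets up a pattern the model has already seen countless times during pre-training.

```python
# A few-shot prompt is just pattern completion: the model continues the Q:/A: pattern
# it absorbed from pre-training data, so the likely continuation is another SQL query.
few_shot_prompt = """\
Q: List the names of all active users.
A: SELECT name FROM users WHERE status = 'active';

Q: Count orders placed in 2024.
A: SELECT COUNT(*) FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

Q: Show the ten most recent signups.
A:"""
```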
The Catch: It's a runtime illusion. Prompting gives the model hints; it doesn't rewire its understanding. Failure modes appear quickly: outputs are sensitive to phrasing, behaviour breaks with inconsistent inputs, and structured output like JSON or SQL is brittle.
3.2 Retrieval: Giving the Model New Knowledge at Runtime
When the model lacks facts about your company, product, or recent policies, prompting isn't enough. This is where retrieval augmentation comes in.
You build a retrieval layer that fetches relevant documents and injects them into the model's context window at inference time. This doesn't change the model's weights, but it changes what it sees before generating a response.
This works exceptionally well when:
You need factual accuracy grounded in private or internal data.
The query is specific to your business, user, or task.
You want outputs to reflect recent changes without retraining the model.
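Mechanically, the whole loop can be sketched in a few lines. The embed() function here is a hypothetical stub standing in for whatever embedding model and vector store you actually use, and the documents are invented.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stub: in a real system this calls an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

documents = [
    "Table orders uses customer_id as a foreign key into customers.",
    "The refund policy changed on 2024-06-01.",
    "Deprecated: the legacy_reports view will be removed next quarter.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    # Cosine similarity between the query and each document, highest first.
    q = embed(query)
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "How do I join orders to customers?"
context = "\n".join(retrieve(query))

# The weights never change; only what the model sees at inference time changes.
prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
```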
The Catch: Retrieval provides facts, not skills. The model can see the right information but still uses the wrong tone, fails to follow complex domain-specific logic, and can't reliably generate the structured JSON or API calls your system requires.
Retrieval fills knowledge gaps, but it doesn't modify how the model uses that knowledge.
3.3 Alignment: Making the Model Generally Helpful
The model you use in production isn't just a pretrained model; it's a pretrained and aligned model.
After pre-training, providers run more fine-tuning phases using SFT and RLHF/DPO to make the model follow instructions and prefer helpful, safe outputs. This is provider-controlled, general-purpose fine-tuning.
Alignment datasets include common-sense Q&A, summarisation, and dialogue. What they don't include:
Your business logic
Your tool APIs
Your schema constraints
Your data formats
The Catch: Alignment optimises the model to be broadly helpful, but not precisely correct. The model will try to please the user, but won’t obey your internal logic.
3.4 Fine-Tuning: When Behaviour Has to Live Inside the Model
Prompting and retrieval adapt the model from the outside, but don't change how it generalises or shift its internal representations. That's what fine-tuning does.
When you fine-tune, even partially with parameter-efficient methods, you are updating the weights. You're modifying the statistical pathways that govern interpretation and retraining the model's instincts.
Done well, fine-tuning enables:
Structural consistency: Always outputting a tool call in your exact JSON schema, even if the user request is vague or phrased differently.
Domain-native reasoning: Applying internal business rules or specialised jargon as if they were part of the base training data.
Prompt-free formatting: You don’t need 20-shot prompts to guide output behaviour; it’s embedded in the weights.
Latency and context savings: No need to re-explain your needs every time; the model starts closer to your expected output by default.
Where alignment seeks to make a model broadly helpful, fine-tuning makes it specifically reliable.
It’s heavier. It needs infrastructure.
But it’s the only lever that actually changes the model’s behaviour permanently, across phrasing, across prompts, across tasks.
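As a taste of what Part 2 covers in depth, here is a minimal parameter-efficient sketch using the peft library, with a small open checkpoint as a stand-in. Even though only a tiny fraction of the parameters train, the adapted weights change what the model computes on every request.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in base model; in practice this is whatever checkpoint you deploy.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# LoRA trains small adapter matrices instead of all weights, but the adapted
# attention projections still alter every forward pass the model makes.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in this architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```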
4. When Is Fine-Tuning the Right Answer?
You've seen the full adaptation stack. This brings us to the question every builder eventually hits:
"Should we tune this model, or are we just not prompting it well enough?"
Here’s the simplest way to think about it: Fine-tuning is what you do when your prompts have plateaued, retrieval has hit its limits, and the model's general alignment doesn't transfer to your specific task.
The failure isn't about what the model sees—it's about how it behaves.
Let's walk through the five scenarios where fine-tuning becomes a real solution, not just a nice-to-have.
4.1 You Need to Reliably Enforce a Strict Structure
You're trying to get the model to generate structured data, like JSON objects or API calls. Your prompt engineering is sophisticated, using few-shot examples and schema definitions.
But you still face constant issues:
Brittleness: The model works for common cases but breaks on novel inputs.
Inconsistency: It occasionally hallucinates fields, uses the wrong data types, or generates unparsable syntax.
This is a classic limitation of in-context learning. The model is mimicking patterns, not learning the underlying grammar of your format.
Fine-tuning makes that structure a native capability. By training on hundreds or thousands of valid examples, the model internalises the schema's rules.
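Here is a hedged sketch of what such training records might look like in the chat-style JSONL format many fine-tuning APIs accept. The tool schema and field names are invented for illustration.

```python
import json

# Each record pairs a natural-language request with the exact structured output we expect.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Pull revenue by region for last quarter."},
            {"role": "assistant", "content": json.dumps({
                "tool": "run_query",
                "arguments": {"metric": "revenue", "group_by": "region", "period": "last_quarter"},
            })},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "how many signups yesterday?"},
            {"role": "assistant", "content": json.dumps({
                "tool": "run_query",
                "arguments": {"metric": "signups", "group_by": None, "period": "yesterday"},
            })},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```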
4.2 You Need to Master a Complex, Nuanced Task
Your task requires a nuanced understanding that goes beyond general knowledge. For example, classifying user feedback into a highly specific, multi-level taxonomy with over 50 labels.
With prompting, you hit a performance ceiling because:
The context window can't hold enough examples to cover all the edge cases.
The model struggles to differentiate between closely related labels.
This isn't a prompting failure; it's a task-understanding gap. The model lacks a deep representation of your specific problem space.
Fine-tuning closes this gap by training the model on thousands of labelled examples, allowing it to learn the subtle patterns and decision boundaries of your unique taxonomy.
4.3 Your Domain's Semantics Are Underrepresented
Your application operates in a specialised domain like bioinformatics, patent law, or financial compliance. The base model, trained on the general internet, doesn't understand your domain's jargon, entities, and relationships. It treats critical keywords as noise.
While RAG can retrieve documents, it doesn't teach the model how to interpret them like an expert.
Fine-tuning teaches the model the semantics of your domain. It learns that in a medical context, for instance, certain terms have specific implications that are absent in general text.
4.4 You Need to Transfer a Capability to Another Language
Your product works flawlessly in English, but its performance collapses for your German or Japanese users. The model can translate text, but it fails to apply complex skills, like its ability to follow multi-step instructions or use tools, in the new language.
This is a capability-transfer gap. The complex reasoning skills learned during alignment are often English-centric and don't automatically generalise.
Fine-tuning on multilingual examples of your specific task is the most effective way to transfer a model's core capabilities across languages.
4.5 You Need to Distill a Capability into a Cheaper, Faster Model
Your prototype, built on a state-of-the-art model like GPT-4o, works perfectly but is too slow and expensive for production scale.
When you try to switch to a smaller, faster open-source model, performance plummets.
This is a common use case for distillation, a form of fine-tuning where you train a smaller "student" model on the outputs of a larger "teacher" model.
You use the teacher model to generate a high-quality, synthetic dataset of thousands of examples for your specific task.
Then, you fine-tune the smaller student model on this data.
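A hedged sketch of that first step, with call_teacher() as a hypothetical stub for whichever teacher API you use:

```python
import json

def call_teacher(prompt: str) -> str:
    """Hypothetical stub: in practice this calls your large teacher model (e.g. GPT-4o)."""
    raise NotImplementedError

# 1. Use the teacher to label a large pool of real or synthetic task inputs.
task_inputs = ["Pull revenue by region for last quarter.", "how many signups yesterday?"]

with open("distillation_train.jsonl", "w") as f:
    for user_input in task_inputs:
        teacher_output = call_teacher(user_input)   # high-quality target from the big model
        record = {"messages": [
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": teacher_output},
        ]}
        f.write(json.dumps(record) + "\n")

# 2. Fine-tune the smaller student model on distillation_train.jsonl,
#    using the same SFT machinery as any other fine-tuning run.
```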
5. When Fine-Tuning Will Waste Your Time
Fine-tuning isn't cheap in time, cost, or complexity. And unlike a prompt tweak, its effects aren't easily reversible. Before committing to a fine-tuning project, you must ensure you aren't trying to solve the wrong kind of problem.
Here are four red flags that should tell you: "This isn't a fine-tuning job—or at least, not yet."
5.1 You Don't Have Enough High-Quality Data
Fine-tuning works by adjusting the model's weights based on the patterns in your examples. If those examples are too few, too noisy, or don't accurately represent the behaviour you want, you will be training the model on garbage.
Common failure modes include:
Overfitting: With too few examples (e.g., < 500-1000 for many tasks), the model doesn't learn the general skill you're trying to teach. It just memorises the specific examples, failing to generalise to new, unseen inputs.
Noise Amplification: If your data is noisy or inconsistently labelled, the model will faithfully bake that noise directly into its weights, making its behaviour erratic and unreliable.
A critical first step is to validate your task with In-Context Learning (ICL), or few-shot prompting. If you can't get reasonable performance by showing the model a handful of high-quality examples in a prompt, fine-tuning on a larger set of those same examples is unlikely to succeed.
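One way to run that check, sketched with a hypothetical generate() stub and an invented tool schema:

```python
import json

def generate(prompt: str) -> str:
    """Hypothetical stub: call whatever model you are evaluating."""
    raise NotImplementedError

few_shot_prefix = (
    "Convert the request into a JSON tool call.\n"
    "Request: Pull revenue by region for last quarter.\n"
    'Call: {"tool": "run_query", "arguments": {"metric": "revenue", "group_by": "region"}}\n'
)

held_out = [
    {"request": "how many signups yesterday?", "expected_tool": "run_query"},
    # ...a few dozen representative cases
]

valid = 0
for case in held_out:
    output = generate(few_shot_prefix + f'Request: {case["request"]}\nCall:')
    try:
        call = json.loads(output)
        valid += int(isinstance(call, dict) and call.get("tool") == case["expected_tool"])
    except json.JSONDecodeError:
        pass

print(f"Few-shot validity: {valid}/{len(held_out)}")
# If this number is poor on a handful of clean examples, fine-tuning on more of the same
# examples is unlikely to rescue the task.
```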
5.2 Your Task Relies on Volatile, Fast-Changing Information
Fine-tuning is for teaching the model a persistent skill or style. It is the wrong tool for teaching knowledge that changes frequently.
If your use case relies on information that is updated daily or hourly, such as:
Product inventory and pricing
Live news or event tracking
Real-time user data
...then a fine-tuned model will be perpetually out of date. Each update would require retraining and redeploying the model, creating a massive maintenance overhead.
This is a classic use case for Retrieval-Augmented Generation (RAG). RAG is designed to provide the model with fresh, volatile information at inference time, separating the model's stable "skills" from the dynamic "knowledge" it needs to act on.
5.3 You Have Strict Latency or Deployment Constraints
Fine-tuning can impact your deployment architecture and performance budget. Even parameter-efficient methods like LoRA require the entire base model's weights to be loaded into GPU memory for inference.
This presents a problem if you are deploying in a constrained environment:
Edge/Mobile: A 7-billion parameter model, even a quantised one, can be too large and slow for on-device applications with tight memory and latency budgets.
High-Throughput Services: If your service needs to handle thousands of requests per second with low latency, the cost of serving a large, fine-tuned model can be prohibitive.
Before fine-tuning, profile your target deployment environment. Sometimes, clever prompting on a smaller, faster model or using a highly optimised API is a better solution than deploying a fine-tuned model yourself.
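A rough back-of-envelope check on the weights alone (ignoring KV cache, activations, and framework overhead) often settles the question quickly:

```python
# Rough weight-memory estimate for a 7B-parameter model, weights only.
params = 7e9
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.1f} GiB")
# fp16/bf16: ~13.0 GiB, int8: ~6.5 GiB, int4: ~3.3 GiB
```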
5.4 The Task Demands Immediate Controllability
Fine-tuning hardcodes behaviour. If the model develops a flawed or harmful tendency, you can't fix it with a quick prompt change. The flaw is now part of the model's core logic, and fixing it requires a new training and deployment cycle.
This creates a critical trade-off: power vs. control.
While fine-tuning is powerful for teaching domain knowledge (as we saw in 4.3), it is a high-risk choice for applications where the ability to immediately patch, steer, or disable a behaviour is the top priority.
This is especially true for:
Customer-facing applications with direct brand exposure.
Domains where new failure modes can emerge rapidly (e.g., new types of scams or adversarial attacks).
In these scenarios, keeping logic in the prompt or in external rule engines gives you more immediate control. You can update a guardrail prompt in seconds; a fine-tuning run takes hours or days. If your operational posture requires instant intervention, fine-tuning may introduce an unacceptable level of response lag.
Wrap-up: The Strategic Edge
Fine-tuning isn’t a magic wand. It’s a deliberate architectural decision to trade runtime controllability for baked-in, specialised behaviour.
You now have the framework to make that trade. You know when to pull that lever—to enforce a strict structure, master a complex task, or distill a capability into a more efficient model. And just as importantly, you know the red flags that signal it's the wrong choice, saving you from wasting time, money, and effort.
But strategy is only half the equation.
In Part 2, we go from “Should I fine-tune?” to “How do I fine-tune well?”
We’ll cover the engineering reality of execution:
The Modern Methods: How to navigate the trade-offs between full fine-tuning, the efficiency of PEFT (LoRA/QLoRA), and the scale of distillation.
The Production Pipeline: A step-by-step walkthrough of the end-to-end workflow, from curating the perfect dataset to safe deployment and monitoring.
The Technical Risks: A guide to the real-world failure modes—like catastrophic forgetting and silent regressions—and the engineering discipline required to prevent them.
If Part 1 was about when to do it, Part 2 is about how to do it without breaking everything.