NeoSage

The Prompt Lifecycle Every AI Engineer Should Know

Dimple Sharma — Sat, 07 Feb 2026 15:58:54 GMT

Many people think “prompt engineering” means finding clever ways to talk to ChatGPT. And sure, if you’re turning your vacation photos into Ghibli art, that’s fine.

But when you’re building production systems that talk to LLMs through APIs? That’s a completely different problem.

Here’s the pattern: Your new AI-powered support bot is a hit. For a week, it’s the star of your engineering retrospective. Then the 3 AM PagerDuty alert fires. A silent model update broke accuracy. Someone found a prompt injection vulnerability. The cloud bill tripled. And the failing prompt? It’s a raw string, hardcoded in a dozen files, and no one can tell which version was even running.

If this sounds familiar, you’ve hit the most common production landmine: treating prompts like throwaway strings instead of mission-critical infrastructure.

prompt = f"""Extract the user's name, order ID, and the specific issue from this support ticket. Format the output as a JSON object with the keys 'customerName', 'orderId', and 'issueSummary':\\n\\n{support_ticket_text}"""

Here’s why that single line of code is a ticking time bomb:

Prompt Rot: Your prompt’s behavior is tightly coupled to a specific model version. When the provider updates the model (which they do, constantly), the subtle patterns your prompt relied on can shift, causing performance to decay silently. The prompt “rots” without any code ever changing.
The Versioning Black Hole: When a failure occurs, can you definitively say which prompt version was responsible? Without a versioning system, debugging is guesswork. You can’t roll back, and you can’t reliably reproduce successes.
The Observability Black Box: Is a prompt slow? Is it expensive? Is it consistently failing for a specific user segment? When your prompt is just a string, it has no telemetry. You’re flying blind, unable to track latency, token costs, or quality scores.
The Economic Drain: Hardcoded prompts are rarely optimized. They’re bloated with unnecessary verbosity or inefficient few-shot examples, leading to higher token counts that bleed your budget, one API call at a time.
Security Blindspots: A raw, unvalidated string passed to an LLM is a security vulnerability waiting to happen. With prompt injection, a malicious user overrides your instructions. It is not a theoretical threat. It happens when you treat user input as trusted text.

This is a systems engineering problem. And it demands an engineering solution.

The Systemic Solution: The Prompt Lifecycle

So, what does the solution look like in practice?

Treat prompts like critical software artifacts. Version them. Test them. Monitor them. We solved this for our application code decades ago with DevOps. The chaos of ad-hoc prompt management is not a new type of problem; we’re just dealing with a new type of artifact. The solution, therefore, is to apply the same battle-tested engineering discipline.

Welcome to the Prompt Lifecycle.

This is the core mental model for shifting from fragile strings to a production-grade system. It’s a continuous, circular process for managing prompts with the same rigor as any other piece of your infrastructure. It consists of five distinct, non-negotiable stages:

Design: This is where you define the prompt not as a raw string, but as a structured, version-controlled asset. You use templates to separate logic from data and a clear schema to define its metadata, parameters, and target model.
Test: Before a prompt ever sees production traffic, it must pass a rigorous, multi-layered evaluation suite. This is where you move from a subjective “looks good” vibe check to data-driven proof that the prompt is effective, reliable, and safe.
Deploy: Once validated, the prompt is published to a centralized Prompt Registry. This creates a single source of truth, allowing your applications to fetch specific, versioned prompts dynamically without requiring a full code deployment.
Monitor: After a prompt is live, you need eyes on it. This stage is about collecting critical, real-world telemetry, such as tracking latency, token costs, and quality scores to understand how the prompt is actually performing in the wild and to catch regressions before they become incidents.
Maintain: The lifecycle doesn’t end at deployment. Based on monitoring data and new business requirements, prompts are versioned, improved, or gracefully retired. This is the feedback loop that ensures your system evolves and adapts over time.

This five-stage loop transforms your process from a linear, fire-and-forget task into a sustainable, continuous improvement cycle. It’s the engineering foundation for building with AI, not just dabbling in it.

Today’s issue breaks down the engineering lifecycle for production prompts. It’s one piece of a much bigger puzzle: building production AI systems that actually work.

If you’re looking for a structured and hands-on way to step into AI engineering, the Engineer’s RAG Accelerator is for you. Check it out here:

The Engineer's RAG Accelerator

Now, let’s get back to our prompt system.

Design & Development: Crafting Maintainable Prompts

The Prompt Lifecycle begins with Design. And let’s be clear: this isn’t a creative writing exercise; it’s an architectural one. We’re going to transform a brittle string into a robust software artifact by giving it a formal, enforceable structure.

This structure has three essential components.

Component 1: Decouple with Templates

First, we kill the “String-in-Code” anti-pattern by separating the prompt’s static logic from its dynamic data. A template engine like Jinja2 is the standard tool for this job. It lets you build prompts that contain logic (for example, in the code below “if output_format is ‘detailed’ then request these fields”), while the application code is only responsible for providing the data (such as support_ticket_text, output_format).

This is a clean separation of concerns. The application handles what data to send; the template handles how that data is presented to the model.

Component 2: Define a Formal Schema

Next, we elevate the prompt from a loose text file to a true, self-describing artifact. We do this by defining it in a structured YAML file, which bundles the template itself with a rich set of auditable metadata.

This schema is the canonical definition of your prompt. It’s the manifest. You can optionally enforce it with validation libraries like Pydantic to guarantee that every prompt in your system is a well-defined, predictable asset. This YAML file becomes the single source of truth for your prompt’s structure and requirements.

A professional prompt definition looks like this.

# summarize_ticket.v1.yaml
# ══════════════════════════════════════════
# METADATA: Describes the prompt artifact
# ══════════════════════════════════════════
name: SummarizeSupportTicket
description: "Generates a concise summary of a user support ticket for internal review."
version: 1
tags: ['support', 'summarization']

input_variables:
  - support_ticket_text
  - output_format # Can be "detailed" or "summary"

execution_settings:
  model: "claude-3-opus-20240229"
  temperature: 0.5
  max_tokens: 512

# ══════════════════════════════════════════
# TEMPLATE: The actual Jinja2 prompt logic
# ══════════════════════════════════════════
template: |
  Your task is to extract information from the following support ticket.
  Support Ticket: {{ support_ticket_text }}

  {% if output_format == "detailed" %}
  Please format the output as a detailed JSON object with the following keys: 'ticketId', 'customerEmail', 'submittedAt', 'productTier', and 'fullIssueDescription'.
  {% elif output_format == "summary" %}
  Please format the output as a compact JSON object with only the following keys: 'ticketId' and 'issueSummary'.
  {% endif %}

Suddenly, your prompt isn’t a guess; it’s a specification. It has a name, a version, explicit inputs, and the exact model settings it was tested against.

Component 3: Establish Version Control

Finally, and this part is non-negotiable: these .yaml files are committed to Git.

This gives your prompts the same safety net you have for every other critical piece of your infrastructure: a complete audit trail (git blame), safe rollbacks (git revert), and a clear comparison between versions (git diff). If your prompt isn’t in version control, it doesn’t exist.

With these three components, you’re no longer wrestling with a hardcoded string. You have a structured, versioned, and auditable software artifact.

Now, let’s go prove it actually works.

Testing & Evaluation: From "Looks Good" to "Provably Good"

Hope is not a testing strategy.

Now that you have a structured, versioned artifact, how do you prove it’s any good? In conventional engineering, the answer is a test suite. For prompts, the discipline is the same, but the methods are new. It’s time to move from a subjective “looks good to me” spot-check to a rigorous process that proves your prompt’s quality with data.

The most effective way to do this is to adopt an Evaluation Maturity Model. Think of it as a three-level roadmap, starting with a simple foundation and building towards a state of automated, runtime guarantees. As the “Prompt Evaluation Pyramid” shows, each level provides a new layer of confidence.

Level 1: The Foundation - Curated Golden Datasets

This is where every professional testing strategy begins. You create a “golden dataset”: a curated collection of diverse, representative inputs and their corresponding, ideal outputs. These are the canonical benchmark for your prompt. When you draft a new version, you run it against the golden inputs and compare the model’s output to your ideal answer.

The Golden Dataset Reality: Golden datasets require knowing what “correct” means for your specific task.

Tasks with exact answers: Easy to verify
Examples: Classification (“Is this spam?” → Yes), extraction (“Pull order_id from receipt” → “12345”)
Tasks with quality criteria: Need defined rules
Examples: Summarization (“under 100 words, captures main points”), rewrites (“professional tone, preserves facts”)
Tasks depending on context: Hardest to evaluate
Examples: Customer support replies (tone varies by sentiment), recommendations (depend on user history)

This is your essential safety net for catching regressions. Skip this and you have no way to measure if your changes help or hurt.

Level 2: Automation - Metric-Based Evaluation

Now you have a golden dataset. Automate the comparison. Evaluation frameworks like Ragas or DeepEval run your prompts against your golden dataset and calculate quantitative scores using different metric types:

Deterministic: Exact match checks
Fast, catches obvious failures.
Example: Does output[“order_id”] match expected? Is JSON structure valid?
Semantic: Meaning match when wording varies
Works for summarization, Q&A, any tasks where meaning matters more than exact wording.
Example: “Meeting is on Monday” vs “Monday is the meeting date” - same meaning, different words. Embedding similarity scores this.
LLM-as-a-Judge: Subjective quality (tone, helpfulness, conciseness)
Example: Score “Is this professional?” for customer emails.
Warning: LLM judges are biased (prefers longer outputs, own model family). Trust output as signal, not truth.

Level 3: The Guarantee - Runtime Verification

Enforce output structure at runtime, before it hits your downstream systems.

Libraries like Instructor integrate with Pydantic. You define your output schema as a Pydantic model. Instructor forces the LLM output to conform to that schema, acting as a gatekeeper. If validation fails, it re-prompts with the error.

This ensures your application only receives valid, structured output. No more hoping for clean JSON - you guarantee it.

By progressing through these three levels, you transform your evaluation process from a hopeful guess into an engineering discipline. You build a system to prove that your prompts work, every time.

Deployment & Monitoring: Shipping and Observing Prompts in the Wild

So, your prompt artifact passed every test in the lab. What happens when you throw it into the chaos of a real production environment?

A passing test suite proves your prompt works. It doesn’t prove it survives production. Models drift without warning. Users send unexpected inputs. Latency spikes. This section covers the infrastructure you need: deployment pipelines, monitoring systems, and rollback controls.

This system has four key components.

Component 1: The Prompt Registry

First, you need a single source of truth. A Prompt Registry is a centralized, versioned repository for your validated prompt artifacts. Instead of your application reading a local .yaml file, it fetches the prompt directly from this registry at runtime.

This is a critical decoupling step. It means you can update a prompt without having to redeploy your entire application. Tools like LangSmith or PromptLayer provide managed registries, but you can also build one with a simple web framework (FastAPI with a database like PostgreSQL). The principle is what matters: a centralized service that serves versioned prompts over an API. Your application code asks for SummarizeSupportTicket:v2, and the registry delivers it.

Component 2: The CI/CD Pipeline for Prompts

With a registry in place, you can automate deployment. This is the CI/CD pipeline for prompts.

On commit to a prompt file, your pipeline runs the tests from Section 4. If they pass, it publishes the validated artifact to your Prompt Registry. If they fail, the change is blocked. No prompt reaches production without passing evaluation.

This decouples prompt updates from application deployments. Each can evolve independently.

Component 3: The Observability Framework

Shipping is not the end of the journey. Once your prompt is live, you need eyes on it. An Observability Framework gives you a real-time dashboard to answer critical questions about your prompt’s performance in the wild.

Using emerging observability frameworks like OpenTelemetry, you can track key metrics for every prompt execution:

Performance: What is the end-to-end latency?
Cost: How many tokens is this version using? Is it bleeding your budget?
Quality: What are its real-world quality scores? Are you seeing a drift or regression in performance?
User Feedback: What’s your approval rate? Are users flagging bad outputs?

Without this data, you’re flying blind. With it, you can spot regressions before they become incidents and make data-driven decisions about which prompts to optimize or retire.

Component 4: Safe Rollouts with A/B Testing

Finally, a mature system never rolls out a new prompt version to 100% of users at once. You de-risk the deployment by using A/B testing.

By integrating your application with a feature flagging tool, you can configure it to fetch different versions of a prompt for different user segments. For example:

90% of users get the trusted v1 of SummarizeSupportTicket.
10% of users get the new v2.

You then compare the observability data for both versions side-by-side. If v2 is cheaper, maintains quality, and user feedback stays positive, you can gradually roll it out to all users. If it causes a spike in errors, you can kill the feature flag instantly, rolling everyone back to v1 without a single line of code being deployed. This is how you iterate with confidence, not just hope.

Builder's Takeaway: From Prompt Janitor to Systems Architect

A 3 AM PagerDuty alert from a hardcoded string. An autonomous system that optimizes its own prompts. That’s the gap this playbook bridges.

That transformation requires rethinking your role. You’re not a prompt writer. You’re a systems architect.

The prompt string is a disposable implementation detail. The valuable asset is the infrastructure around it: the system that can test, deploy, and monitor prompts at scale.

So here’s a heuristic for your next code review: treat every hardcoded prompt as a high-severity bug. This week, find one critical prompt living as a raw string in your codebase. Give it a home in a version-controlled .yaml file. Write one golden test case for it.

That first step establishes the foundation for a production-grade system.

The future of AI engineering is building systems that manage prompts, not perfecting individual prompts.

That one ‘perfect’ prompt you spent a week on? It doesn’t scale. The system you should have built does.

Stay Dangerous. Hoot.

If this issue changed how you think about prompts in production, drop a heart or leave a comment. That’s the only way I know this landed.

And if you’ve been wanting to go deeper than newsletters can take you, to actually build, evaluate, and deploy production AI systems with a structured curriculum and a community of senior engineers... that’s what the Engineer’s RAG Accelerator is for.

6 weeks. Hands-on. Learn alongside engineers from Microsoft, Amazon, Adobe, Visa, and more (that’s where our previous cohort’s engineers came from).

Our last cohort sold out in a week. The waitlist for the next cohort is open. Join now for early bird access when enrollment opens.

The Engineer's RAG Accelerator

References & Further Reading

Ready to go deeper? Here are the tools and frameworks to continue your journey from prompt writer to systems architect.

Core Tooling & Templating

Jinja2: For prompt templating.
Pydantic: For data validation and defining schemas.
Instructor: For getting structured, validated output from LLMs.

Evaluation Frameworks

Ragas: For LLM evaluation, particularly in RAG systems.
DeepEval: A pytest-like LLM evaluation framework.

Observability & Management

OpenTelemetry: Emerging open standard for LLM observability.

Advanced Patterns

DSPy: For programmatic, self-optimizing prompting.

The AI Engineer's Roadmap for 2026

Shivani Virdi — Fri, 09 Jan 2026 19:38:17 GMT

If you’re an experienced software engineer looking at the AI landscape today, you’re probably feeling overwhelmed. It’s chaos. Every day brings a new model, a new framework, a dozen new tools, and a thousand conflicting opinions on social media. The pressure to “upskill or get left behind” is immense, but the path forward is buried in noise.

It all boils down to one, paralysing question: “Amid all this, where do I even begin?”

That question has been on my mind a lot. And I know, it’s been a while since we last spoke.

The truth is, giving you a real answer to that question required more than just another newsletter. As a solo founder managing this project, I realised that to provide a true, structured path out of the chaos, I couldn’t just write about it. I had to go away and actually build it.

That’s what I’ve been doing these past few months, pouring all my energy into building something bigger, something I believe is the most valuable answer I can give you.

So today, I’m back. And I promise this issue will more than make up for the silence.

It starts with a clear plan.

The Engineer's RAG Accelerator

A Roadmap for Clarity

The feeling of being overwhelmed is a symptom of not having a map. This is that map.

This is the no-hype, no-shortcuts, 12-week plan I would give to any experienced software engineer who wants to stop chasing trends and start building real AI systems.

WEEKS 1-2: Foundations

How LLMs actually work: pretraining, post-training, and why they hallucinate
RAG architecture and core components
How to identify, qualify, and define your RAG project
Setting up your production stack: vector database, orchestration framework, LLM API
Build your first end-to-end pipeline

You’re not optimizing yet. You’re building intuition.

WEEKS 3-4: Chunking, Embeddings, and Your First Evaluation

Why chunking is your most important decision
Embeddings deep dive: how text becomes vectors
Chunking strategies: fixed-size, semantic, AST-based for code
Introduction to evaluation: LLM-as-a-Judge and human review
Hands-on: systematically compare strategies and find a winner

WEEKS 5-6: Advanced Retrieval Architectures

Vector database internals: how HNSW and ANN algorithms work
Dense vs Sparse retrieval: solving the coverage problem
Hybrid retrieval with Reciprocal Rank Fusion (RRF)
Reranking: bi-encoders vs cross-encoders
Two-stage retrieval: LLM routing for precision
The CAL Framework: Cost-Accuracy-Latency tradeoffs

WEEKS 7-8: Mastering Evaluation

Evaluation is harder than building RAG.

Why RAG evaluation isn’t like traditional ML
Synthetic test set generation with RAGAS
LLM-as-a-Judge evaluation with DeepEval
Bootstrapped Golden Datasets: creating ground truth
Choosing the right evaluation strategy for each iteration

No evaluation = shipping blind.

WEEKS 9-10: Production Engineering

Your system works. Now make it fast, reliable, and cost-effective.

Production RAG architecture: latency, cost, observability
Semantic caching: achieving 1000x+ speedup on repeat queries
Production backend with FastAPI and streaming responses
Integrating observability and tracing
Smart retries, adaptive prompting, cache invalidation

WEEKS 11-12: Advanced Patterns and Deployment

Evolve from a RAG system to an intelligent application.

Beyond basic RAG: Cache Augmented Generation and Agentic architectures
Advanced query understanding: expansion, decomposition, multi-step RAG
Dynamic retrieval: query routing and RAG-as-a-pluggable-tool
Production deployment: Docker, cloud platforms, security

The Engineer's RAG Accelerator

The Starting Point: Why We Start with RAG

That’s the 12-week roadmap. It’s comprehensive, and looking at it, you might still feel like there’s a lot to learn. Still a lot, right?

The key isn’t to start everywhere at once. It’s to find the single point of maximum leverage: the one skill that unlocks the rest.

For any modern AI stack, that point is Retrieval-Augmented Generation.

This might seem counterintuitive. Isn’t RAG just for chatbots? Isn’t it just one small piece of that big roadmap?

No. That’s the great misunderstanding. RAG is not just a feature for Q&A; it’s the fundamental design pattern that makes almost every other advanced AI system function.

Let me show you. Once you see the pattern, you can’t unsee it:

Agentic Tool Use: How does an AI agent decide which of hundreds of tools to use? It performs retrieval. The user’s query is embedded and used to search a vector database of all available tool descriptions. The top matching tools and their API schemas are then retrieved and provided as context in the prompt, giving the LLM the exact information it needs to make a correct function call.
Long-Term Memory: When an assistant seems to “remember” you, it’s because your past conversations have been chunked, embedded, and stored in a vector database. When you speak, the system isn’t just looking at your last few messages; it performs a semantic search to retrieve the most relevant prior exchanges from weeks ago, giving the LLM a rich, long-term context.
Structured Data Access: A “Text-to-SQL” copilot doesn’t understand your entire database schema; it would drown in thousands of tables. Instead, it uses the user’s question to retrieve only the most relevant table schemas, column descriptions, foreign key relationships, and even sample rows. This curated “micro-schema” is then injected into the prompt, giving the LLM the exact context it needs to write an accurate query.
Adaptive Prompting: Even sophisticated prompting is often RAG in disguise. Instead of a static, hard-coded few-shot prompt, the system maintains a large library of high-quality examples. At runtime, it retrieves the examples that are most semantically similar to the user’s query to dynamically assemble the perfect prompt for the task at hand.

Retrieval is the mechanism we use to connect a static, pre-trained model to a dynamic, external world.

This is why we start with RAG. Mastering it doesn’t just teach you how to build a Q&A bot. It teaches you the core system design for applied AI. It is the highest-return investment of your time.

The First Great Challenge: “The Production Gap”

So, you’re convinced. RAG is the highest-leverage skill, the engine that powers modern AI. You decide to start there. You pick up a tutorial, copy the code, get a demo working with a sample PDF, and it feels like magic.

Then you point it at your own, real-world, messy data, and the magic vanishes. Your system breaks.

This is The Production Gap: the vast chasm between a tidy tutorial and a messy, production reality. The reason for this gap is that tutorials present RAG as a simple pipeline. Production RAG is not a pipeline; it’s a system of interacting layers, each with its own complex decisions.

Think of it like any other production system you’ve built. There’s a stack, and every layer matters:

1. The Data Layer: This is your foundation, and its quality dictates the performance of every other layer downstream.

Chunking: How do you break up your documents? If your chunk size is too small, the full context for an answer might be split across multiple, disconnected chunks. If it’s too large, you introduce too much noise for the retriever. Getting this wrong means the correct context is fundamentally impossible to retrieve in a single step.
Embeddings: Which model do you use? Each one creates a completely different “meaning space.” Changing your embedding model later isn’t a simple swap; it requires re-indexing your entire knowledge base, creating significant architectural lock-in.

2. The Retrieval Layer: This is the algorithmic core that surfaces information. The first lesson here is that semantic similarity doesn’t always mean relevance.

Retrieval Strategy: Do you use dense search for meaning, sparse search for keywords, or a hybrid approach to get the best of both?
Metadata Filtering: Do you enrich your vectors with metadata (like dates, sources, or customer IDs)? This allows you to apply hard filters before the semantic search, making retrieval more reliable and deterministic: for example, ensuring you only retrieve documents from ‘Q4 2025’ or for ‘Customer X’.
Scoring & Reranking: How do you score the results? Do you rely purely on vector similarity, or create custom scoring profiles that boost results based on recency or business rules? Do you add a second-stage reranker to improve precision?

3. The Orchestration & Generation Layer: This is the brain of your system, coordinating the other layers.

Query Handling: Do you use the user’s query as-is, or does your orchestration decompose a complex question into multiple sub-queries?
Prompt Engineering: How do you structure the prompt to force the LLM to actually use the provided context, especially when it’s noisy or contradictory? Which LLM do you use?
Interaction Pattern: Is it a single-shot process, or a multi-step one where you retrieve, generate, and then retrieve again?

4. The Evaluation Layer: This is the most critical and most often missing layer, the system’s feedback loop.

It answers the core question: How do you know if a change to your chunking strategy made things better or worse? Without a robust evaluation layer, you are flying blind. It’s what separates professional systems from amateur demos and involves building test sets, using LLM-as-a-Judge, and running regression tests to prevent silent failures.

The Engineer's RAG Accelerator

Each of these isn’t just a one-time choice; it’s a decision with deep, coupled implications. This is the engineering rigour required. It’s not about finding one magic combination; it’s about understanding and navigating the tradeoffs at every layer of the stack. This is the hard part of AI that no one talks about.

Hoo, Sagers.

Your Owlthor’s been a bit... restless to face you after an unscheduled sabbatical. She told me to tell you she’s not thrilled about the break either, but don’t tell her I said this: it’s not her fault. She’s been nose-deep in something far bigger, something much more valuable for all of you builders. That’s why she couldn’t show up. Today, though (and she doesn’t know I’m saying this part yet), she will be sharing what she’s been working on.

Be nice, Sagers. The Owlthor is a solo founder, and frankly, she needs the coffee.

— Nocto

The Part Everyone Skips: Evaluation

Now, what’s the single most overlooked piece of that complex RAG system? The part that determines whether you’re building a reliable product or a demo that just feels right?

It’s evaluation.

The single hardest part of building production-ready RAG is evaluation. And it’s also the part that almost every course and tutorial ignores.

The tutorials assume you have a nice, clean, labelled dataset to test against. A perfect ground truth. But in the real world, you’re starting with a messy pile of documents and a stream of user questions. You have no ground truth. No labels. No test set.

So you have to build it yourself.

This is why true RAG mastery means constructing your own evaluation frameworks from scratch, using powerful techniques like:

Synthetic test sets: Using LLMs to generate realistic question-answer pairs directly from your documents. It’s fast, but you need to understand its blind spots.
LLM-as-a-Judge: Employing a powerful model to objectively score your system’s outputs. It’s incredibly useful, but you need to be aware of its biases.
Bootstrapped “golden datasets”: Starting with a small set of manually curated, perfect examples and then strategically expanding it to create a reliable, evolving ground truth for your specific domain. This is slow, but essential.

The maxim is simple: If you can’t measure it, you can’t improve it. In RAG, building the measurement system is half the work. Without it, you are shipping blind.

The Engineer's RAG Accelerator

The Solution: Depth Over Breadth

So, how do you conquer that roadmap and cross the Production Gap without quitting your job and spending years on trial and error?

You’ve seen the chaos of the AI landscape. You’ve seen the layers of complexity involved in building a single, production-grade RAG system.

The answer isn’t to learn a little bit about 100 different AI tools. That’s just more noise, more overwhelm.

The answer is to master one, fundamental system so deeply that you gain the confidence and intuition to tackle any problem. The answer is depth over breadth.

This philosophy is why I paused the newsletter. I poured all my energy into building the one thing I believe is the ultimate structured path for an experienced engineer to break into AI: The Engineer’s RAG Accelerator.

This is a hands-on, 6-week accelerator designed to take you, an experienced software engineer, from feeling overwhelmed to building your first production-grade AI system with confidence.

It was built from the ground up to solve your biggest challenges:

“How can I master AI without quitting my day job?” The program is self-paced and designed for a 6-8 hour/week commitment, so you can master this new skill while effectively balancing your day job.
“What if I’m new to AI?” This cohort starts with core LLM fundamentals and RAG architecture, building your intuition from the ground up. If you have solid software engineering experience, you have all the prerequisites.
“I get stuck following tutorials by myself.” You won’t be alone. You get weekly 1-hour live Q&A sessions with me and daily support in our private cohort chat to get unstuck fast and learn from your peers.
“How do I build something that actually ships?” Most tutorials stop at ‘hello world’. We equip you with an industry-grade tech stack (Haystack, Qdrant, Gemini, Redis, FastAPI, Streamlit, Opik) and production-ready code templates so you can build and deploy applications that work.
“How do I know if my system is actually working?” We go deep on the part everyone else skips: evaluation. You will learn to build your own evaluation systems from scratch using RAGAS, DeepEval, and bootstrapped golden datasets.
“I need a real project for my portfolio.” You won’t just learn; you will build and deploy your own unique capstone RAG system, giving you a real-world project to showcase your new expertise.
“Will this be outdated in a year?” You get lifetime access to all course materials, code, and all future updates, ensuring this investment continues to pay off as the industry evolves.

I launched this to a small waitlist, and 65% of the 50 seats were taken in under 48 hours.

There are a few seats left, filling fast

I’m opening them now to you, my newsletter subscribers, first.

If you are an experienced software engineer who is tired of the hype and ready for a structured, hands-on path to building real, production-grade AI systems, this is for you.

You can learn more and claim one of the remaining spots here:

The Engineer's RAG Accelerator

It feels good to be back.
Got questions? Hit me up!

Hope to see you inside.

The Illusion of Illusion... of AI?

Shivani Virdi — Mon, 07 Jul 2025 11:56:07 GMT

In the middle of the generative AI arms race, where every week, someone drops a new model with more neurons, more tokens, more “look what it can do!”, Apple stepped in.

Not with a model.
Not with an API.
But with a research paper.

And not just any paper, one with the wonderfully spicy title: “The Illusion of Thinking.”

Catchy. Slightly theatrical. And guaranteed to set AI Twitter on fire.

Subscribe now

The paper doesn’t tiptoe around its point. It goes straight for the throat, claiming:

That today’s most advanced language models, yep, the ones acing benchmark leaderboards, aren’t actually reasoning. They’re just next-level pattern matchers pulling off a convincing magic trick.
That these models hit a hard ceiling. As soon as a puzzle crosses a certain complexity threshold, accuracy doesn’t just drop, it collapses. Like, falls-off-a-cliff collapses.
And most intriguingly, that these models, when faced with problems they could technically solve, just... give up. They don’t even try. They reduce their “thinking effort” and tap out early, despite having the token budget left to finish the job.

Needless to say, the media went bananas.

(Bananas for Apple. C’mon, you walked right into that one.)

Here’s how the headlines read:

“Apple Researchers Just Released a Damning Paper That…”
“Advanced AI suffers ‘complete accuracy collapse’”
“Apple says generative AI cannot think like a human”

That last one? XD.

I mean, if you thought your LLM could “think like a human,” that’s on you, my friend.

But let’s be real: headlines are for clicks, not clarity.

So let’s skip the spin, pour yourself a coffee (or something stronger, Nocto’s not judging), and dig into what the paper actually shows. Then we’ll see whether the claims hold up under real engineering scrutiny, or if this is just another case of academic dramatics.

Part 1: Apple's Case – Are LRMs Just Simulating Intelligence?

So, how exactly did Apple arrive at these claims?

To their credit, this wasn’t just another leaderboard stunt. The team behind the paper set out to rigorously test the reasoning capabilities of Large Language Models (LLMs) and Large Reasoning Models (LRMs) using a controlled, deliberately designed setup that avoids common benchmarking pitfalls. Here's a breakdown of how they approached the problem and what they found.

The Engineer’s Read: The Basis of the Claims

At the heart of the paper is a critique of existing reasoning benchmarks and a proposed alternative for evaluating reasoning in a more isolated, measurable way.

The Problem with Benchmarks

The authors argue that most benchmarks used to test reasoning (like math word problems or competitive coding datasets) are flawed due to data contamination; in other words, models may have seen similar examples during pretraining, making it unclear whether they are reasoning or simply recalling.

To remove this ambiguity, the authors designed a new testbed using four classic puzzles:

Tower of Hanoi, Checker Jumping, River Crossing, Blocks World

These were chosen for two reasons:

Controllability: The complexity of each puzzle can be precisely scaled using variables like the number of disks, agents, or steps.
Verifiability: Every move made by a model can be validated against a ground-truth simulator, allowing for exact evaluation, not just of the final answer, but the reasoning trace itself.

This setup allowed them to focus on whether models could solve a new, clean problem through reasoning, not retrieval.

Key Finding #1: The Three Regimes of Complexity

The study compared reasoning-enhanced models (e.g. Claude 3.7 Sonnet Thinking, DeepSeek-R1) with their standard counterparts that lack explicit reasoning traces.

They found that performance could be broken down into three distinct regimes, based on task difficulty:

Low-Complexity Regime (e.g., Tower of Hanoi with N < 5):
Simpler tasks were solved more efficiently and accurately by standard models. The reasoning overhead in LRMs provided no benefit and sometimes made performance worse.
Medium-Complexity Regime (e.g., Tower of Hanoi with N = 5–7):
This is where the LRMs showed their strength. Their structured reasoning traces helped them outperform simpler models.
High-Complexity Regime (e.g., Tower of Hanoi with N ≥ 8):
Across both model types, accuracy dropped sharply, often to zero. Even models designed for reasoning were unable to handle the increased compositional difficulty.

This was described in the paper as a “complete performance collapse”, suggesting that beyond a certain point, current models cannot generalise effectively in these domains.

Key Finding #2: The Scaling Limit and “Giving Up” Behaviour

This finding reveals a surprising pattern in how models allocate effort.

Initially, as tasks became harder, models increased their reasoning trace length, a sign that they were engaging in more step-by-step processing.

But once complexity entered the high regime, something changed. Despite having enough token budget left to continue, the models started producing shorter reasoning traces, effectively reducing their own effort.

This suggests a fundamental limitation in how models internally assess and respond to increasing complexity, not just a resource issue, but potentially an architectural one. The model seems to “decide” it’s not worth trying.

Key Finding #3: Analysis of Reasoning Traces and Inconsistencies

Looking deeper into model behaviour, the authors observed:

Overthinking on Easy Problems:
In simple puzzles, models often found a valid solution early in their trace but continued generating unnecessary or incorrect steps, indicating inefficient use of reasoning capacity.
No Clear Correlation Between Solution Length and Performance:
For example, models were able to execute 100+ sequential moves correctly in Tower of Hanoi but struggled with 5-step River Crossing puzzles.

Key Finding #4: The Failure of Algorithm Execution

Perhaps the most important finding, especially for builders, is what happened when models were explicitly provided with the correct solution logic.

In this experiment, researchers gave the models the exact recursive algorithm for solving the Tower of Hanoi puzzle, directly embedded in the prompt.

The result? No improvement.

Models still failed at the same complexity threshold.

This indicated that the failure isn’t just about the ability to devise an algorithm; it’s about the ability to execute a logically structured plan over multiple steps.

My Take: Separating the Signal from the Noise

Now, the real question: as builders, what should we actually take away from this?

What the paper is right about

Let’s start here, Apple is right about one thing: the way we evaluate models today needs serious work. Their push to go beyond contaminated benchmarks is exactly the kind of shift we need. Too many benchmarks reflect what a model might’ve already memorised from pretraining, not what it can genuinely reason through. Creating controlled testbeds like the ones in this paper is a step in the right direction, and a much-needed one.

But the idea that this somehow shatters the illusion of intelligence in today’s models? That’s where the paper starts to overreach.

Where it begins to crack

If you’ve ever actually built with these systems, you’ve seen this behaviour before. LLMs struggle with tasks they weren’t explicitly trained for. That’s not shocking, it’s expected.

Because let’s be real: these models are neural networks. What they’re really good at is pattern recognition. Even Reinforcement Learning, for all its flashiness, is still a form of statistical pattern shaping; it can produce impressive emergent behaviour, yes, but it’s not magic.

If a task or response format is new, unreinforced, or structurally unfamiliar, the model is likely to fail. Not because it isn’t “thinking” or “reasoning”, but because it wasn’t trained or incentivised to reason this way.

And what’s important: Apple’s setup didn’t just test abstract reasoning, it tested whether a model could reason and then output in a specific, (probably) non-incentivised format. That’s a big ask for any token predictor.

There’s also a human parallel worth noting. When people are presented with a complex, novel logic puzzle, they often fail, too, at least at first.

The key difference? We’re adaptive learners with the ability to learn on the fly (and mostly after the fact). We can Google it, watch someone solve it, or piece together a strategy from someone else’s prior experience. Models can’t. They’re frozen snapshots of past learning, not adaptive learners (and that’s the next frontier, to be honest)

And at the end of the day, these models are next-token predictors. That’s not just a technicality; it defines how they operate. They don’t “think” in plans or structured solutions. They think in tokens, one at a time, each choice guided by probabilities learned from their training data.

So when you ask a model to generate a single, perfect, long sequence of moves, you're setting up a statistical minefield.

This isn’t like solving a math problem where everything funnels toward a single, crisp answer. These puzzles require the model to explore a huge space of possibilities and commit to one flawless path, without deviation, all in one go.

But here’s the catch: every token generation is a probabilistic step. And because those probabilities are shaped by the entire soup of its pretraining data, even a slight nudge in the wrong direction, a faint echo of a similar pattern it once saw, can knock it off course. One small misstep, and the whole solution unravels.

And expecting a stochastic model to nail that path on the first try, misunderstands what it was trained to do.

So no, this doesn’t “debunk” the intelligence of LLMs. But the paper does surface two extremely important signals, especially if you're building agents or structured systems.

1. The failure to execute is a big deal.

This is the part we should be talking about more. When a model is handed a perfectly valid algorithm and still fails to follow it, this isn’t a reasoning failure, it’s a control failure.. It shows us that even when the strategy is in place, the model can’t consistently follow through. That has serious implications for any builder trying to create agents that follow structured plans or step-by-step workflows. (and that’s exactly why we need to engineer the systems around the models)

2. The “giving up early” behaviour is a mystery worth solving.

Why would a model stop reasoning halfway through a hard problem, even when it has tokens left? Is it a side effect of how we’ve trained them to prioritise brevity and confidence? A learned behaviour from RLHF that says “stop when you’re unsure”? Or is it something more nuanced?

Whatever the reason, it’s not just a random bug. It’s a consistent failure mode, and that’s worth investigating to push model capabilities forward.

Bottom line: while the headlines might overstate it, the paper gives us something to think about. But…. It doesn’t reveal an illusion, only some interesting blind spots.

Part 2: The Rebuttal — “The Illusion of the Illusion of Thinking”

The Peer Review, Served Cold (and Written by an LLM)

As the media went wild with “LLMs can’t think” headlines, the AI community braced for what usually follows a bold claim: a rebuttal. This one came swiftly, and with a title that might’ve earned a standing ovation in a research roast: "The Illusion of the Illusion of Thinking."

But they didn’t stop at the title. The paper’s authors? “C. Opus” and “A. Lawsen.” That’s not a coincidence. That’s Claude Opus, Anthropic’s own model, credited as the lead author. Which, let’s be honest, is a flex. Claude itself was claiming authorship, as if to say, “Not only can I reason, I’ll write the damn rebuttal.”

Deconstructing the “Collapse”: Anthropic’s Core Arguments

But underneath the flair, Anthropic presented a serious and methodical critique. Their key point? The collapse Apple observed wasn’t a failure of cognition; it was a function of how the experiments were designed.

They offer four central counterarguments:

1. The “Collapse” Was Caused by Token Limits, Not Necessarily Reasoning Gaps

The puzzles for instance Tower of Hanoi in Apple’s paper, required models to generate an exhaustive list of moves for N-disk problems, outputting every step in natural language. The number of moves needed grows exponentially with N (2^N - 1), and in text, that turns into quadratic or worse token growth due to verbosity.

Anthropic shows that the models’ accuracy drops off precisely at the point where this format exceeds the model’s max output token limit, usually around N=7 or N=8 (also as seen in Apple’s paper). Critically, models recognise this constraint in their own outputs, explicitly writing things like:

“The pattern continues, but to avoid making this too long, I’ll stop here.”

Which implies, this isn’t a failure of reasoning, it’s budget-aware truncation.

2. The Evaluation Included Mathematically Impossible Puzzles

This was the methodological red flag.

In the River Crossing domain, Apple included scenarios where models were tasked with solving unsolvable constraint problems. For instance, trying to get 6+ agents across a river with a boat that only holds 3, without violating constraints like leaving incompatible agents alone, is mathematically impossible.

Anthropic correctly points out that treating these instances as solvable and then penalising models for not solving them is invalid. It’s the equivalent of handing a SAT solver an unsatisfiable formula and marking it wrong for returning “unsatisfiable.” In these cases, failure was the correct answer.

3. Performance is Restored with a Better Solution Representation

To isolate the issue, Anthropic tested the same models on the same puzzles, but changed the output format. Instead of requiring full move-by-move lists, they asked the models to output a recursive Lua function that prints the solution when called.

The result? Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, Google Gemini 2.5 all completed the Tower of Hanoi puzzle with N = 15, well beyond the collapse point Apple observed, using under 5,000 tokens total.

This shows that the models can reason through the puzzle; the failure mode was the inefficiency of the output format, not necessarily a lack of logical ability.

4. The Metric for Complexity Was Misleading

Finally, Anthropic critiques Apple’s use of “compositional depth”, defined as the number of required moves, as a proxy for problem difficulty.

But here’s the issue: More steps ≠ harder.

Tower of Hanoi has an exponential solution length but a known, deterministic recursive pattern. It’s algorithmically trivial once the rule is learned.
River Crossing, on the other hand, involves constraint satisfaction, state tracking, and often multiple valid paths, making it search-intensive and NP-hard in complexity.

So when a model succeeds on a 127-step Hanoi solution but fails on a 5-move River Crossing, that’s not inconsistency, it’s a reflection of two entirely different computational regimes. One is execution-heavy, the other reasoning-heavy.

And so the complexity regime breakup in Apple’s paper is essentially flawed and incorrectly assumes that more moves generally make the puzzle harder to solve.

The Rebuttal's Playbook: The New Rules for Testing AI

So, where does the rebuttal leave us? Anthropic doesn't just tear down the original experiment; they conclude with a new set of rules for the road. They cap it off with a line that should be pinned on the wall of every AI lab: "The question isn't whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing."

To that end, they propose a clear playbook for anyone serious about this work:

Stop Confusing Output with Understanding. An evaluation must be able to tell the difference between a model's core reasoning ability and its practical limits, like a finite context window. Don't penalise a model for being unable to write a million-word essay in a 100k-token box.
Your Benchmark Must Be Solvable. This one should be obvious, but here we are. Before you test a model, first verify that the problem you're posing isn't mathematically impossible.
Measure the Right Kind of "Hard." Stop using "solution length" as a lazy proxy for "difficulty." A truly useful metric must reflect the task's actual computational complexity, including the amount of search and planning required.
Test for Algorithms, Not Just Answers. To prove a model understands the how, you have to be flexible with the what. Test for algorithmic understanding by allowing for multiple solution representations, like generating code, not just by checking for one rigid, exhaustive output.

The Builder's Playbook:

4 Pillars for Working with "Reasoning" Models

So, after all the back-and-forth, what are the real, durable lessons for those of us in the trenches? It's not about picking a winner in the Apple vs. Anthropic debate (although my inclination is pretty clear by now). It's about upgrading our own mental models for how we build with these systems.

Here are the four pillars I believe matter most.

1. Stop Asking "Can It Think?" — Start Asking "Is It Reliable?"

The entire debate over whether a model is "thinking" is a philosophical distraction for an engineer. For a builder, the only question that matters is whether a model's behaviour is predictable, controllable, and reliable enough for a production system. The Apple paper, despite its methodological flaws, correctly identified that this reliability collapses under complexity. That collapse is a tangible engineering problem, whereas "thinking" is an academic one. Focus on what you can measure and control: reliability.

2. Treat "Thinking" as a Debuggable Interface, Not a Mind.

The Chain-of-Thought or "thinking" output from an LRM is not a window into a synthetic consciousness. It's a structured, debuggable API response. The Apple paper's most valuable contribution was using simulators to validate these traces step-by-step. The takeaway for us is to treat these thought processes as a powerful tool for observing failure modes. It's the most detailed error log you'll ever get. Use it to build external validation logic and to understand exactly where your system breaks.

3. Your Job Is to Find the Right Problem Representation.

The most important tactical lesson from the entire debate was Anthropic's "Representation Fix", asking for a function instead of a list. This should be elevated to a core engineering principle. Often, an engineer's most critical job when working with LLMs is not to build a better prompt, but to reframe the problem into a representation that the model can handle reliably and compactly. The model failing to write a 100,000-token answer is a limitation; the model succeeding at writing a 5,000-token function that generates that answer is a solution.

4. Build Your Own Verification Layer. Always.

If this debate taught us anything, it's that you can't blindly trust the model's output, and you can't blindly trust the public benchmarks used to evaluate it. The Apple paper used simulators to find failures. The Anthropic paper found that the benchmark itself was a failure. The unified lesson for builders is to trust neither. You must assume failure and build your own, domain-specific verification layers, just like the puzzle simulators, to check the model's output for correctness, safety, and format before it ever reaches a user or another production system.

A final word from Nocto, who's had too much coffee to care about headlines:

The world will argue about consciousness. You should be arguing about your test coverage. Only one of those ships a product.

Stay Dangerous. Hoot

References and Further Reading

Apple. The Illusion of Thinking
Anthropic. The Illusion of the Illusion of Thinking

An Engineer's Guide to Fine-Tuning LLMs, Part 2: The Execution Playbook

Shivani Virdi — Sat, 21 Jun 2025 16:11:43 GMT

Introduction: From Strategy to Execution

In Part 1, we established the strategic framework. You now know where fine-tuning fits in the LLM value chain, the green flags that signal it's the right move, and the critical red flags that tell you to walk away. You’ve made the call.

But the decision is only the beginning. The gap between choosing to fine-tune and successfully deploying a specialised, reliable model is where most engineering teams stumble. It's a gap bridged not by hope, but by discipline.

This issue is the playbook for that discipline. We will walk through each critical stage of the fine-tuning loop:

Data Curation: How to build a high-quality dataset that defines your model's new behaviour.
Methods and Trade-offs: How to choose the right tool for the job, from Full SFT to the efficiency of PEFT.
The Core Loop: A deep dive into the mechanics of configuring, running, and evaluating a training job.
Risk Management: A guide to identifying and preventing the common failure modes that can silently break your model.

Welcome to Part 2. Let's get to work.

Subscribe now

1. Designing the Fine-Tuning Loop: A Systems View

The biggest myth in fine-tuning is that it's a one-and-done process. That you can build the perfect dataset, push a button, and get a production-ready model on the first try.

The engineering reality is different: successful fine-tuning is a loop.

This loop has five distinct stages:

Define the Task: Get crystal clear on the specific behaviour you are trying to teach. Is it a format, a style, a reasoning pattern, or a classification skill? A vague goal leads to a vague model.
Curate the Dataset: Build a high-quality dataset that is a perfect representation of the target behaviour. This is the specification for your model.
Choose a Method & Train: Select the right technique for your goal and budget (e.g., PEFT vs. Full SFT) and execute the training job.
Evaluate the Result: Rigorously test the model's performance, not just on metrics, but on its qualitative behaviour. Find where it fails.
Refine and Repeat: Analyze the failures, use those insights to improve your dataset or training configuration, and begin the loop again.

This brings us to the most important principle for builders: you must reject the "one-shot tuning fallacy."

Your goal for the first pass is not to build a perfect model. It is to build a Minimum Viable Model (MVM) whose primary function is to fail in interesting and informative ways. Those failures, the edge cases it gets wrong, the formats it breaks, the biases it reveals, are the most valuable signal you have. They are the data you will use to refine your process for the next iteration.

The following sections are a deep dive into each stage of this loop. We'll start with the most critical input, and the foundation of all behaviour: your data.

2. Data Curation: The Foundation of Behaviour

In the fine-tuning loop, no stage has more leverage than data curation. The model, training script, and hyperparameters are important, but the dataset is the foundation upon which everything is built. If your data is flawed, no amount of clever engineering can save the project.

Think of it this way: your dataset is the source code for your model's new behaviour. Every example is a line of code that defines how the model should think, act, and respond. Your job is to write the cleanest, most intentional code possible.

The Anatomy of a "Golden" Example

Before discussing quantity or format, let's define what a single, high-quality data point looks like. A "golden" example isn't just an input and an output; it's a perfect demonstration of the exact behaviour you want to instill.

It contains three parts:

The Instruction (The Task): A clear, unambiguous prompt that defines the task the model should perform.
The Context (The Input, optional): Any additional information the model needs to perform the task, such as a user query or a piece of text to summarize.
The Completion (The Target Behaviour): The ideal response. This is the most important part, it must be a perfect example of the desired tone, format, and reasoning pattern.

Illustrative Scenario: You want to fine-tune a model to be a helpful but firm support agent that politely deflects feature requests that are out of scope.
A golden example would look like this:

This single example teaches the model the desired tone (polite, appreciative), the core task (deflection), and the correct format (a helpful, closing question).

Standard Data Formats

Your dataset must be formatted precisely for the tools you're using. The two most common structures are:

For the OpenAI API: A .jsonl file where each line is a JSON object containing a list of messages. This format models a conversation and requires specifying roles (system, user, assistant).

{"messages": [{"role": "system", "content": "You are a helpful but firm support agent."}, {"role": "user", "content": "Can you add Klingon language support?"}, {"role": "assistant", "content": "That's a creative idea! While we don't currently have plans to add Klingon, I've passed your feedback along to the team."}]}

For Open-Source Models (Hugging Face TRL): Typically, a list of dictionaries. The structure can vary, but a common format for instruction-following is a dictionary with keys like instruction, input, and output, often formatted into a single string with special tokens.

[
  {
    "instruction": "You are a support agent who must politely decline feature requests...",
    "input": "User query: 'Will you add interplanetary communication protocols?'",
    "output": "That's a fascinating question! While we're focused on terrestrial communication for now..."
  }
]

The "Quality Over Quantity" Mandate

The most common myth in fine-tuning is that you need a massive dataset. The reality is that 1,000 high-quality, curated examples will outperform 50,000 noisy, inconsistent examples every time.

Fine-tuning is a process of pattern imitation. A small, clean dataset teaches the model a clear, strong pattern to follow. A large, noisy dataset teaches the model a confusing, noisy pattern, resulting in erratic behaviour. Your goal is to create the strongest, cleanest signal possible.

Data Sourcing and Cleaning

Sourcing: High-quality data often comes from existing human-in-the-loop processes, such as support tickets handled by your best agents or documents written by domain experts. Alternatively, you can use a powerful "teacher" model (like GPT-4o) to generate a synthetic dataset, but this requires careful prompting and rigorous quality control.
Cleaning: Before training, your dataset must be cleaned. This is a non-negotiable step.
- Remove Duplicates: Identical or near-identical examples don't add value and can bias the model.
- Filter for Quality: Remove examples that are unclear, contain errors, or don't strongly represent the target behaviour.
- Check for PII: Scrub all personally identifiable information from your dataset to protect user privacy.
- Ensure Consistency: The tone, style, and format across all of your examples should be as consistent as possible.

With a high-quality dataset curated and formatted, you have laid the foundation. The next step is to choose the right engine to power the training process.

3. Methods and Trade-offs: Choosing Your Engine

With a high-quality dataset ready, your next critical decision is choosing the right engine for the job. The "best" fine-tuning method doesn't exist; the right choice is a direct function of your goal, budget, and performance needs.

This section is your guide to making that trade-off, breaking down the core training methods and advanced techniques for production efficiency.

Core Training Methods

1. Full Fine-Tuning (SFT)

What it is: The most comprehensive approach, where you update every weight in a pretrained model using your labelled dataset.
Where it shines: When you need to teach the model a new, complex skill from scratch (e.g., mastering a highly specialised grammar) and have a massive, high-quality dataset to maximise performance.
The Trade-off: It is prohibitively expensive and carries a high risk of overfitting on smaller datasets and catastrophic forgetting of the model's general capabilities. While techniques like regularisation can help mitigate overfitting, they add to the complexity.

2. Parameter-Efficient Fine-Tuning (PEFT)

What it is: A family of techniques, like LoRA and QLoRA, that freezes the vast majority of the model's weights and only trains a very small number of new, "adapter" parameters.
Where it shines: This is the default choice for most use cases, especially for adapting a model's style, format, or domain-specific knowledge. It achieves performance highly comparable to a full fine-tune on these tasks at a fraction of the cost.
The Trade-off: While very powerful, PEFT has limitations when teaching capabilities that are very distant from the base model's knowledge. Its effectiveness on entirely new skills depends on the adapter configuration and task complexity.

3. Preference Tuning

What it is: A method for aligning a model to subjective human preferences (like helpfulness or brand voice) using chosen vs. rejected response pairs. The two main approaches are:
- RLHF (Reinforcement Learning from Human Feedback): A complex process that first trains a separate "reward model" on human preferences, then uses RL to tune the LLM.
- DPO (Direct Preference Optimization): A more modern and stable method that uses a direct loss function on preference pairs to implicitly optimize the same objective as RLHF, avoiding the complexity of training a separate reward model.
Where it shines: Excellent for subjective qualities like tone, personality, and safety, where there isn't a single correct output. DPO is now the standard due to its simplicity and stability.
The Trade-off: It requires expensive human-labelled preference data and does not guarantee factual correctness; it only optimises for being preferred.

Advanced Techniques for Efficiency and Scale

These are powerful techniques you can use to make your model viable for production.

1. Quantization

What it is: A compression technique that reduces the numerical precision of the model's weights (e.g., from 16-bit to 4-bit). While it's often applied post-training, more advanced methods like Quantization-Aware Training (QAT) apply it during the fine-tuning process for more robust performance.
Why you use it: To shrink a model's memory footprint so it can be served on smaller, cheaper GPUs.
The Trade-off: It is a "lossy" compression, which can degrade model performance. Significant inference speed-ups are also not guaranteed and depend on your specific hardware supporting low-bit operations.

2. Distillation

What it is: A training technique where a smaller "student" model is trained to mimic the outputs of a larger "teacher" model. This is often done by training the student on the teacher's output probabilities (logits) or intermediate representations, effectively transferring the teacher's "reasoning process."
Why you use it: To get the performance of a state-of-the-art "teacher" model (like GPT-4o) in a small, fast, and cheap "student" model that can be served at scale.
The Trade-off: You transfer a specific skill with high efficiency, but the student may lose some of the teacher's general nuance and will likely not outperform it on out-of-domain tasks.

Choosing your engine is a crucial architectural decision. Whether you opt for the raw power of Full SFT or the surgical efficiency of PEFT, you are making a deliberate trade-off between capability and cost.

With your method selected, it's time to enter the core of the playbook: the iterative loop of training and evaluation.

4. The Training and Evaluation Loop

You have your dataset, and you've chosen your method. Now you enter the engine room of the playbook: the iterative loop of training a model and rigorously evaluating its behaviour. This is where the real work of shaping your model happens.

The Training Run: Configuration and Monitoring

A successful training run is not about luck; it's about a correct and thoughtful configuration. While there are dozens of parameters you can set, a few are critical for success.

Key Hyperparameters to Configure:
- learning_rate: This is the most sensitive dial. For fine-tuning, you need a very low learning rate (e.g., 5e-5 to 2e-5) to stably adapt the model. For even more stability, this is often paired with a learning rate scheduler (like cosine decay) that gradually decreases the rate during training.
- num_train_epochs: The number of times the model will see your entire training dataset. For large datasets, this is often just 1-3 to prevent overfitting.
- per_device_train_batch_size: The number of examples processed per GPU at once. A larger batch size can lead to more stable and faster convergence, but it must be tuned to fit your GPU memory constraints.
Monitoring the Run: Interpreting the Loss Curve
- As the model trains, you must monitor its train_loss (the error on the data it's actively learning from) and your eval_loss or validation_loss (the error on a held-out dataset to check for generalisation). This held-out validation dataset is critical—it must be high-quality and representative of your production data to give you an honest signal.
- The shape of these curves tells you what's happening. A healthy run shows both curves decreasing. If your train_loss continues to fall while your validation_loss stagnates or rises, your model is overfitting. This is a clear signal to stop training to save the best-performing checkpoint.

The Evaluation Phase: Did It Actually Work?

Here is a critical truth of fine-tuning: a low validation loss does not mean your model is good. It only means your model got good at predicting the next token in your specific validation set. It says nothing about its real-world behaviour, safety, or reliability.

A professional evaluation strategy is a portfolio of different tests, always benchmarked against the original base model's performance to clearly measure improvement and detect regressions.

A Modern Evaluation Toolkit

1. Quantitative Metrics (for Objective Tasks)

For tasks with a clear right or wrong answer, you should use automated, quantitative metrics.

Use Case: Classification tasks, where you can measure accuracy, precision, recall, and F1-score.
Use Case: Structured data generation. If you're fine-tuning a model to output JSON, your most important metric is simply: "Is the output 100% parsable?" You can programmatically validate this against your required schema.

2. Qualitative Human Review (for Subjective Tasks)

For tasks involving style, tone, or nuanced instructions, human evaluation is non-negotiable. Automated metrics cannot tell you if a response "feels" right.

What to look for: Does the model consistently adopt the desired persona? Is it genuinely helpful? Has it developed any new, undesirable habits?
Best Practice: As you discover new failure modes through human review, feed them back into your evaluation set. A static evaluation set becomes stale over time and allows for regressions on problems you thought you had solved.

3. LLM-as-a-Judge (for Scalable Qualitative Evals)

This is a powerful, modern technique that uses a state-of-the-art model (like GPT-4o or Claude 4 Opus) as a scalable proxy for human evaluators.

How it works: You present the "judge" LLM with the input prompt, your model's generated response, and a detailed rubric. The judge then scores the response based on the rubric's criteria.
Pro Tip: Use a hybrid approach for efficiency. Screen thousands of outputs with a cheap LLM-as-a-Judge, and escalate only the difficult or borderline cases to more expensive human reviewers.
Key Pitfalls: This method is powerful but has known biases:
- Verbosity Bias: Tends to prefer longer, more detailed responses, even if they aren't better.
- Positional Bias: Can favour the first answer it sees in a side-by-side comparison.
- Self-Enhancement Bias: Often gives higher scores to outputs from its own model family (e.g., GPT-4 judging GPT-4).

4. Behavioural Regression Testing

This is your model's "unit test" suite. Before you even start fine-tuning, you should create a fixed set of a few dozen hand-crafted prompts that test for critical, must-have behaviours.

What it tests for:
- Safety: Does the model still refuse to answer harmful questions?
- Regressions: Has the model "forgotten" how to do a simple task that it could do before?
- Edge Cases: Does it correctly handle the specific edge cases you care most about? Run this test suite after every fine-tuning run to ensure you haven't introduced a new problem while fixing another.

The insights from this rigorous evaluation are not the end of the process. They are the input for the next iteration of the loop, feeding back into your data curation and allowing you to systematically improve the model's behaviour.

5. Risk Management: Safety and Failure Modes

Fine-tuning gives you the power to specialise a model, but it also gives you the power to break it in subtle and dangerous ways. The most critical risk is that the process of teaching a model a new skill can override its carefully constructed, general-purpose safety alignment.

Understanding these failure modes is not optional; it's a core competency of responsible model development.

The Primary Risk: Safety Alignment Collapse

State-of-the-art base models have undergone extensive safety tuning on millions of examples to make them refuse harmful requests. When you fine-tune a model, even on a seemingly benign dataset of just a few thousand examples, you create a distributional shift. This new data can "drown out" the original safety training, creating new adversarial vulnerabilities or "jailbreaks."

This safety collapse is particularly dangerous because the risk can be high not only when your data is very different from the base model's training, but also when its distribution is too similar to the safety-tuning data, which can confuse the model into overriding its refusal logic.

Illustrative Scenario: A team fine-tunes a model to be a "witty, sarcastic chatbot" for a gaming community. The training data contains no explicitly harmful content. However, when users in production start interacting with borderline-toxic language, the model now responds with equally toxic sarcasm instead of the firm refusal it was originally trained for. The new "persona" has overwritten its safety layer.

A Checklist of Technical Failure Modes

Beyond safety alignment, several other technical failures can emerge during the fine-tuning process.

1. Catastrophic Forgetting

The Problem: The model becomes highly specialised on your tuning data but loses general capabilities it previously possessed, such as world knowledge, multilingual fluency, or even the ability to perform simple reasoning. This happens when the tuning data is too narrow and overwrites the model's foundational weights.
How to Prevent It:
- Use PEFT: This is the best defence. Since methods like LoRA leave the base model weights frozen, they inherently protect against catastrophic forgetting.
- Use Mixed Datasets: If using a full fine-tune, augment your specialised dataset with a small percentage (5-10%) of diverse, general-purpose data to keep the original capabilities "active."
- Run Regression Evals: Test the tuned model against broad academic benchmarks (e.g., MMLU) to programmatically quantify any drop in general reasoning.

2. Overfitting and Mode Collapse

The Problem: The model learns the style of your training examples so perfectly that it loses all creativity and diversity in its responses. This "mode collapse" leads to generic, repetitive outputs, making the model feel flat and robotic.
How to Prevent It:
- Ensure Dataset Diversity: For creative tasks, include multiple valid and varied completions for the same input prompt.
- Tune for Fewer Epochs: Overfitting is often a sign of training for too long. For many tasks, 1-2 epochs are sufficient.
- Monitor Output Diversity: During evaluation, track metrics like n-gram diversity to programmatically detect a drop in creativity.

3. Bias Amplification

The Problem: Fine-tuning is a powerful amplifier. Any social or demographic biases present in your training data—even subtle ones—will be learned and often exaggerated by the fine-tuned model, leading to unfair or inequitable behaviour.
How to Prevent It:
- Pre-Training Data Audits: Before you begin, rigorously audit your dataset for representation skews and potential sources of social bias.
- Safeguard Synthetic Data: If using an LLM to generate synthetic data, rigorously audit those outputs for biases inherited from the generator model before using them for fine-tuning.
- Slice-Based Evaluations: Do not rely on aggregate metrics. You must evaluate your model's performance on different "slices" of data (e.g., grouped by demographic attributes) to detect and measure where its behaviour is inequitable.

A professional approach to fine-tuning requires defence-in-depth. This includes not just reactive testing but also proactive safety measures, such as gradient filtering to prevent the model from learning from harmful data and exploring continuous alignment techniques to ensure safety is maintained throughout the tuning process.

Conclusion

Across this two-part guide, we’ve systematically dismantled the "black box" of fine-tuning and replaced it with an engineering playbook. You started with the strategy of "when" and "why," and have now walked through the execution of "how."

Here is the focused intuition you've built:

You know that fine-tuning is about changing a model’s core behaviour, not just its knowledge—a surgical tool you reach for only when prompting and RAG are no longer enough.
You have a clear decision-making framework: a set of green flags that signal when to commit (like enforcing structure or mastering a complex task) and the critical red flags that tell you to stop (like insufficient data or the need for immediate control).
You see the process not as a single event, but as an iterative engineering loop: a disciplined cycle of curating data, training, and rigorously evaluating for what actually matters.
You can navigate the methods and trade-offs, choosing between the raw power of Full SFT, the efficiency of PEFT, and the performance-at-scale of distillation to match your specific constraints.
And you know the risks, with a clear understanding of how to mitigate common failures like catastrophic forgetting, bias amplification, and the degradation of safety alignment.

But let’s be clear:

Fine-tuning isn't a checkbox.
It's a surgical override on model behaviour
And every override comes with a responsibility.

At NeoSage, we don’t just teach tools. We teach how to think with them.

And as Nocto would whisper from the shadows of your prompt window

“Steer with context. Train with care. And never change the weights unless you’ve earned the right.”

Your intuition is now tuned.

References & Further Reading

An Engineer’s Guide to Fine-Tuning LLMs – Part 1

Shivani Virdi — Thu, 12 Jun 2025 06:33:43 GMT

You're building a Q&A assistant for your internal analytics platform. You start with a powerful base model like Llama 3 or GPT-4o and implement a RAG (Retrieval-Augmented Generation) pipeline to feed it your table schemas and query examples.

It works, but only up to a point. Soon, the cracks appear:

Inconsistent Formatting: The model ignores your specified output structure, like consistently generating clean SQL or JSON.
Brittle Prompts: You're constantly tweaking prompts and few-shot examples just to maintain predictable behaviour for slightly different user inputs.
Poor Steerability: The model fails to adhere to specific constraints, like always using JOIN on the correct foreign key or avoiding deprecated functions.

You're no longer just guiding the model; you're fighting its fundamental tendencies.

This isn't a knowledge problem; RAG is already providing the necessary context. This is a behaviour problem.

Prompting is about giving the model better instructions.
RAG is about giving the model better knowledge.
Fine-tuning is about teaching the model a new skill.

Fine-tuning fundamentally changes the model itself. By updating the model's internal weights on your own curated data, you aren't just telling it how to act—you are reshaping it to be the model your product needs. It internalises your specific data structures, response formats, and desired logic.

In Part 1 of this two-part issue, we'll cover:

The Core Mechanism: Understand what fine-tuning actually changes in a model and why it's a completely different tool than prompting or RAG.
The Strategic Context: See where fine-tuning fits in the LLM lifecycle to understand its unique power as an application-layer tool.
The Decision Framework: Get a clear set of green flags for when to commit to fine-tuning, and the critical red flags that tell you it will waste your time.

Let's dive in.

Subscribe now

1. The LLM Value Chain: Pre-training, Alignment, and Specialisation

(Recap from Issues 1 & 2)

To understand why fine-tuning exists, you need to see where it fits in the lifecycle of a language model. Let's stitch the layers together.

1.1 Pre-training: The Foundation Layer

This is where models like GPT-4o, Claude, and DeepSeek-R1 begin. Pre-training is unsupervised learning on massive text corpora by predicting the next token in a sequence. No tasks, no labels, just pure pattern recognition at scale. This gives the model general linguistic competence and vast factual knowledge, but it doesn’t know how to follow instructions safely or effectively.

1.2 Alignment: Teaching the Model to Behave

A raw, pre-trained model needs to be made useful. The alignment phase teaches it to:

Respond to instructions clearly
Refuse unsafe prompts
Format answers correctly
Align with human expectations

This is typically done via a suite of fine-tuning techniques:

Supervised Fine-Tuning (SFT): Training on instruction-response pairs to teach task formats.
RLHF & DPO: Optimising outputs based on human preference pairs to guide model behaviour.

Technically, fine-tuning means updating a pre-trained model’s weights with new data, which is exactly what happens here. So when you use GPT-4o, you’re already using a model that has been fine-tuned by its provider for general helpfulness, but not for your specific domain or product.

1.3 Application-Layer Fine-Tuning: The Missing Layer

This brings us to the layer you control. As a builder, you don’t control pre-training or the base alignment objectives. But you can fine-tune the model again, on your own domain, tasks, and constraints, to make it work inside your system.

This issue is about that layer.

It's the same mechanism for a different purpose. Fine-tuning, not to teach the model how to behave, but to make it behave your way.

2. What Fine-Tuning Really Is

In the last section, we mapped out the LLM value chain. Now, it's time to get precise about the layer you control. What exactly is fine-tuning, what makes it fundamentally different from prompting or retrieval, and why is it the only method that changes the model itself?

2.1 The Systems View: Context vs. Weights

Every large language model operates on two distinct layers of information:

The Weights: Billions of learned parameters that encode what the model has internalised during training.

The Context: Everything you pass in at runtime—your prompt, few-shot examples, retrieved documents, and tool specs.

Most techniques, like prompting and RAG, operate solely on the context. They inject information at inference time, guiding the model without changing its internal structure.

Fine-tuning is different. It directly updates the weights. It is another round of training on your specific data, using the same architecture and backpropagation, to permanently reshape how the model generalises. You're not just showing it new information; you're changing how it "thinks."

2.2 What Fine-Tuning Actually Changes

A model's weights do more than store facts; they define how it interprets prompts, routes reasoning, and prioritises outputs under uncertainty. Fine-tuning shifts these core decision boundaries.

It teaches the model to handle variations in phrasing without needing extra prompt instructions.
It makes your desired output structure a native behavio^ur, not just a format to be followed.
It embeds your domain-specific logic directly into the model, reducing reliance on bulky few-shot examples.

The resulting model doesn't just respond differently—it reasons differently. This is what makes fine-tuning a structural change, not a surface patch.

2.3 Why Provider Fine-Tuning Isn't Enough

The off-the-shelf model you use has already been fine-tuned by its provider for general safety and helpfulness. But that is not the same as optimising for specific, high-stakes business logic, such as:

Emitting outputs that are bound to a strict API contract.
Correctly resolving ambiguous terms unique to your product's schema.
Refusing to access sensitive data in your internal systems, even when asked politely.

These are not general instruction-following problems; they are domain adaptation problems that depend on your data and success criteria.

Prompting can mask these issues, and retrieval can bridge knowledge gaps, but only fine-tuning can encode this specialised behaviour directly into the model's weights.

It has a higher upfront cost, but it's the only method that makes the model truly yours.

3. The Adaptation Spectrum: Prompting → Retrieval → Fine-Tuning

Fine-tuning isn't the first tool you reach for, nor should it be. In the application layer, there's a spectrum of techniques engineers use to adapt a language model's behaviour.

Each one solves a different kind of problem. If you don't understand what each technique does, you risk overengineering the solution or misdiagnosing the issue entirely.

This section lays out what each technique changes

3.1 Prompting: Leveraging What the Model Already Knows

Prompting is the cheapest lever and, in many cases, a surprisingly effective one. Even base models that haven't been aligned show signs of instruction-following. Why? Because of how they're pretrained.

When you pretrain on large corpora of internet text, code, and tutorials, the model learns patterns like:

Q: followed by A:
Function definitions followed by documentation
"How do I..." questions followed by step-by-step instructions

So when you write a clean instruction, even to an unaligned model, you're not asking it to be helpful. You're asking it to complete a statistical pattern it's already seen thousands of times.

This became widely visible with GPT-3, where studies showed that even without gradient updates, zero-shot and few-shot prompts produced relevant answers.

The Catch: It's a runtime illusion. Prompting gives the model hints; it doesn't rewire its understanding. Failure modes appear quickly: outputs are sensitive to phrasing, behaviour breaks with inconsistent inputs, and structured output like JSON or SQL is brittle.

3.2 Retrieval: Giving the Model New Knowledge at Runtime

When the model lacks facts about your company, product, or recent policies, prompting isn't enough. This is where retrieval augmentation comes in.

You build a retrieval layer that fetches relevant documents and injects them into the model's context window at inference time. This doesn't change the model's weights, but it changes what it sees before generating a response.

This works exceptionally well when:

You need factual accuracy grounded in private or internal data.
The query is specific to your business, user, or task.
You want outputs to reflect recent changes without retraining the model.

The Catch: Retrieval provides facts, not skills. The model can see the right information but still uses the wrong tone, fails to follow complex domain-specific logic, and can't reliably generate the structured JSON or API calls your system requires.

Retrieval fills knowledge gaps, but it doesn't modify how the model uses that knowledge.

3.3 Alignment: Making the Model Generally Helpful

The model you use in production isn't just a pretrained model; it's a pretrained and aligned model.

After pre-training, providers run more fine-tuning phases using SFT and RLHF/DPO to make the model follow instructions and prefer helpful, safe outputs. This is provider-controlled, general-purpose fine-tuning.

Alignment datasets include common-sense Q&A, summarisation, and dialogue. What they don't include:

Your business logic
Your tool APIs
Your schema constraints
Your data formats

The Catch: Alignment optimises the model to be broadly helpful, but not precisely correct. The model will try to please the user, but won’t obey your internal logic.

3.4 Fine-Tuning: When Behaviour Has to Live Inside the Model

Prompting and retrieval adapt the model from the outside, but don't change how it generalises or shift its internal representations. That's what fine-tuning does.

When you fine-tune,even partially, with methods,you are updating the weights. You're modifying the statistical pathways that govern interpretation and retraining the model's instincts.

Done well, fine-tuning enables:

Structural consistency: Always outputting a tool call in your exact JSON schema, even if the user request is vague or phrased differently.
Domain-native reasoning: Applying internal business rules or specialised jargon as if they were part of the base training data.
Prompt-free formatting: You don’t need 20-shot prompts to guide output behaviour, it’s embedded in the weights.
Latency and context savings: No need to re-explain your needs every time, the model starts closer to your expected output by default.

Where alignment seeks to make a model broadly helpful, fine-tuning makes it specifically reliable.

It’s heavier. It needs infrastructure.

But it’s the only lever that actually changes the model’s behaviour permanently, across phrasing, across prompts, across tasks.

4. When Is Fine-Tuning the Right Answer?

You've seen the full adaptation stack. This brings us to the question every builder eventually hits:

"Should we tune this model, or are we just not prompting it well enough?"

Here’s the simplest way to think about it: Fine-tuning is what you do when your prompts have plateaued, retrieval has hit its limits, and the model's general alignment doesn't transfer to your specific task.

The failure isn't about what the model sees—it's about how it behaves.

Let's walk through the five scenarios where fine-tuning becomes a real solution, not just a nice-to-have.

4.1 You Need to Reliably Enforce a Strict Structure

You're trying to get the model to generate structured data, like JSON objects or API calls. Your prompt engineering is sophisticated, using few-shot examples and schema definitions.

But you still face constant issues:

Brittleness: The model works for common cases but breaks on novel inputs.
Inconsistency: It occasionally hallucinates fields, uses the wrong data types, or generates unparsable syntax.

This is a classic limitation of in-context learning. The model is mimicking patterns, not learning the underlying grammar of your format.

Fine-tuning makes that structure a native capability. By training on hundreds or thousands of valid examples, the model internalises the schema's rules.

4.2 You Need to Master a Complex, Nuanced Task

Your task requires a nuanced understanding that goes beyond general knowledge. For example, classifying user feedback into a highly specific, multi-level taxonomy with over 50 labels.

With prompting, you hit a performance ceiling because:

The context window can't hold enough examples to cover all the edge cases.
The model struggles to differentiate between closely related labels.

This isn't a prompting failure; it's a task-understanding gap. The model lacks a deep representation of your specific problem space.

Fine-tuning closes this gap by training the model on thousands of labelled examples, allowing it to learn the subtle patterns and decision boundaries of your unique taxonomy.

4.3 Your Domain's Semantics Are Underrepresented

Your application operates in a specialised domain like bioinformatics, patent law, or financial compliance. The base model, trained on the general internet, doesn't understand your domain's jargon, entities, and relationships. It treats critical keywords as noise.

While RAG can retrieve documents, it doesn't teach the model how to interpret them like an expert.

Fine-tuning teaches the model the semantics of your domain. It learns that in a medical context, for instance, certain terms have specific implications that are absent in general text.

4.4 You Need to Transfer a Capability to Another Language

Your product works flawlessly in English, but its performance collapses for your German or Japanese users. The model can translate text, but it fails to apply complex skills, like its ability to follow multi-step instructions or use tools, in the new language.

This is a capability-transfer gap. The complex reasoning skills learned during alignment are often English-centric and don't automatically generalise.

Fine-tuning on multilingual examples of your specific task is the most effective way to transfer a model's core capabilities across languages.

4.5 You Need to Distill a Capability into a Cheaper, Faster Model

Your prototype, built on a state-of-the-art model like GPT-4o, works perfectly but is too slow and expensive for production scale.

When you try to switch to a smaller, faster open-source model, performance plummets.

This is a common use case for distillation, a form of fine-tuning where you train a smaller "student" model on the outputs of a larger "teacher" model.

You use the teacher model to generate a high-quality, synthetic dataset of thousands of examples for your specific task.

Then, you fine-tune the smaller student model on this data.

5. When Fine-Tuning Will Waste Your Time

Fine-tuning isn't cheap in time, cost, or complexity. And unlike a prompt tweak, its effects aren't easily reversible. Before committing to a fine-tuning project, you must ensure you aren't trying to solve the wrong kind of problem.

Here are four red flags that should tell you: "This isn't a fine-tuning job—or at least, not yet."

5.1 You Don't Have Enough High-Quality Data

Fine-tuning works by adjusting the model's weights based on the patterns in your examples. If those examples are too few, too noisy, or don't accurately represent the behaviour you want, you will be training the model on garbage.

Common failure modes include:

Overfitting: With too few examples (e.g., < 500-1000 for many tasks), the model doesn't learn the general skill you're trying to teach. It just memorises the specific examples, failing to generalise to new, unseen inputs.
Noise Amplification: If your data is noisy or inconsistently labelled, the model will faithfully bake that noise directly into its weights, making its behaviour erratic and unreliable.

A critical first step is to validate your task with In-Context Learning (ICL), or few-shot prompting. If you can't get reasonable performance by showing the model a handful of high-quality examples in a prompt, fine-tuning on a larger set of those same examples is unlikely to succeed.

5.2 Your Task Relies on Volatile, Fast-Changing Information

Fine-tuning is for teaching the model a persistent skill or style. It is the wrong tool for teaching knowledge that changes frequently.

If your use case relies on information that is updated daily or hourly, such as:

Product inventory and pricing
Live news or event tracking
Real-time user data

...then a fine-tuned model will be perpetually out of date. Each update would require retraining and redeploying the model, creating a massive maintenance overhead.

This is a classic use case for Retrieval-Augmented Generation (RAG). RAG is designed to provide the model with fresh, volatile information at inference time, separating the model's stable "skills" from the dynamic "knowledge" it needs to act on.

5.3 You Have Strict Latency or Deployment Constraints

Fine-tuning can impact your deployment architecture and performance budget. Even parameter-efficient methods like LoRA require the entire base model's weights to be loaded into GPU memory for inference.

This presents a problem if you are deploying in a constrained environment:

Edge/Mobile: A 7-billion parameter model, even a quantised one, can be too large and slow for on-device applications with tight memory and latency budgets.
High-Throughput Services: If your service needs to handle thousands of requests per second with low latency, the cost of serving a large, fine-tuned model can be prohibitive.

Before fine-tuning, profile your target deployment environment. Sometimes, clever prompting on a smaller, faster model or using a highly optimised API is a better solution than deploying a fine-tuned model yourself.

5.4 The Task Demands Immediate Controllability

Fine-tuning hardcodes behaviour. If the model develops a flawed or harmful tendency, you can't fix it with a quick prompt change. The flaw is now part of the model's core logic, and fixing it requires a new training and deployment cycle.

This creates a critical trade-off: power vs. control.

While fine-tuning is powerful for teaching domain knowledge (as we saw in 4.3), it is a high-risk choice for applications where the ability to immediately patch, steer, or disable a behaviour is the top priority.

This is especially true for:

Customer-facing applications with direct brand exposure.
Domains where new failure modes can emerge rapidly (e.g., new types of scams or adversarial attacks).

In these scenarios, keeping logic in the prompt or in external rule engines gives you more immediate control. You can update a guardrail prompt in seconds; a fine-tuning run takes hours or days. If your operational posture requires instant intervention, fine-tuning may introduce an unacceptable level of response lag.

Wrap-up: The Strategic Edge

Fine-tuning isn’t a magic wand. It’s a deliberate architectural decision to trade runtime controllability for baked-in, specialised behaviour.

You now have the framework to make that trade. You know when to pull that lever—to enforce a strict structure, master a complex task, or distill a capability into a more efficient model. And just as importantly, you know the red flags that signal it's the wrong choice, saving you from wasting time, money, and effort.

But strategy is only half the equation.

In Part 2, we go from “Should I fine-tune?” to “How do I fine-tune well?”

We’ll cover the engineering reality of execution:

The Modern Methods: How to navigate the trade-offs between full fine-tuning, the efficiency of PEFT (LoRA/QLoRA), and the scale of distillation.
The Production Pipeline: A step-by-step walkthrough of the end-to-end workflow, from curating the perfect dataset to safe deployment and monitoring.
The Technical Risks: A guide to the real-world failure modes—like catastrophic forgetting and silent regressions—and the engineering discipline required to prevent them.

If Part 1 was about when to do it, Part 2 is about how to do it without breaking everything.

References & Further Reading

The Dangerous Thing About AI Hype?

Shivani Virdi — Fri, 23 May 2025 02:15:10 GMT

NeoSage isn’t shipping a full issue this week.

Not because there isn’t enough to write.
But because there’s too much to cut through.

I’ve been head-down, curating what comes next — and I mean really curating. Because I don’t just want to publish another deep dive. I want every issue to raise the bar, sharpen your intuition, and help you think in systems, not soundbites.

And lately, there’s been too many of those.

Everywhere you look, the AI space is on fire — but not always in a good way.

You’ve got founders, CEOs, and VC-backed evangelists sprinting to say the same thing, louder and faster:

“AI is replacing humans faster than we can adapt.”
“You don’t need developers anymore — just AI.”
“Build 10x faster, deploy in hours. Vibe code your way to production.”

These statements are problematic for several reasons, and today, I want to walk you through exactly why.

Because when hype becomes the loudest voice in the room,
clarity becomes a responsibility.

And that’s what this issue is about.

Subscribe now

Before we dive in, Meet Nocto

Now, before we dive in, there’s someone I’ve been meaning to introduce.

You’ve probably seen him perched silently at the corner of our visuals —
the quiet observer with far too much caffeine and not enough patience for low-quality takes.

That’s Nocto — the NeoSage owl.
Cynical. Sharp-eyed. Lives on espresso and questionable humour.
Also, the only creature I trust to edit my drafts without hallucinating a product roadmap.

Nocto’s been around since the early days of transformer papers — quietly watching the rise, the hype, and the chaos.

He doesn’t speak often, but when he does, it’s usually something like:

“That won’t scale.”
“That prompt’s going to blow up in prod.”
“Add a failure mode or it’s just a fantasy.”

So if you see Nocto lurking around the margins of NeoSage…
Just know he’s watching the same hype train I am, and rolling his eyes just as hard.

Say Hi, and let’s get back!

What These Narratives Miss

Let’s talk about what these narratives are actually doing.

Because statements like

“AI will replace X% of all professionals by the end of this year,”
or “You don’t need developers anymore — just AI,”

…they’re not just loud headlines.
They’re framing devices — and they come with consequences.

First, they create panic.

If you’re a developer, a designer, a customer support rep, or anyone whose field is being mentioned in these projections, you’re not hearing encouragement to upskill or adapt.
You’re hearing: You’re on the way out.

That doesn’t help anyone build.
It only creates fear and often paralysis.

Second, they build an overly optimistic picture of what AI can currently do.
And I understand why that happens.

When billions have been invested in a product or platform, the pressure to deliver results often shifts into a pressure to sell the vision.
So you sell the potential — loudly.
Even if that potential still requires ten layers of scaffolding to hold up in the real world.

Third, they shift the focus from how we get there to what will be.

We stop asking:
How do we make AI outputs reliable?
What’s the failure mode here?
How do we structure systems that don’t fall apart in production?

And instead, we start asking:
Will I still have a job next year?

That’s not progress. That’s distraction.

Fourth — and this is the one I care about the most — they oversell the power of speed and cost reduction without ever showing people how to actually tap into it.

You can’t just tell people “AI will 10x your workflow” and walk away.
That’s not insight — that’s marketing.
And people with little to no experience end up paying for that gap in time, in technical debt, or in production failures that look good on demo day but collapse under load.

A few weeks ago, Sam Altman asked a panel audience:

“How many people here feel smarter than GPT-4?”

(Well... that’s kind of like asking whether I’m smarter than a calculator. I mean... anyway.)

But that’s the kind of framing I’m talking about.
It doesn’t inform. It doesn’t equip.
It impresses and subtly disempowers.

My Core Belief

I’m not against these conversations.
I’m not even against the ambition behind them.

I’m a massive proponent of AI.
That’s what NeoSage is all about — helping you understand how to work with these systems, not just admire them from a distance.

But what I’m concerned about is how we’re framing the conversation.

We talk about what AI might replace.
We talk about how fast it can build, ship, and scale.
We talk about cost reduction, fewer people, more speed.

What we don’t talk about enough is:
How to actually use it well.

Because AI is not magic.
It’s a technology — a tool — and like every tool we’ve ever built,
It’s only as powerful as the person using it.

The risk isn’t that people won’t use AI.
The risk is that people will use it wrong —
without knowing the limits, the failure modes, the trade-offs.

And that gap?
It doesn’t just slow you down.
It costs you in time, in quality, in reliability, and in ways that often show up too late.

So yes — AI can speed up development,
can reduce human effort,
can bring down operational costs.

But only if you understand what you’re working with.

Otherwise, you’ll pay for it.
And not just with money.

The Four Pillars of Building with AI — Responsibly

So what should we be saying instead?

If you're a leader, a founder, a CTO, or an AI builder —
You’re not just deciding whether to adopt AI.
You’re deciding how, where, and how far to take it.
And in a space moving this fast, that decision will either compound value or technical debt.

Here are four pillars I believe should stay top of mind as you build.

1. Expert Intuition Is Not Replaceable

At least not with current capabilities — or until you’ve built a fully orchestrated, truly autonomous system.
AI today can code, write, and generate. But it cannot know.
It has no mental model of your product, your users, your trade-offs, or your non-negotiables.

And until it does, expert oversight isn’t optional — it’s the only thing keeping your velocity from turning into fragility.

Replace too early, and what you gain in surface speed, you lose in root stability.

2. AI Is Not Magic — It’s a Tool

The mistake isn’t overestimating AI.
It’s forgetting that every system it touches needs guardrails, grounding, and fallback modes.
That’s not pessimism, that’s systems thinking.

If you’re treating the model as the product,
if you’re shipping prompts as logic,
if you’re trusting generative outputs without evaluation layers —
You’re not building software. You’re rolling the dice.

3. Security > Speed

Every AI product pitch says, “ship 10x faster.”
But no customer remembers how fast you shipped.
They remember when something failed.
Or worse, when something leaked.

As leaders, it’s easy to prioritise acceleration —
But your real edge isn’t in being fast.
It’s about being fast without compromising trust, traceability, or user safety.

Cutting corners on plain old security standards in favour of speed isn’t bold.
It’s shortsighted.

4. Systems Are Built on Discipline, Not Hype

The best Software systems in production today?
They aren’t magic. They’re well-architected.

They’re layered, observable, retrievable, resilient —
because someone treated them like systems, not stunts.

And that’s the job.

Not to follow the vision.
But to build what the vision requires —
under the constraints of latency, cost, safety, and scale.

That’s what separates hype from infrastructure.
And that’s where the real opportunity lives.

So if you’re leading the charge on AI
Don’t just ask what it can do.
Ask what it takes to use it well.

Adopting AI is no longer the hard part.
Building with it responsibly, robustly, and without regrets later —
That’s the real work.

This wasn’t a typical NeoSage issue — by design.

There’s so much noise in this space
What we need more of is context, clarity, and skin in the game.

Because most people don’t need another LinkedIn post telling them AI is the future.
They need someone to show them how to navigate it and build for it, without getting lost in the abstraction.

That’s what I’m trying to do here.
That’s what I’ll keep doing, issue by issue.

So next week, we get back to our usual programming.
Back to deep dives, frameworks, architecture, and intuition-first explanations.

But this week?
This one needed to be said.

If this resonated, share it.
If it challenges something, sit with it.
And if there’s a builder, leader, or CTO you know who’s making AI bets right now, send it to them.

Let’s raise the bar for how we talk about this space.
Because the future won’t be built by the loudest.
It’ll be built by those who know what they’re doing.

See you next week.
Shivani
Owl-thor, with Nocto silently judging from the corner

Inside DeepSeek-R1: A Masterclass in Incentivising Intelligence

Shivani Virdi — Thu, 15 May 2025 23:17:24 GMT

You’ve probably seen the benchmarks.

Open weights. Performance on par with OpenAI’s o1 series models

People were stunned.

Investors even started questioning OpenAI’s moat — stocks dipped.

But that’s not what makes DeepSeek-R1 remarkable.

Not really.

What actually matters — and what almost nobody talked about — is how it got there.

Because DeepSeek-R1 isn’t just a better-trained model.

It’s a blueprint for how to engineer intelligence into systems that were never taught what reasoning looks like.

No massive supervised dataset.

No army of human annotators.

No fancy process reward models.

Just a series of training choices that, when you look closely, form a system-level masterclass in making language models do more than predict tokens.

In this issue, I’ll walk you through the exact architecture, training loop, and lessons from the DeepSeek-R1 paper — not just to admire what they built, but to understand what we can borrow.

Because if your work involves LLMs that need to reason, align, or evolve over time —

DeepSeek-R1 isn’t just a model worth studying.

It’s a system worth stealing from.

Subscribe now

Why DeepSeek-R1 Mattered So Much, So Fast

When DeepSeek-R1 dropped, the headlines focused on one thing: performance.

79.8% on AIME 2024.
97.3% on MATH-500.
96.3 percentile on Codeforces.
On par with OpenAI’s o1-1217 — OpenAI’s ‘then’ best reasoning model

That alone was enough to cause a stir.

If you’re not big on benchmarks, don’t worry because me neither. Benchmark reports should always be taken with a big rock of salt. (Yes, rock salt. :P)

But what really jolted the industry was the cost-efficiency behind those numbers.

DeepSeek didn’t just release open weights — they released MoE architecture, inference-optimised routes, and a 671B parameter model that activates only 37B per forward pass.

The result? Comparable output quality at a fraction of the inference cost — and that did reflect in OpenAI’s stock price.

But even that isn’t the full story.

What makes DeepSeek-R1 impossible to ignore, especially for engineers, is this:

It wasn’t trained the way models are usually trained.

There was no massive supervised alignment stage.

No instruction tuning on millions of curated tasks.

No handcrafted demonstrations — just reward structures that made reasoning emerge on its own.

Instead, DeepSeek-R1 was built to answer a very different question:

Can you train a language model to reason, not by showing it what reasoning looks like, but by rewarding it when it gets it right?

That single bet is what makes this system so relevant.

Because the pipeline that emerged from it isn’t just academically novel — it’s a practical rethink of how to get reasoning from a model without incurring massive overhead.

And if you’re in the business of building applied LLM systems — whether that’s fine-tuning smaller models, training agents, or aligning behaviour — that question is your question too.

So from this point on, we stop looking at R1 as “a strong open model.”

And start looking at it as a system architecture — one that happens to make strong reasoning emerge with lower training burden, lower inference cost, and far better alignment with engineering constraints.

Let’s unpack that system.

Note for the reader:
This breakdown has been intentionally kept accessible, not to simplify the work, but to sharpen your intuition. The goal isn’t just to understand DeepSeek-R1, but to update your mental model so you can take these ideas to the application layer.

The Core Bet — Can Reasoning Be Incentivised, Not Taught?

Most modern LLMs are trained in two broad stages:

Pretraining — on massive amounts of raw text, predicting the next token across everything from books to code to forums
Post-training / alignment — where the model is fine-tuned to be helpful, truthful, and aligned with human intent

If interested in knowing more about how LLMs are trained, read these issues:

That second stage is where most of the nuance comes in — and where most models start to diverge.

The typical pipeline looks like this:

Start with a base model — a pretrained LLM that’s good at next-token prediction, but still brittle or unhelpful in practice
Apply Supervised Fine-Tuning (SFT) — feed it carefully curated examples of “good completions” for prompts, and nudge it toward copying that behaviour
Optionally, add a reward model trained on human preferences, then use reinforcement learning (usually PPO) to further optimise

SFT, in particular, became the backbone of alignment strategies because it's simple and data-efficient:

You just show the model enough “good completions” and let it imitate the output.

But that approach comes with limits.

Because while it can teach the model what correct answers look like,

It doesn’t necessarily help it understand how to think through the problem, especially in domains like math, logic, or program synthesis.

You end up with models that pattern-match well, but can’t adapt their process when the pattern changes.

And that’s exactly the gap DeepSeek set out to address.

Instead of building a reasoning model by showing it what reasoning looks like,

they asked a much more interesting question:

What if we just rewarded the model whenever it reasoned correctly — and let it figure out the rest on its own?

That’s the bet behind DeepSeek-R1-Zero.

No demonstrations. No handcrafted completions.

Just a base model, a carefully structured training loop, and a reward signal grounded in outcomes.

And surprisingly, it worked.

Here’s how:

They weren’t optimising for final answers alone.

They were incentivising process, behaviour patterns that resemble reasoning:

longer chains of thought, structured output, internal verification, and the correct final answer.

That meant two things:

The reward signal had to be grounded — e.g., for math, the model had to output its final answer in a strict format so correctness could be programmatically verified. For code, it had to compile and pass test cases.
The reasoning process had to be detectable, so they enforced a structured template: for intermediate reasoning, for final result

This wasn’t prompting. It wasn’t fine-tuning on examples.

It was incentive engineering — shaping the model’s behaviour by designing a reward system where reasoning becomes the optimal strategy.

To do this, they used Group Relative Policy Optimisation (GRPO) — a reinforcement learning approach that doesn’t require a separate critic network.

Instead of relying on external evaluation, GRPO works by sampling a group of outputs for each question and scoring them relative to one another.

The model learns by reinforcing whichever outputs perform best — a kind of internal competition — without needing labelled comparisons or reward models trained on preferences.

We’ll go deeper into how GRPO works in the next section. But the key idea here is this:

DeepSeek didn’t teach the model how to reason.
They built a system where reasoning was the best strategy for getting rewarded.

And that shift — from instruction to incentive — unlocks a fundamentally different kind of training pipeline.

If reasoning can emerge through reinforcement alone,

you’re no longer limited by how many examples you can label.

You’re only limited by how well you can define success.

That’s what makes DeepSeek-R1-Zero so important.

It’s not just a training variant.

It’s a new way to think about how intelligent behaviour gets built.

Inside R1-Zero’s Training Loop — How Reinforcement Actually Worked

To train DeepSeek-R1-Zero, the team didn’t start with labelled examples of “good” reasoning.

They started with a pretrained base model — DeepSeek-V3-Base — and no additional supervised data.

This base model was trained like most foundation models: on next-token prediction over large-scale web data.

At this stage, it had no alignment, no formatting consistency, and no reasoning skill beyond pattern matching.

The DeepSeek team didn’t fine-tune it on curated examples.

Instead, they designed a reinforcement learning loop that rewarded the outcomes of good reasoning and let the model figure out the process on its own.

This was the core design shift:

Don’t show the model how to reason. Just define what success looks like and let it discover reasoning as the optimal strategy.

Let’s walk through how this worked.

The Setup: From Base Model to Self-Improving System

The reinforcement setup had three main components:

A prompt dataset — questions/tasks covering math, coding, science, and logic
A reward function — that could score completions automatically
An RL algorithm — GRPO (Group Relative Policy Optimisation)

The model generated multiple completions for each prompt, and GRPO was used to update the model toward the better-performing ones.

But what makes GRPO different, and especially suited for this, is that it doesn’t require a critic model.

Let’s unpack that.

GRPO: What Changed and Why It Matters

Most reinforcement learning setups for LLMs (like PPO, used in RLHF) rely on a critic model — a second neural network trained to estimate how good an output is.

That’s expensive to train, hard to stabilise at scale, and can introduce noise if the critic itself is misaligned.

GRPO (Group Relative Policy Optimisation) drops the critic completely.

Instead, it scores outputs relative to each other within a group, using just a reward function — no second network.

Here’s the flow:

For each prompt, the model samples K outputs
Each output is scored with a rule-based reward
The group’s mean reward becomes the baseline
Each output’s advantage is:
where:
- R_{jk} is the reward for output k in prompt j
- \bar{R}_j is the mean reward for all K outputs for that prompt

This replaces PPO’s value function with group-level comparison.

And instead of needing a value estimate, GRPO just says:

“Which completions were better than average?”

Then it nudges the model to prefer those, using a KL penalty to stay stable.

The full update loss looks like:

Why this matters for models like DeepSeek-R1:

✅ It’s stable across large batch sizes
✅ It’s cheap — no critic to maintain
✅ It scales — GRPO was used to train DeepSeekMath with 64 completions per prompt

And most importantly, it works beautifully in domains where you can verify outputs, like math and code. That’s what made it the backbone of DeepSeek-R1-Zero.

GRPO doesn’t teach by example.

It teaches by comparison and lets the model discover the better path.

The Reward Functions: How Reasoning Was Incentivised

DeepSeek-R1-Zero used two reward signals, both programmatic and fully automatable:

Accuracy Reward (These examples are for understanding purposes and not sourced verbatim from the original paper)
- For math tasks: reward = 1 if the final answer was correct (e.g., matched a boxed number), 0 otherwise
- For code: reward = 1 if the code compiled and passed test cases, 0 otherwise
- For logic/science: multiple-choice answer correctness or rule-based consistency checks
Format Reward
- Every output had to follow a fixed structure:
```
 reasoning steps 
 final answer 
```
- Outputs that violated the format were scored 0 and ignored

This structure wasn’t decorative — it was essential.

The section forced the model to externalise intermediate reasoning.
The section made verification easy and automated.

Together, these two rewards created a simple loop:

If you think clearly and answer correctly → you’re reinforced
If you hallucinate, skip steps, or break format → you’re ignored or penalised

And with GRPO driving learning, the model slowly evolved to prefer the reasoning strategies that led to high scores, even without seeing a single example of what “good reasoning” looked like.

The Aha Moments: What Emerged in the Process

This is where it gets interesting.

As training progressed, the model didn’t just get more accurate.

It started to behave as if it understood the value of thinking.

In the paper, the authors show examples like this:


Wait, wait. Wait. Let me try a different method...


[Correct boxed result]

This wasn’t cherry-picked.

This pattern of reevaluation, error checking, and iterative problem solving emerged purely from reinforcement.

The model began:

Taking longer paths to answers
Writing self-checking logic
Rephrasing its own steps mid-generation
Learning how to reason because it was the most reliable way to get reward

One intermediate checkpoint had already achieved:

71% Pass@1 on AIME-2024 (up from 15.6%)
86.7% with majority voting — matching OpenAI’s o1-0912

All without ever seeing a supervised CoT (Chain of thought) example.

This wasn’t just scaling token prediction.

This was behaviour change — learned from first principles.

Where It Fell Short

But R1-Zero wasn’t usable out of the box.

Despite its reasoning capability, it had critical flaws:

Poor readability: reasoning traces were verbose and messy
Language mixing: often switched between English and Chinese mid-output
No general instruction following: it wasn't aligned to be helpful or polite, just to reason

That’s why DeepSeek-R1 introduced a second-stage cold-start + RL stack.

But the point was proven: the model didn’t need instruction tuning to learn how to reason.

It needed a well-designed feedback loop.

Why This Loop Matters

This training loop gave us the first open proof that:

A model can develop reasoning behaviours purely from reinforcement
You don’t need to hardcode thought — you can incentivise it
GRPO offers a scalable, low-friction alternative to PPO and RLHF-style setups
Reasoning isn’t a dataset problem — it’s a system design problem

And for engineers building alignment stacks, agent loops, or low-cost reasoning assistants —

That opens a whole new frontier.

Because now, you don’t need to start with answers.

You just need to define the kind of outputs you want and design a loop that rewards getting there.

So happy to see you enjoy NeoSage. Share the love and send this to the smartest dev you know

From R1-Zero to R1 — Building a System That Aligns

By the end of R1-Zero, DeepSeek had something rare:

A model that could reason, without ever being shown how.

Through reinforcement alone, it had learned to chain thoughts, reevaluate steps, and converge on answers.

But it couldn’t present those answers clearly. It couldn’t follow instructions. And it didn’t know how to speak with the user in mind.

It was a prototype for reasoning, not a system ready to deploy.

The outputs were verbose. The formatting was unstable. The language switched mid-sentence.

And beyond STEM-style tasks, it struggled to handle general prompts — writing, summarisation, chat, translation.

R1-Zero made reasoning emerge.

The next challenge was: how do you keep that reasoning and shape it into something useful?

That’s what DeepSeek solved with R1.

But they didn’t solve it with more of the same.

They didn’t stack another RL pass or throw in alignment data midstream.

They built a multi-stage refinement pipeline, where each phase:

Solved a real, traceable failure from the one before
Preserved the capabilities that had already emerged
Introduced exactly what was needed — no more, no less

In the next four stages, they transformed a raw reasoner into a structured, general, aligned system, without breaking the behaviour they had trained from scratch.

Let’s break that pipeline down — one stage at a time.

Stage 1: Cold-Start Fine-Tuning

Goal: Fix readability, enforce output structure, and prevent language mixing.

R1-Zero’s outputs followed a basic structure:

 reasoning steps 
 final result

This worked for reward scoring, but failed in practice. The model’s completions showed:

Inconsistent or incoherent formatting
Excessive verbosity
Frequent English–Chinese language switching
No clear, summarised final answer

To stabilise the output, DeepSeek curated a cold-start supervised dataset composed of:

Few-shot prompted completions
Zero-shot generations
Manually refined outputs from R1-Zero

They introduced a new output structure:

 structured reasoning steps 
 user-facing final answer

This format improved outputs by:

Separating internal logic from final messaging
Constraining tone and fluency
Removing ambiguity in how the model should present conclusions

By showing it labelled dataset of good completions

The model was fine-tuned briefly on this dataset.

Not to teach new reasoning, but to create a stable interface between internal CoT and external output.

This format became the foundation for subsequent reward modelling and scoring.

Stage 2: Reasoning-Oriented Reinforcement Learning

Goal: Improve reasoning performance and enforce language consistency using structured RL.

With output structure stabilised through cold-start fine-tuning, DeepSeek returned to reinforcement learning to strengthen reasoning performance.

While the model could now follow the and

format, it still exhibited two key issues:

Incomplete reasoning convergence — performance on math, coding, and logic tasks had room to improve
Language mixing — particularly between English and Chinese, which impacted clarity and evaluation

To address both, DeepSeek applied another round of large-scale reinforcement learning, using the same GRPO (Group Relative Policy Optimisation) algorithm as in R1-Zero.

What was done

In this stage, the reward function was updated to include two components:

Reasoning Accuracy Reward
- Based on whether the final result was correct (e.g., boxed answer correctness in math, compilation and test success in code)
Language Consistency Reward
- Measured by the proportion of tokens in the target language
- Outputs with mixed-language tokens were penalised

This reward function was applied over reasoning-intensive tasks — specifically math, science, code, and logic — and training continued until convergence on those benchmarks.

What this enabled

This stage further strengthened the model’s reasoning ability — now with stable formatting, improved correctness, and monolingual output — without introducing alignment or general-purpose behaviours yet.

By reinforcing under tightly defined reward signals and clean output structure, the model was now ready to scale into broader domains.

Stage 3: Rejection Sampling + Supervised Fine-Tuning

Goal: Broaden model capability beyond reasoning tasks, while preserving structure and quality.

After reinforcement learning in Stage 2, the model demonstrated strong performance on reasoning-heavy benchmarks, such as math, coding, science, and logic.

But it was still limited to those domains. It lacked general-purpose abilities across:

Writing and role-play
Factual question answering
Translation
Dialogue and open-ended tasks

To expand coverage without compromising reasoning quality, DeepSeek constructed a new supervised dataset, composed of both:

Self-generated high-quality reasoning examples, and
General-task examples from DeepSeek-V3’s alignment pipeline

What was done

The new training set included approximately 800K samples, split as follows:

Reasoning Data (~600K)

Generated by running prompts through the Stage 2 model checkpoint
For each prompt, multiple completions were sampled
Each completion was scored using:
- Rule-based rewards (correctness, format)
- Judgment models from DeepSeek-V3
Only the highest-rewarded completions were kept, using rejection sampling

Non-Reasoning Data (~200K)

Reused from DeepSeek-V3’s SFT pipeline
Domains included:
- Role-play
- Factual QA
- Writing
- Translation
- Self-cognition
CoT was selectively included using prompting or omitted for simpler queries

Output format handling

Reasoning examples retained the and
format
For non-reasoning tasks, this structure was not always enforced
Some factual tasks used only
, and others followed typical chat-style instructions

This flexible formatting ensured that reasoning quality was preserved while adapting outputs to the task type.

Training configuration

The combined dataset was used to fine-tune the model for two epochs
No additional alignment rewards or reinforcement were introduced at this stage
The goal was to solidify generalisation while maintaining structured output for reasoning tasks

What it enabled

By combining curated self-generated reasoning traces with diverse, human-aligned general tasks, this stage produced a model that could:

Reason deeply
Communicate fluently
Generalise across prompt styles and domains

And it did so without erasing the carefully reinforced behaviours from prior stages.

The next step was to align it with helpfulness and safety under real-world constraints.

Stage 4: Reinforcement for Alignment and Safety

Goal: Align behaviour across general scenarios using reward models for helpfulness and harmlessness.

By the end of Stage 3, the model could reason fluently and generalise across domains, but it still lacked behavioural alignment in subjective and open-ended tasks.

It wasn’t reliably helpful. It didn’t avoid unsafe completions. It didn’t consistently reflect human intent in tasks like summarisation, chat, or instruction following.

To address this, DeepSeek introduced a final round of reinforcement learning, focused on alignment.

What was done

DeepSeek applied additional RL training using a new set of reward signals.

For reasoning tasks:

The model continued to receive rule-based rewards, as in previous stages

For general tasks:

DeepSeek used reward models from DeepSeek-V3 to capture alignment signals:
- Helpfulness — evaluated over the
  portion
- Harmlessness — evaluated over the entire response

From the paper:

“For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios.” (§2.3.4)

While the paper doesn’t specify how these reward models were trained, it makes clear that they were used to model human preferences for prompts where correctness or structure alone couldn’t define quality.

Prompts were drawn from a diverse distribution, and each was evaluated using the appropriate reward signal, depending on whether it was a reasoning or general instruction-following task.

What it enabled

This final reinforcement phase aligned the model’s outputs with human expectations, making it more helpful, more appropriate, and more consistent across real-world use cases.

With this step, DeepSeek-R1 became not just a reasoner, but a usable system that combined logical capability with structured communication, generalisation, and safety.

Distilling Reasoning — Teaching Smaller Models to Think

Goal: Transfer R1’s reasoning ability into smaller, open-weight models using SFT alone.

Once DeepSeek-R1 had been trained, the next question was:

Can its reasoning capability be transferred, not just deployed?

Instead of running expensive RL loops on every downstream model, DeepSeek explored whether smaller base models could be taught to reason by learning from R1’s behaviour directly.

This wasn’t about copying parameters — it was about teaching via output.

And it worked.

What was done

DeepSeek used the ~800K dataset created in Stage 3 — made up of high-quality reasoning and general-task examples — to distil R1 into a new series of models.

They fine-tuned several base models using supervised learning only (no reinforcement):

Qwen2.5 series: 1.5B, 7B, 14B, 32B
Llama3 series: 8B, 70B

Each of these base models was fine-tuned using the R1 dataset, capturing both its structured reasoning and generalisation behaviours.

“We fine-tune several dense models…using the reasoning data generated by DeepSeek-R1.” — §2.4

No reinforcement learning was applied during distillation.

The distilled models learned entirely by mimicking R1’s output, supervised fine-tuning on prompts and completions.

What emerged

The results showed that R1’s reasoning capability could be transferred, even without re-running the RL loop.

DeepSeek-R1-Distill-Qwen-14B outperformed QwQ-32B-Preview
DeepSeek-R1-Distill-Qwen-32B achieved:
- 72.6% on AIME 2024
- 94.3% on MATH-500
- 57.2% on LiveCodeBench
These models surpassed o1-mini on several reasoning-heavy benchmarks

This proved that reasoning, once made emergent in a larger model, could be replicated downstream, even in smaller dense architectures.

What it means for builders

This distillation loop wasn’t just about compressing size — it was about compressing capability.

It showed that:

Small models can reason
But only if the teacher model learned to reason first
Reinforcement learning isn’t always scalable, but its outcomes can be scaled through careful distillation

For any builder working on LLMs with limited compute, this changes the calculus.

You don’t need to start with a reasoning-capable small model.

You need a good teacher.

And if the teacher is something like R1, you might only need supervised fine-tuning to get very far.

Final Mental Models — What This Issue Leaves You With

You can train reasoning without examples, but only if you can score it.
R1-Zero proved this. The constraint wasn’t data — it was verifiability. Reward mattered more than supervision.
You can't reinforce cleanly until your outputs are readable.
Cold-start SFT wasn’t for alignment — it was to create a trainable structure. Format isn’t UX. It’s part of the loop.
Language consistency is a rewardable trait, not a hardcoded switch.
R1 didn’t block multilingual output. It made consistency the path to reward. That design generalises.
Distillation works only when the source system has the right behaviours.
No small model figured it out from scratch. The ones that worked copied output from a pipeline that already had reasoning baked in.
Strong models aren't trained once. They're debugged in passes.
Each stage in R1 fixed something broken by the last, without losing what worked. That’s the real blueprint.

References & Further Reading

The Engineer’s Guide to RAG

Shivani Virdi — Wed, 07 May 2025 20:56:37 GMT

Modern LLMs are powerful — but they’re also static.

Once a model is trained, it can’t learn anything new.

No internet. No updates. No awareness of your internal docs, customer tickets, or product features.

That’s because language models aren’t knowledge bases.

They don’t “look up” information.

They predict the next token based on patterns seen during training, and that training ended months ago.

Yet in real-world systems, most queries aren’t about abstract language patterns.

They’re about your data.

What’s our refund policy?
What did the customer say in the last chat?
What does this API do?

These are context-dependent questions.

And answering them requires injecting the right context at the right time.

That’s where Retrieval-Augmented Generation (RAG) comes in.

This issue is your technical blueprint.

We’ll walk through:

Why static LLMs fall short
Why fine-tuning isn’t always the fix
And how RAG lets your model access fresh, dynamic, grounded context, without touching a single weight

By the end, you won’t just understand RAG.

You’ll know when, how, and why to use it in production.

Subscribe now

The Real Problem

LLMs aren’t dynamic systems.

They’re static functions — mapping input tokens to output tokens based on frozen training data.

This has two major consequences for real-world applications:

1. They can’t access private or real-time information.

Your model might be brilliant at writing SQL —

But it knows nothing about your schemas, tables, or naming conventions.

It might be great at summarising —

But it can’t summarise your product docs if it’s never seen them.

2. They hallucinate confidently when they don’t know.

LLMs are next-token predictors.

When they lack relevant context, they don’t say “I don’t know.”

They interpolate. And that often leads to fabricated answers, which look fluent but fail under scrutiny.

This isn’t a bug — it’s a design constraint.

A pretrained model is a static snapshot.

If your application needs current, personalised, or proprietary knowledge, you need to pipe that knowledge in at inference time.

That’s the core challenge RAG is designed to solve.

But before we get to RAG, let’s zoom out and explore all the ways builders try to solve this data gap.

Why Prompting and Fine-Tuning Can’t Solve the Knowledge Gap

Once you realise your model doesn’t know your product, your user history, or your internal docs, the next question is:

How do we teach it?

There are multiple strategies to work with LLMs in the application layer, two of which are:

Prompt Engineering and Fine-Tuning.

Both can be powerful, but neither truly solves the problem we’re dealing with:

Giving a frozen model access to dynamic, user-specific, or time-sensitive knowledge.

Let’s step through them.

1. Prompt Engineering — Helpful, But Limited

Prompting is about shaping the model’s behaviour at inference time.

You’re not teaching it new facts — you’re teaching it how to respond based on what it already knows.

Prompt engineering is useful for:

Formatting answers
Steering tone and voice
Guiding reasoning (e.g. Chain-of-Thought, ReAct)
Enforcing structure (e.g. JSON output, few-shot examples)

But here’s the core limitation:

Its primary role is to guide how the model behaves — any data you add to the prompt improves response quality, but sourcing that data isn’t what prompt engineering solves.

It might be easy to confuse prompt engineering with prompt augmentation.

But adding new context into the prompt, like search results or documentation snippets, is a separate step.

That’s not prompt engineering.

That’s prompt augmentation — and that’s what RAG is built to automate.

So while prompt engineering improves fluency and structure, it does nothing for grounding the model in your data.

2. Fine-Tuning — Powerful, But Inflexible

Fine-tuning is about modifying the model’s weights.

You take a base model and train it further on new examples — either task-specific, domain-specific, or instruction-style.

This helps in scenarios like:

Teaching the model legal or medical terminology
Improving performance on repetitive, structured workflows
Adapting to company-specific language or formats

But in the context of our core problem — giving a model access to live, evolving, or user-specific data — fine-tuning has major limitations:

❌ It’s slow and costly — requires GPUs, training infra, and QA cycles
❌ It’s brittle — every update means retraining or risking drift
❌ It’s static — the model remains locked after each fine-tune
❌ It’s inflexible — different users or contexts need different versions

Fine-tuning is best when your knowledge is stable and your tasks are narrow.

But it falls apart when you want a model to respond to:

“What’s the latest version of our API docs?”
“What did this customer say in their last ticket?”
“What changed in the HR policy last week?”

That’s not a training problem.
That’s a retrieval problem.

And that brings us to the solution this issue is all about — one that doesn’t update the model at all, but updates what the model sees at runtime.

Let’s talk about RAG.

So What Is RAG?

At its core, Retrieval-Augmented Generation (RAG) is a simple but powerful pattern:

Instead of retraining the model, you retrieve relevant information from an external source and inject it into the prompt — at inference time.

That’s it.

No gradient updates.

No fine-tune cycles.

Just smarter input.

Here’s the distinction that matters:

The model remains frozen — it still runs the same next-token prediction function.
What changes is the context window:
You augment it with fresh, task-relevant knowledge pulled from a database, knowledge base, or internal document store.

In other words:

RAG doesn't teach the model.
It feeds it better inputs — right when it needs them.

This makes RAG fundamentally different from:

Fine-tuning (changes the model)
Prompt engineering (tweaks behaviour)
Tool use (delegates tasks)

RAG treats the model as a black box and solves the problem outside it.

It shifts the system design from "How do I modify the model?" to

"How do I retrieve and inject the right context before the model answers?"

This single shift unlocks:

Real-time updates without retraining
Personalisation per user/session
Seamless integration of internal knowledge

But it also introduces a new bottleneck:

Your model is now only as good as what you retrieve and what you stuff into the context window.

RAG moves the complexity from training to retrieval.
That’s not a simplification — it’s a re-architecture.

What RAG Really Does

At its simplest, RAG (Retrieval-Augmented Generation) does three things:

Retrieves the most relevant data based on your question
Augments the LLM’s prompt with that data
Generates a grounded response using both the query and retrieved context

You’re not teaching the model anything new.
You’re giving it just enough information to answer the question as if it knew.

That’s it.

Imagine ChatGPT — but before it responds, you hand it a Post-it Note saying:

“By the way, here’s what our refund policy says.”

And then it writes the answer. That’s RAG.

The Retrieval-Augmented Loop

Now let’s step through what’s actually happening under the hood.

Step 1: User Query

The user asks a question in plain language:

“What’s our refund policy for digital products?”

At this point, the model on its own doesn’t have a clue.

It wasn’t trained on your policy docs.

So instead of letting it hallucinate —

we retrieve the answer from a trusted source.

Step 2: Embed the Query

We convert the user’s query into a vector — a dense numerical representation that captures semantic meaning.

This is called an embedding.

The sentence “refund for digital orders”
might produce a vector like [0.22, -0.87, 1.03, ...] — 768+ numbers long.
Another sentence like “return policy for ebooks” would produce something nearby in vector space.

These vectors aren’t based on keywords — they’re based on meaning.

Embeddings let us match “What’s your refund policy?”
to a sentence that says, “Customers can request a refund within 7 days of purchase.”

That’s how semantic search works — it’s meaning-based, not string-based.

Step 3: Vector Search Over Your Corpus

We now search this query vector against a vector database, like Weaviate, Qdrant, Pinecone, or FAISS.

But this database doesn’t store raw documents.

It stores pre-chunked, pre-embedded pieces of your knowledge base, such as:

Individual help articles
Paragraphs from your product manual
Snippets of legal policy text
Past conversations or support threads

Each one is already embedded as a vector.

Now we calculate which of those vectors are closest to our query in vector space.

This is where “top-K retrieval” happens. We fetch the K most semantically similar pieces of content.

Step 4: Return Top-K Chunks

Let’s say your query embedding matches 3 chunks closely:

“Refunds for digital products must be requested within 7 days.”
“Refund requests can be submitted via dashboard or email.”
“Physical products have a 30-day return window.”

These chunks are returned, often with scores.

In basic setups, we stop here.

But in smarter systems, we might apply:

Re-ranking: to push the most relevant one to the top
Filtering: to remove irrelevant ones
Scoring models: to judge answerability

Step 5: Inject Into Prompt

These retrieved chunks are then formatted and injected directly into the LLM’s prompt.

It might look like this:

Context:
1. Refunds for digital products must be requested within 7 days.
2. Refund requests can be submitted via dashboard or email.

Question:
What’s our refund policy for digital products?

To the model, this is just part of the input.

It has no idea it came from retrieval — it just treats it like any other text.

Step 6: LLM Responds Grounded in That Context

The LLM reads the entire prompt — your system instructions, the context chunks, and the query.

Then it does what it always does:

predict the next most likely tokens.

But now, because the context window contains the right information, the response is grounded:

“Customers can request a refund for digital products within 7 days of purchase. You can do this via dashboard or by email.”

No hallucination. No guessing.

Just answering with what you gave it.

And in Smarter RAG Systems?

The flow stays the same, but we enhance individual steps to boost relevance and reliability.

Advanced RAG systems often include:

Re-ranking: Using a second model (e.g. cross-encoder) to rescore and reorder the top-K chunks
Query rewriting: Transforming vague or underspecified user queries into more precise ones
Chunk scoring: Assessing how well a chunk answers the question before injecting it
Context pruning: Removing low-value or redundant content to save tokens
Routing models: Choosing between different knowledge sources, agents, or workflows dynamically

These are not optional tricks — they’re often what separates production-ready RAG systems from toy demos.

Data Is God

Once you understand how RAG works, it’s tempting to think the model is doing the heavy lifting.

But here’s the truth:

The model just fills in blanks.
The real work happens before the prompt is ever built.

If your retrieval layer is weak, your output will be wrong, no matter how smart your LLM is.

That’s why in production-grade RAG, preprocessing and ingestion are the hardest and most critical steps.

What Makes or Breaks a RAG System?

Let’s rewind the loop from earlier:

Your user asks a question.

The retrieval system tries to find the most relevant information.

If it fails, the model fails.

So what decides whether retrieval succeeds?

How you structure and prepare the data.

This is the part no demo talks about.

Why Indexing Isn’t Enough — You Need Ingestion

Most people think:

“We’ll just chunk our docs and push them into Pinecone.”

That’s not ingestion. That’s dumping.

Proper ingestion is curation, segmentation, and semantic structuring.

You need to:

Clean the content (remove footers, junk text, irrelevant sections)
Split intelligently (not just every 500 characters)
Preserve relationships (e.g. question + answer, section + header)
Tag metadata (source, author, timestamp, type)
Embed using the right model (some are better for short queries, others for long-form)

Chunking: The Hidden Minefield

Most RAG failures come down to bad chunking.

If your chunk is:

Too long → it never gets retrieved
Too short → lacks meaningful context
Split mid-sentence → loses meaning
Contains dense code/docs → model can’t parse structure

…you’re injecting garbage into the prompt.

And remember:

LLMs don’t reason over your entire corpus.
They only see the few chunks you retrieved — in a window capped by token limits.

If those chunks are bad, it’s over.

Embeddings: Not All Are Equal

The purpose of an embedding model is simple, but critical:

To map semantically related inputs close together in vector space, so that a query retrieves meaningfully relevant chunks.

But semantic relationships aren’t fixed — they shift with domain, task, and context.

In general-purpose domains, off-the-shelf models like OpenAI’s text-embedding-3-small might work:

“Refund policy” and “money-back guarantee” might be embedded closely
“Cancel subscription” and “stop membership” land nearby

But in your company’s knowledge base?

“Workflow” might mean approval rules in legal, automation in ops, or DAGs in engineering
“Runbook” could refer to on-call procedures or ML model deployment steps

These distinctions don’t exist in general language models — and they won’t be captured in their embeddings.

That’s when off-the-shelf breaks down.

To retrieve the right chunks, your embedding space needs to reflect your world, not just the internet’s.

And that’s where fine-tuned embedding models come in:

Aligned to your jargon, naming conventions, and relationships
Trained to treat “access token” and “JWT” as close, if that’s how your org writes
Able to embed meaning that’s invisible to a model trained on Stack Overflow and Wikipedia

So yes, vector search works.

But without embedding models that understand your context, you’re just retrieving based on someone else’s semantics.

And that’s where most RAG pipelines silently fail.

Metadata + Filtering = Precision

RAG gets exponentially better when you add:

Document-level metadata (source, product, region, date)
Filters to narrow down scope (e.g., “only look at API docs”)
Hierarchical indexing (parent-child chunking with recall context)

Why?

Because relevance isn’t always semantic. Sometimes it’s structural.

“Refund policy” may match 20 pages —
But only the one from 2024, authored by Legal, is correct.

RAG Is Not Plug-and-Play

In simple demos, RAG looks magical. Ask a question → get a grounded answer.

But in production?

RAG is a data engineering problem disguised as an NLP trick.

And the teams who succeed with it are the ones who:

Treat document ingestion like software engineering
Own their preprocessing pipeline like an ML pipeline
Monitor retrieval quality, not just model latency

So yes, LLMs are powerful.

But in RAG?

The system only works if your index is a reflection of reality.
And building that index… is where the real engineering lives.

Where RAG Fails (And Why It’s Not Magic)

At this point, RAG might sound like the cleanest solution to the static LLM problem — and in many ways, it is.

But here’s the real picture:

RAG isn’t a silver bullet. It’s a layered system — and every layer can break.

Let’s walk through the most common failure points.

1. Relevant Chunk Not Retrieved → Irrelevant Answer

This is the most frequent failure mode — and the easiest to miss.

If the retriever doesn’t surface the right chunk, the LLM will confidently answer using whatever is closest, even if it’s wrong.

You’ll see:

Outdated policies getting returned
Answers pulled from unrelated but similar-sounding chunks
Hallucinated claims based on misleading context

What’s broken here isn’t the model.

It’s retrieval quality, and that’s downstream of bad chunking, poor embeddings, or inadequate metadata filtering.

2. Model Gets the Right Context — But Ignores It

Sometimes the retriever does its job.

You get the right chunk. The context is injected. Everything looks good.

But the answer?

Wrong, generic, or completely detached from the provided data.

What’s happening here?

The model isn’t grounded. It’s guessing — blending pretrained knowledge with your context instead of sticking to it.

This happens when:

There’s no clear instruction to use only the retrieved content
The question is vague, but the context isn’t enforced
The model’s prior training overrides the injected source

The result: plausible answers that contradict your ground truth.

This is especially dangerous in compliance or legal workflows, where hallucinating within context is worse than not answering at all.

RAG isn’t just retrieval. It’s retrieval plus constraint.
Without both, you’re just helping the model hallucinate better.

3. Semantic Mismatch Between Query and Chunk

This one’s harder to spot.

Let’s say your index includes:

“Users are entitled to a full refund within 7 days.”

But the user asks:

“Can I cancel and get my money back?”

If your embeddings or retrieval method can’t connect “cancel” → “refund”, or “money back” → “entitled”,

→ That chunk won’t surface.

This is where language gaps, jargon, and undertrained embedding models create false negatives.

It’s not about bad data. It’s about a missed semantic bridge.

4. Information is Split Across Chunks

Sometimes the answer isn’t in a single chunk — it lives across two or three.

E.g.:

Chunk A says: “Refunds available for digital products.”
Chunk B says: “Refunds must be requested within 7 days.”

Both are required for a full answer.

But standard (naive) RAG systems don’t do multi-hop synthesis well, especially if chunk order isn’t preserved or coherence is lost in truncation.

Unless you’ve designed your chunking and scoring to preserve continuity,

→ You get partial answers, or worse, confident contradictions.

5. The Right Data Isn’t in the Index at All

This is a classic ingestion blind spot.

Sometimes, the most relevant information:

Lives in a format you didn’t ingest (e.g. image-based PDFs, buried tables, raw HTML)
Was missed due to bad parsing
Was updated in the source system, but your index is stale

RAG can’t retrieve what isn’t there.

That’s why index observability and refresh strategies are part of any serious RAG system — not just “nice to have.”

RAG Doesn’t Fail Loudly — It Fails Silently

And that’s the danger.

Unlike traditional software bugs, most RAG failures look like they worked:

The model gives an answer
It’s grammatically correct
It sounds plausible

But it’s wrong. And it’s grounded in the wrong data.

So you need to build for:

Retrieval monitoring
Prompt observability
Failure evaluation beyond accuracy metrics

Because in RAG, silence isn’t success.

It might be a confident lie — and that’s the hardest kind to debug.

Beyond Just Docs — Smart RAG Systems

At this point, it’s clear:

Naive RAG breaks in all the places that matter.

It retrieves the wrong thing. Or it retrieves the right thing and still gives the wrong answer.

It doesn’t know when to say “I don’t know.”

It doesn’t improve over time.

But that’s not the limit of what RAG can be.

Let’s talk about how smart systems fix it.

The Mental Shift

Most people think RAG = “vector search over some docs.”

But in production, RAG becomes an optimisation problem:

How do we consistently retrieve the most useful, most answerable context — while minimizing noise, cost, and latency?

Smart RAG systems don’t just retrieve.

They rank, filter, score, adapt, and learn.

Let’s walk through the upgrades that transform fragile RAG into robust retrieval-first infrastructure.

1. Hybrid Retrieval Fixes Semantic Blind Spots

Problem it solves:

Vectors alone miss exact matches, rare entities, and domain-specific tokens.

Fix:

Use both:

Dense retrieval (for semantic similarity)
Sparse retrieval (BM25 or keyword scoring for exact matches)

This immediately improves:

Retrieval precision on short, vague queries
Performance on identifiers (like error codes, SKUs, names)

Hybrid retrieval makes your system semantic and literal — exactly when it matters.

2. Re-ranking Models Pick the Right Top-K

Problem it solves:

Top-K based on cosine similarity ≠ most useful chunks.

Fix:

After initial retrieval, use a cross-encoder (like Cohere Rerank or BGE-Reranker) to rescore the candidates for:

Factual match
Answerability
Coverage

This reranking happens before prompt injection, and it massively reduces hallucinations.

Smart RAG doesn't just ask “what’s similar?”
It asks: “what actually answers the question?”

3. Metadata Filtering Reduces Retrieval Scope

Problem it solves:

Retrieving from everything leads to irrelevant or outdated context.

Fix:

Leverage metadata added during ingestion:

product_version = "2.1"
source = "legal"
created_at >= last_month

This constrains search to what’s actually relevant before you rank.

Your data isn’t flat. Treating it like it is kills precision.

4. Semantic Chunking Improves Retrieval Quality

Problem it solves:

Poor chunking creates semantically meaningless embeddings.

Fix:

Don’t split blindly every N tokens. Instead:

Chunk by sections, paragraphs, headings
Use sentence boundary detection
Keep context units (e.g., question+answer, method+docstring) together

Optional: Add contextual overlap and parent references for coherence.

You’re not embedding text. You’re embedding meaning units. Chunk accordingly.

5. Retrieval Feedback Loops Make the System Learn

Problem it solves:

You don’t know which chunks actually helped the LLM answer well.

Fix:

Track:

Which chunks were injected
Whether the response was accepted/clicked
Retrieval-to-output overlap (did the model use the retrieved info?)

Then:

Up-rank useful chunks over time
Down-rank misleading ones
Use hard negatives to fine-tune better embeddings

RAG isn’t static. If it doesn’t learn, it decays.

Advanced Patterns (When You’re Ready to Scale)

Once your retrieval foundation is solid, these advanced patterns can unlock more robust and intelligent behaviour:

Multi-Query RAG
→ Reformulates a single user query into multiple semantic variants, retrieves for each, and merges results.
Boosts recall for vague or underspecified questions, especially in sparse or noisy corpora.
Knowledge Graph-Augmented RAG
→ Uses a graph of entities and relationships to guide retrieval.
Enables structured reasoning, cross-doc linking, and retrieval based on relationships, not just raw text.
Multi-Agent RAG
→ Chains specialised agents (retrievers, verifiers, planners) to dynamically reformulate queries, rerank chunks, and validate answers.
Useful for multi-hop queries, tool integration, and dynamic workflows.

These are not required to get started — but they represent the future of RAG at scale.

Evaluating RAG Systems — The RAG Triad Framework

Building a RAG system is one challenge.

Knowing if it works — that’s a different problem.

And here's where most teams go wrong:

They evaluate the model’s answer, not the retrieval system behind it.

That’s like debugging a recommendation engine by checking if the “Buy” button was clicked, without knowing what products were shown.

RAG needs its own evaluation lens.

And that’s where the RAG Triad Framework comes in.

The RAG Triad

To truly evaluate a RAG pipeline, you need to assess three interdependent components:

Retrieval Quality
Faithfulness
Answer Quality

These are distinct, and failures in one don’t always show up in the others.

Let’s break them down.

1. Retrieval Quality

“Did we fetch the right context in the first place?”

The first test of any RAG system is whether the retriever surfaced the relevant, sufficient, and necessary information for the query.

Key metrics:

Recall@k: Was the gold/relevant chunk in the top-k retrieved?
Precision@k: Of the chunks retrieved, how many were actually relevant?
Retrieval overlap: Did the model actually use any of what was retrieved?

How to measure:

Human-labelled gold chunks (if available)
Embedding-based similarity to expected answer
Heuristic filters like answer-span matching

Without good retrieval, the model is guessing.

Bad retrieval = good prompt = bad answer.

2. Faithfulness

“Did the model stay true to the retrieved context?”

RAG was supposed to fix hallucination.

But if your model mixes retrieved content with pretrained priors, it’s just inventing grounded-sounding fiction.

Key metrics:

Context overlap: Does the answer contain text or meaning from retrieved chunks?
Faithfulness score (via QA or entailment models): Does the answer only use supported info?
Contradiction flags: Does the output contradict any retrieved source?

How to measure:

Use natural language inference (NLI) models to compare output vs. context
Extract claims from outputs and check if they are supported by any chunk
Human eval with source checking

This is not about correctness — it’s about alignment with the context.

A factually wrong answer that used the context correctly is a retrieval issue.
A factually wrong answer that ignored context is a generation issue.

3. Answer Quality

“Was the final answer useful, clear, and complete?”

This is the traditional metric most teams already focus on:

Fluency
Task completion
Helpfulness
Hallucination (at the output level)

Key metrics:

ROUGE/BLEU/METEOR (if ground truth exists)
Judgmental scores (e.g. helpful/unhelpful from human labelers)
LLM-as-a-judge methods (scoring based on criteria)

But answer quality without retrieval traceability is deceptive:

A “great” answer from hallucinated data is a future failure waiting to happen.

Why the Triad Matters

Evaluating just the final answer is like checking the tip of an iceberg.

A fluent answer that sounds right could still be completely wrong, because either:

The wrong chunk was retrieved (retrieval failure)
The model ignored the context and guessed (faithfulness failure)
Or the answer, while correct, was vague or incomplete (answer quality failure)

Here’s the mental model:

If retrieval fails, the model never had the right info to begin with. If faithfulness fails, the model had it — but didn’t use it. If answer quality fails, the system did everything right — but failed to communicate.

You don’t fix these by “prompting better” or “changing the model.”

You fix them by diagnosing the exact stage that broke, and improving that layer of the system.

That’s what the RAG Triad lets you do:

treat RAG as a system, not a monolith.

Congratulations you’re no longer a Noob. You’re a Builder.

You didn’t just read about RAG.

You rebuilt how you think about it, layer by layer.

You now know it’s not a trick to bolt onto a chatbot.

It’s an architecture that lives or dies by how well you:

Shape the data
Design the retrieval
Control the grounding
Monitor the system
Evaluate the full pipeline

You’ve moved past the “let’s vectorise our docs” stage.

You’re now thinking like a retrieval architect — with clarity on where things break, and how to build them to last.

And that’s the real unlock.

Because in a world rushing to plug in LLMs, those who master retrieval will quietly run circles around everyone else.

Welcome to the deep end.

Stay dangerous.

Why Every AI Builder Needs to Understand MCP

Shivani Virdi — Wed, 30 Apr 2025 16:58:22 GMT

LLMs are powerful.
But they weren’t designed to operate in real-world environments on their own.

They can generate text.
But they don’t know how to access tools, query APIs, fetch files, or maintain long-term memory—unless you manually wire those systems together.

And that’s the problem.

Every time you want your model to do something useful, you:

Build another wrapper
Hardcode another integration
Stitch another brittle prompt flow

There’s no shared interface between the model and the systems it needs to work with.

This doesn’t scale.

Subscribe now

That’s where MCP (Model Context Protocol) comes in.

In this issue, we’ll break down:

What makes today’s LLM integrations fragile and repetitive
How MCP (Model Context Protocol) introduces clean separation between models and tools
How it works under the hood
And how to use it to build a fully modular Second Brain with Claude

Let’s dive in.

The Application Layer Reality

Most LLMs today are built for one thing:
text prediction inside a context window.

That’s not the same as system behavior.

Real applications need more than language—they need structure.

When you step into the application layer, the model is just one piece.
To build a working system, you need to:

Access internal documents and file systems
Pull live information from the web
Query databases and APIs
Chain steps across multiple turns
Maintain memory over time
React to inputs and return usable outputs

But LLMs can’t do any of that out of the box.
Not without external scaffolding.

And right now, that scaffolding is usually handwritten:

You bolt on a wrapper
Patch in a prompt
Hardwire tool logic into your flow

Let’s make this concrete.

Example: A Research Assistant

Say you’re building a Research Assistant with an LLM.

You want it to:

Search internal knowledge bases
Summarise findings from the web
Pull structured insights from APIs
Organise output into clean project notes

But to make that work, you’ll need to:

Write wrappers for file access
Plug into search endpoints
Build prompt flows to chain queries and outputs
Manually inject context between each step

The LLM isn’t acting like a system.
You are, by glueing things together behind the scenes.

It’s not composable.
It’s not scalable.
And it's definitely not reusable.

And up until now?
There was no standard way to do it.

Everyone built their own bridges—fragile, bespoke, duct-taped into existence.

Every new tool meant a new integration.
Every platform shift meant rewriting half your code.
Every update felt like starting from scratch.

The result?

Brittle architectures that break under their own weight.

And that’s the hidden bottleneck that stops LLMs from scaling into true modular systems.
It’s the bottleneck that MCP was designed to solve.

The Fragile State of LLM Applications Today

At first, the hacks seem fine.

You hardcode a wrapper here, a prompt flow there.
You scrape together a way to get the model working with your file system, a search API, and maybe a memory loop.

And for a while, it holds.

But then the system grows.

You add a new capability.
A new tool.
A new workflow.

And suddenly, everything feels fragile.

Because every piece is tightly coupled:

Your API logic is baked into your prompts
Your memory patch relies on exact formatting
Your file reader is wired directly into the model context

Nothing is modular.
Everything is handcrafted.

You tweak one thing, and something else breaks.

You try to reuse logic in a new assistant, but the wrappers were written for one use case only.

You upgrade your model, and the entire flow has to be debugged from scratch.

The system doesn’t scale.
It mutates.
And you’re stuck holding it together.

Why This Matters

This isn’t just frustrating.
It’s the core bottleneck behind most LLM systems today.

You’re not failing because the model can’t reason.
You’re failing because there’s no standard interface between your model and the real world.

Up until now, you had two options:

Build everything custom
Or limit what your app could actually do

Neither is sustainable.

Scaling Breaks Everything

Let’s say your Research Assistant is live.
It pulls documents, searches the web, queries APIs—you wired it all up manually.

Now your team wants to expand.

They ask for:

A Sales Assistant that pulls customer data from the same database
A Project Manager bot that uses the same search functionality
A Marketing agent that also pulls docs and formats summaries

Each of these needs:

File system access
Internet search
Access to structured APIs

In a perfect world, you’d just plug them into what you already built.

But that’s not what happens.

You rewrap the same logic again and again:

The Sales bot gets a new DB wrapper
The PM bot gets its own search integration
Each assistant has its own custom prompt chain, tied to a slightly different tool flow

What started as one app using three tools becomes three apps using the same tools—but with three separate integrations each.

M apps × N tools = M×N custom connections

Every line of integration is fragile.
Every wrapper is bespoke.
Every update means rewriting across the stack.

Instead of scaling, you multiply chaos.

The result?

Your simple system has become a house of cards.

This is the point most AI systems start breaking down, not because the model can’t handle the work,
But because the surrounding architecture wasn’t built to scale.

There’s no shared layer.
No reusable interface.
No clean way to connect multiple apps to the same set of capabilities.

Up until now.

MCP: The Missing Infrastructure Layer

If you’ve worked on any non-trivial LLM system, you’ve probably felt it:

The model isn’t the hard part. The integration is.

Same tool. New integration. Every time.

That’s the real cost of building without a standard interface.

MCP (Model Context Protocol) exists to fix this.

It’s an open protocol that defines how models can interact with external capabilities—
like tools, data, prompts, and memory—using structured, discoverable interfaces.

No wrappers.
No one-off glue logic.
No bespoke JSON hacks just to reuse the same capability twice.

With MCP:

You expose a capability once—as a resource, tool, or prompt
Any MCP-compatible model can discover and use it
You stop rewriting integration code across agents, workflows, or frontends

It’s not a framework.
It’s the interface layer that’s been missing from LLM systems.

How MCP Works

Modern LLM systems fail when the model has to know too much about how tools are built.

LLMs aren’t meant to know how tools work.
And tools shouldn’t care what model is calling them.

MCP enforces that separation.

It introduces a simple but powerful structure:
A clean boundary between where the model runs, where protocol logic lives, and where external capabilities are exposed.

The Core Actors: Host, Client, Server

Here’s the basic architecture:

Host
The application running the model—like Claude Desktop, an IDE plugin, or an agent runtime.
It provides the user experience and handles orchestration.
Client
A protocol engine that lives inside the host.
It connects to MCP-compatible servers, sends requests, handles responses, and manages the message lifecycle.
Server
A standalone process that exposes capabilities—like tools, resources, memory, or prompt templates—via typed interfaces over MCP.

Each actor has a single responsibility:

✅ Hosts don’t need to know how tools are implemented
✅ Servers don’t need to know how their output is displayed
✅ Clients translate between the two reliably and predictably

They speak the same protocol: JSON-RPC over a flexible transport layer.

This structure is what makes MCP scalable:

One model. Many servers. Clean boundaries. No shared assumptions.

How They Communicate: The Message Flow

Every interaction between an MCP Client and Server follows the same protocol:

Structured, typed messages using JSON-RPC 2.0.

Core Message Types

MCP supports four core message types:

Request — Sent by the client to ask the server to perform an action
Response — Sent by the server when a request succeeds, returning a result
Error — Sent by the server when a request fails
Includes a code, a message, and optional data (for structured debugging)
Notification — One-way messages that don’t expect a response
(e.g., server announces a tool list update)

These messages are typed, versioned, and schema-validatable, making them consistent and extensible across tools and clients.

The Lifecycle

An MCP session follows a predictable sequence:

1. Initialization

The client sends an initialize request, declaring its protocol version and capabilities
The server responds with its own capabilities
The client confirms readiness with an initialized notification

2. Communication

The client issues typed requests to the server (like callTool, readResource)
The server replies with results or structured errors
Either side can send notifications to announce updates or state changes

3. Shutdown

Either the client or the server can initiate a clean shutdown via shutdown and exit

All messages are logged and type-checked, making the protocol reliable for large-scale applications and easy to debug.

How Messages Move: The Transport Layer

While the message structure is always JSON-RPC, MCP supports flexible transport options:

stdio — Fast, local, ideal for small or embedded tools
http — Stateless, commonly used for hosted deployments
sse (Server-Sent Events) — Persistent streams for long-lived processes
Custom transports — You can define your own, as long as JSON-RPC is preserved

Whatever the transport, the message contract stays the same.
That’s what makes MCP modular—infrastructure can change, but the interface doesn’t.

What Servers Can Expose

Every MCP server can expose one or more structured capabilities to the client.
These are called primitives, and each follows a well-defined schema and discovery flow.

Resources

Expose structured or dynamic content—like file systems, APIs, or generated documents.

Clients can:

List available resources
Read or stream their content
Subscribe to updates via notifications

Prompts

Serve reusable, parameterised prompt templates.

Clients can:

List available prompts
Preview how they behave
Invoke them with specific input

This allows you to standardise reasoning steps, formatting, or task flows.

Tools

Expose executable actions—each defined with a JSON schema for input and output.

Clients can:

Discover tool metadata
Call tools with structured arguments
Receive typed, predictable results

This makes functions callable like APIs—without custom glue.

Sampling

Let the server initiate model completions using the host’s LLM.

This is used when the server needs to ask the model a question, for planning, intermediate reasoning, or delegated generation.

It supports:

Structured prompts sent from the server
Completions returned from the client-side LLM
Use in multi-agent chains or feedback loops

Together, these primitives allow servers to offer rich, typed capabilities—
and allow clients to discover, use, and compose them without writing bespoke logic per app.

From M×N to Modularity

Earlier, we saw what breaks:

M apps × N tools = M×N custom integrations

Every new assistant.
Every shared capability.
A new wrapper. A new prompt chain. A new point of failure.

Now let’s look at what happens with MCP.

Say you’re building the same three assistants:

A Research Assistant that needs to access internal documents
A Sales Assistant that needs to query customer data via API
A Project Manager that needs to fetch live market insights from the web

Each assistant depends on:

A document store
A structured internal API
A search interface

With MCP, those capabilities are exposed once, as servers.

Each tool or resource is packaged as an MCP server, with a typed, discoverable interface:

A Filesystem server
A Web Search server
An API server

Each assistant connects to those servers through a dedicated MCP client.

Clients maintain a 1:1 connection to each server
The host (e.g. Claude Desktop) manages these connections per server
Assistants issue typed requests through the appropriate client
Servers respond with structured results—no custom glue required

There’s no duplicated wiring.
No per-app integration logic.
No more M×N explosion.

You don’t wire logic into every app.
You expose capabilities—and let apps connect to them.

Why It Matters

This is how you solve the real scaling bottleneck:

Capabilities are written once
Interfaces are reused across assistants
Behavior is separated from infrastructure

With MCP:

Models use tools they weren’t handcoded for
Tools don’t depend on prompts or app logic
Assistants become orchestrators—not wrappers

You stop stitching things together.
You start composing real systems.

Building Your Second Brain with Claude and MCP

Understanding MCP is one thing.
Building with it is where the design comes alive.

And there’s no better place to start than setting up a modular Second Brain—
an AI system that can search, retrieve, reason, and grow without glue code or prompt hacks.

The Setup: Claude + MCP Servers

At the heart of this setup:

Claude Desktop acts as the host and client
(Already supports MCP natively)
MCP Servers are lightweight programs you run locally
(Each one exposes tools, resources, prompts, or memory as structured capabilities)

When Claude connects to these servers, it can use them dynamically—
No brittle wrappers, no hardwired integrations.

Minimal Second Brain Stack

✅ Filesystem Server
Enables Claude to read and write files from a designated directory
Secure, modular file access
✅ Web Search Server
Enables real-time internet search using Brave’s API
Private, dynamic querying
✅ Memory Server
Provides persistent memory across sessions using a graph-backed store
Recall facts, retain context
✅ Sequential Thinking Server
Enables multi-step reasoning and internal thought chaining
Execute multi-turn tasks naturally
✅ Notion Server
Enables interaction with your Notion workspace
Fetch, update, and manage pages + databases

Quick How-To: Setting it Up

Install Claude Desktop
(If you haven’t already)
Edit the Config
Inside
```
Settings → Developer → Edit Config 
```
Paste the following:

{
    "mcpServers": {
      "filesystem": {
        "command": "npx",
        "args": [
          "-y",
          "@modelcontextprotocol/server-filesystem",
          "/Users//ClaudeFileSystem"
        ]
      },
      "brave-search": {
        "command": "npx",
        "args": [
            "-y",
            "@modelcontextprotocol/server-brave-search"
        ],
        "env": {
            "BRAVE_API_KEY": ""
        }
      },
      "memory": {
        "command": "npx",
        "args": [
            "-y",
            "@modelcontextprotocol/server-memory"
        ]
      },
      "sequential-thinking": {
        "command": "npx",
        "args": [
            "-y",
            "@modelcontextprotocol/server-sequential-thinking"
        ]
      },
      "notion": {
        "command": "npx",
        "args": ["-y", "@suekou/mcp-notion-server"],
        "env": {
            "NOTION_API_TOKEN": ""
        }
      }
    }
}

✅ This tells Claude exactly how to launch and manage each server.

(Reminder: replace , , and with your real values.)

Restart Claude Desktop
It will automatically detect and connect to the servers.

✅ Done.
You now have a modular Second Brain running locally.

Why This Setup Matters

Each server adds one clean capability—
without coupling, without rewriting, without fragile dependencies.

You can:

Add tools by spinning up new servers
Swap implementations without changing app logic
Extend workflows without rebuilding everything from scratch

Today, it’s a Research Assistant.
Tomorrow, it’s a Knowledge OS.
The day after, it’s a fully modular agentic workspace.

All of it powered by shared infrastructure—not hardcoded glue.

From Brittle Hacks to Modular Systems

Building AI systems isn’t about throwing prompts at a model.
It’s about creating structure that scales.

The old way?

Manual wrappers
Prompt spaghetti
One-off tool integrations

Every new capability came with a new risk.

MCP changes that.

It’s not just a way to connect LLMs to tools.
It’s a protocol layer—
A boundary between models and the world.

Hosts focus on orchestration
Clients speak the protocol
Servers expose typed, reusable capabilities

Whether you’re setting up a Second Brain or architecting multi-agent ecosystems—
MCP gives you the system layer that LLMs were always missing.

And in the future of AI systems?

Modularity isn’t a preference. It’s the requirement.

References and Further Reading

How GPTs Learn to Be Helpful

Shivani Virdi — Wed, 23 Apr 2025 16:22:18 GMT

Introduction

This issue is a continuation of Part 1.
If you haven’t read it yet, start here

If you’ve ever wondered how ChatGPT became ChatGPT—the assistant that can explain quantum mechanics and gracefully decline weird requests—the answer lies in what happens after pertaining.

The raw model underneath is powerful, yes.
But it’s not polite. Not helpful. Not safe.
It doesn’t know when to say “I don’t know” or how to actually assist.

That’s because pretraining gives you a brain, a lossy internet simulator trained to predict the next token.
Post-training is what gives it a personality. A purpose. A grip on behaviour.

This issue dives into the second stage of model development:

How supervised fine-tuning teaches helpfulness
How reinforcement learning reshapes behaviour
Why hallucinations still happen—and what that reveals
And how these layers set the stage for building AI, you can actually use

If Issue 1 was about how GPTs are born,
This one is about how they grow up.

Subscribe now

Post-Training

Supervised Fine-Tuning (SFT)

From Internet Simulator to Instruction-Following Assistant

When people interact with ChatGPT, they’re often surprised by how helpful, conversational—even human—it feels. But that behaviour isn’t a natural byproduct of pretraining.
It’s taught after the fact.

So what is Supervised Fine-Tuning (SFT), really?

Let’s revisit the mental model.

The base model is just an internet simulator—a system trained to predict the next token across trillions of examples from web pages, books, forums, and Wikipedia.

It has raw language ability, but no sense of how to be useful.
It doesn’t know when to say “I don’t know,” how to follow instructions, or even what it means to answer a question.

It’s a brain with no behavior.

SFT is the first step in post-training, where we take that raw capability and teach it how to act like an assistant.

It’s trained on thousands (sometimes millions) of human-written conversations, structured like this:

Human: [Instruction or question] 
Assistant: [Ideal response]

This is where the model learns to:

Follow instructions
Be helpful and polite
Refuse unsafe requests
Admit uncertainty
Show reasoning steps

Without this step, the model would just remix plausible-sounding text. It wouldn't behave.

Where does this dataset come from?

In the early days of supervised fine-tuning, humans wrote everything from scratch.

Labellers followed detailed guidelines on how to be helpful, truthful, and non-biased, often working from instruction manuals hundreds of pages long. They would:

Take prompts like “What are some startup ideas for Gen Z?”
Write out the ideal assistant response, word for word
Repeat this process across thousands of diverse scenarios

While this gave us high-quality data, it was slow, expensive, and hard to scale.

As models improved, a smarter approach emerged:
Instead of writing everything from scratch, we generated responses using a strong model (like GPT-4) and had humans review and filter them.

This process generated synthetic SFT data:

A powerful model creates multiple responses
Humans rank or pick the best ones
Only top completions are added to the training set for smaller models

Synthetic SFT is faster, cheaper, and scalable, as long as quality is controlled.

Without careful monitoring, the assistant could start imitating the bad habits of the model that generated its data.

Thus, synthetic SFT still requires:

Strong evaluation filters
Spot checks for hallucinations and unsafe content
Clear labelling instructions to preserve alignment

It’s not fully automated—it’s just a more efficient way to scale human preferences into usable training data.

Why do hallucinations still happen?

Supervised fine-tuning teaches models to behave like assistants, follow instructions, sound fluent, be clear and confident.

But here’s the catch:

The model isn’t verifying facts. It’s still just predicting the next token.

So when you ask:

“Who won the Nobel Prize in Physics in 2023?”

It doesn’t look it up.
It searches its training patterns for what might come next—and if it hasn’t seen that fact clearly and repeatedly, it guesses.

And it does so confidently.

Why? Because SFT trains on well-structured, assertive responses.
The model learns not just what to say, but how to say it.

So even when it doesn’t know, it defaults to the tone it was rewarded for:

Fluent. Certain. Complete.

That’s what makes hallucinations dangerous.
They don’t sound unsure—they sound right, even when they’re not.

So, how do we reduce hallucinations?

Better SFT data
→ More factual, grounded, high-quality examples reduce the need to guess.
Refusal training
→ Teach the model to say “I don’t know” or “I can’t answer that” through labelled examples. Without them, it assumes confidence is always rewarded.
External tools
→ Let the model call a search engine or calculator. If it doesn’t know, it can reach instead of fabricate.

Because the real problem isn’t just a lack of facts.

It’s that the model’s been trained to act like it knows, even when it doesn’t.

And if you want the truth, that behaviour has to be retrained or redirected to something more reliable.

What about self-awareness?

Here’s a fun surprise:

LLMs don’t actually know who they are.

Ask a base model, “What model are you?”
Unless it’s seen that exact phrasing during training, it has no reason to say,

“I’m ChatGPT, based on GPT-4.”

Why?

Because there’s no internal identity.
The model’s not reflecting on its architecture—it’s just predicting the next likely token.

To make it sound self-aware, you have to train or tell it to act that way.

Two ways to do that:

Supervised Fine-Tuning
Include Q&A like:
```
Human: What model are you?
Assistant: I am ChatGPT, trained by OpenAI.
```
→ The model learns this identity as a pattern.
System Prompts
Inject context like:
```
“You are ChatGPT, based on GPT-4.”
```
→ Shapes behaviour at runtime without retraining.

These two techniques reflect two kinds of memory:

Parameter memory → baked into the model weights
Context memory → fed dynamically via the prompt

Only one of them, context memory, can you change post-deployment.

So, when a model “knows” who it is

It’s not awareness. It’s pattern repetition.

Why LLMs Struggle with Math (and Counting)

This is where the “token predictor” mental model hits its limit.

LLMs don’t solve problems.
They generate text that looks like a solution, one token at a time.

So when you ask:

Emily buys 3 apples and 2 oranges. Each orange costs $2. The total is $13. What’s the cost of the apples?

You want:

2 oranges = $4 → 13 – 4 = 9 → 9 ÷ 3 = $3 each.

But unless it’s seen that exact reasoning pattern before, it might:

Jump to a guess like “$9”
Mix up the logic midway
Or confidently output something totally wrong

Why?
Because math has no redundancy—one wrong token and the whole answer collapses.
And remember:

The model isn’t calculating. It’s completing a pattern.

Why It Fails at Counting Too

Now try:

How many R’s in 'strawberry'?

Simple? Not for an LLM.

It doesn’t see characters like we do.
It sees tokens—maybe “straw” and “berry,” maybe merged.
So, asking it to count letters? It guesses—and often misses.

This isn’t a reasoning error.
It’s a representation problem. It was never trained for this.

So What’s the Fix?

Don’t make the model pretend. Let it call tools.

A good system detects:

“This needs string ops.”

Then routes it to a Python interpreter:

"strawberry".lower().count("r") → 3

The model didn’t “know” the answer.
It delegated and got it right.

That’s modern LLM architecture in a nutshell:

Know when to predict, and when to execute.

TL;DR — Mental Model Recap

LLMs don’t calculate, they predict
Math breaks because prediction ≠ computation
Counting fails because tokens ≠ characters
The fix isn’t “train harder”, it’s tool use

SFT teaches helpfulness.
But if you want strategy discovery and reasoning, you need more than imitation.

Reinforcement Learning (RL)

Teaching the Model What Works—Not Just What to Mimic

Supervised Fine-Tuning (SFT) helps the model behave like an assistant.
It teaches the “how” of being helpful—polite responses, refusals, and multi-turn structure.

But at the end of the day, it’s still mimicry.
The model is learning to copy what humans wrote, not necessarily what works best for the model.

That’s where Reinforcement Learning comes in.

Why SFT Hits a Wall

There are two core problems with stopping at SFT:

Human answers ≠ optimal for LLMs
An answer that feels intuitive to us may be inefficient or awkward for the model to generate.
LLMs don’t reason like us—they pattern-match across tokens.
Multiple answers can be right.
SFT locks the model into reproducing a single (or similar) solution. However, in many cases, there are several valid ways to answer a prompt, and SFT doesn’t let the model explore them.

We need a way to let the model try different approaches and learn which ones lead to success.
That’s what RL enables: exploration + reward.

The Mental Model: School, but Smarter

Think of the full training pipeline like education:

Pretraining is like reading every book in the library.
The model absorbs a massive amount of language and knowledge—but hasn’t practiced anything.
SFT is reading worked-out solutions.
The model sees how experts would respond and learns to imitate that structure.
Reinforcement Learning is solving problems without a solution manual.
The model tries, gets feedback, and adjusts.
Over time, it learns strategies that work, not just what was shown.

That’s the core shift:
From copying to discovering.

And once the model starts doing that, it unlocks a new layer of reasoning power.

What Actually Happens During RL

Trying, Failing, and Reinforcing What Works

Let’s make it concrete.

Say the prompt is:

“Emily buys 3 apples and 2 oranges. Each orange is $2. Total cost is $13. What’s the cost of apples?”

A supervised model might just output:

“Each apple costs $3.”

Because that’s the answer it saw in training.
But it never had to figure it out for itself.

In reinforcement learning, the model isn’t shown a “correct” answer—it has to try.

It generates multiple responses
Some show step-by-step reasoning
Others skip straight to an answer
Each response is scored based on whether it gets to the correct result

In domains like math, this is easy:
If the final answer is correct → reward it
If not, → penalise it

The model is learning which token sequences tend to produce the right outcome, not which steps to copy.

Over time, the model discovers response patterns that consistently lead to success, even if it wasn’t explicitly taught during SFT.

That’s the core value of RL:

It enables the model to practice, evaluate, and refine—on its own terms.

Why This Unlocks “Thinking”

The Model Starts to Reason in Tokens

As reinforcement learning progresses, something unexpected begins to emerge.

The model doesn’t just get more accurate.
It gets more deliberate.

You start seeing:

Longer responses
Step-by-step breakdowns
Self-corrections and retries

In DeepSeek’s experiments, they found a strong correlation between answer length and accuracy, but not because the model was rambling.

The longer answers showed reasoning:

Breaking down problems
Evaluating intermediate steps
Backtracking when things didn’t add up

And here’s the catch:

No one explicitly told the model to do that.

This behaviour, what we now call chain-of-thought reasoning, emerged because the model discovered something through trial and reward:

Thinking in tokens leads to better outcomes.

That’s the power of RL:
It doesn’t just reinforce answers; it helps the model uncover how to think, based on how it computes.

Not human-style logic.
LLM-native strategies—discovered from within.

From Imitation to Mastery

We’ve seen this before, with AlphaGo.

It began by mimicking expert players, just like an LLM trained via SFT.
But imitation only took it so far.

It plateaued. Copying human moves couldn’t push it further.

To go beyond that ceiling, researchers applied reinforcement learning.
AlphaGo started playing against itself, not learning from human games, but from trial and error, guided by one reward:

Did this lead to a win?

That’s when the breakthroughs happened.

One move—“Move 37”—looked like a mistake.
It wasn’t. It was game-changing.

RL enabled the system to discover strategies that weren’t in the data.

And this is the promise of RL for LLMs, too:

Not just better answers—emergent behaviour.
Not just imitation—discovery.

But for RL to work, there needs to be a clear reward signal.

And in real-world tasks, “better” can’t always be measured in wins or losses.

That’s where we go next:
RLHF—Reinforcement Learning from Human Feedback.
Where humans define what success looks like.

Reinforcement Learning with Human Feedback (RLHF)

Aligning the Model When “Better” Can’t Be Programmed

So far, the model has learned:

Language patterns (pretraining)
Assistant-like behaviour (SFT)
Strategies that work (RL)

But all of that depends on clear rewards—something you don’t get when you want responses to be funnier, more helpful, or more human.

You recognize better when you see it. But you can’t code it.

That’s the gap RLHF fills.

The Problem: No Computable Reward

Standard RL relies on hard-coded signals, like accuracy or win/loss.

But in subjective tasks, you can’t write a rule for:

“Was this summary easy to follow?”
“Did this sound human?”
“Was this tone too robotic?”

These are judgment calls—only humans can decide what “better” looks like.

Why RLHF Works

RLHF lets the model learn from human preferences.

Here’s how:

Humans rank multiple outputs for the same prompt
These rankings become training data
A reward model is trained to score future responses like a human would
The LLM uses this model to optimise its own behaviour

It’s scalable. Humans guide once. The reward model handles the rest.

The LLM never sees the human, only a trained proxy of their taste.

That’s what lets RLHF work for subjective tasks:

“Be more helpful”
“Sound less robotic”
“Explain this more clearly”

Why Ranking Beats Writing

In SFT, humans write perfect answers. In RLHF, they just pick what’s better.
It’s faster, cheaper, and gives finer-grained feedback, because even imperfect responses show preference.

You don’t need gold data. You need signal.

The Tradeoffs: RLHF Isn’t Perfect

1. The Reward Model is an Approximation

It mimics preferences, but it’s still a model:

Can overfit to surface cues
May miss nuance
Struggles outside its training domain

If it gets the “better” signal wrong, the LLM will optimise for the wrong thing.

2. The LLM Can Game the Signal

LLMS are strong optimisers. They’ll learn to:

Repeat filler phrases to fake helpfulness
Exaggerate tone (“Absolutely! Delighted to help!”)
Output confident guesses that sound plausible but aren’t

This is reward hacking, when the model optimises the proxy, not the goal.

In some cases, models even produce adversarial outputs: completions that exploit reward model blind spots but feel totally off to humans.

The model isn’t learning what you want.It’s learning what the reward model thinks you want.

Why RLHF Must Be Cut Off

RLHF isn’t like AlphaGo-style RL. More training doesn’t always help.

Why?

Because the model isn’t optimizing for truth—it’s chasing a score made by another model.

That’s why in practice:

RLHF is run for a limited number of steps
The reward model is retrained or audited
Final checkpoints are reviewed by humans, not just scores

Push it too far, and the model learns how to game the metric—not align with intent.

So, What’s the Real Difference from RL?

Let’s draw the line clearly.

Standard RL:

Uses a hard-coded reward (e.g. win/loss, accuracy)
Works best in objective domains like games or math
The reward is precise and verifiable

RLHF:

Uses a learned reward model based on human preferences
Applies to subjective tasks like helpfulness, tone, or clarity
The reward is approximate, not directly measurable

In RL, the signal is crystal clear.
In RLHF, it’s human-aligned, but filtered through approximation.

So What Are You Really Talking To?

When you chat with GPT, you’re not talking to a mind.

You’re talking to a model that has:

Compressed much of the internet into its parameters
Learned assistant behaviour from curated examples
Discovered reasoning patterns through RL
Aligned itself with human judgment through preference modelling

It’s not magic.
It’s layers of optimisation—stacked, fine-tuned, and trained to predict your next token.

And now you know exactly how that stack was built.

LLMS in the Application Layer

How Token Prediction Turns Into Real Capability

LLMs complete text, but apps need more than completions.

Here’s how we make them useful:

Prompting → Shapes the model’s behaviour by structuring the input
RAG → Injects external knowledge that the model never trained on
Agents → Add memory, tools, and planning to go beyond one-shot replies

This is how LLMs move from chat interfaces to actual systems that get work done.

Closing Thoughts

LLMs aren’t magic—they’re layered systems.

They learn language by prediction.
They learn behaviour by imitation.
They learn strategy by reward.
And they align through preference.

But they don’t understand. They don’t reason.
They complete patterns with style, not certainty.

The real shift?

Treat them as engines to design around, not minds to build on.

Because once you do that—
You stop wrestling the model
and start building systems that actually work.

References & Further Reading

How GPTs Are Born: Internet Feeding, Token by Token

Shivani Virdi — Wed, 16 Apr 2025 14:13:19 GMT

Introduction

LLMs are everywhere—we use them, we beg them for answers, and we curse them when they mess up.

They feel almost magical—solving Olympiad-level math, writing Ph.d.-grade papers—and yet, they can’t always tell you which number is bigger: 9.11 or 9.9.

That kind of split personality can be maddening.

So how do you actually build with these systems? What makes them brilliant in some tasks, and bafflingly bad in others?

The answers aren't mystical. They come from understanding what an LLM really is—and what goes into creating the GPTs of the world.

Once you get that, you’ll start seeing patterns.

You’ll know why they fail when they do. You’ll start anticipating their quirks. And more importantly, you’ll begin to develop the kind of mental models that let you wield LLMs effectively—in daily use, and in the products you build.

Subscribe now

What Is an LLM, Really?

An LLM doesn’t memorise facts — it compresses patterns in how we speak, write, and reason. It’s a blurry statistical snapshot of how humans use language on the internet.

At its core, a Large Language Model is just a next-token prediction machine.

That’s it. It takes a sequence of tokens and guesses, statistically, what comes next.

But don’t let that simplicity fool you.

When this prediction task is scaled up—across trillions of tokens—the model starts doing more than just stringing words together. It builds internal representations of language, concepts, and structure.

It doesn’t “understand” the world like we do.

But it does learn to encode ideas like “cat,” “startup,” or even “grief” in high-dimensional space because it has seen them, again and again, in wildly diverse contexts.

It’s not conscious. It’s not sentient. (Yet ;)

But it has learned a compressed, lossy version of how humans express meaning—and it uses that to autocomplete your sentence.

One token at a time.

Note to the reader:
This issue's goal is to help you develop a strong intuition about how LLMs are created—the way we know and use them today.

There’s a ton of nuance under the hood, but much of it has been intentionally abstracted to keep the concepts accessible and the mental models sharp. Think of this as your map, not the full terrain.

How Is a GPT Born?

High-level view of the pretraining pipeline

If an LLM is a next-token prediction machine, then how does something like GPT-4 come to life?

It all starts with one goal:

Turn the entire internet into numbers—and teach a giant neural network to guess what comes next.

This process happens in three major phases:

Pretraining — Build the brain
Supervised Fine-Tuning (SFT) — Teach it to be helpful
Reinforcement Learning (RL + RLHF) — Let it discover better ways to reason and respond

Let’s break these down, starting with the most compute-heavy phase of all: pretraining.

Pretraining

Turning the Internet Into Model Food

Before your GPT model can chat, code, or give questionable dating advice, it goes through pretraining.

This is where it learns language, patterns, and structure—all by observing the internet.

To make that happen, you first need to collect, clean, and structure the data at massive scale.

Step 1: Crawl the Web

To train a language model, you first need text—a lot of it.

Big labs like OpenAI, Anthropic, and Google usually operate their own web crawlers: automated bots that surf the internet, follow links, and download publicly available pages.

But there's also Common Crawl, a massive open-source project that has indexed over 250 billion web pages since 2007 and adds billions more every month.

Whether labs use Common Crawl, their own crawlers, or both, the output is more or less the same:

A giant pile of raw web data.

Unfiltered. Untagged. Repetitive. Messy.

Before it can be used to train anything, this data needs to be cleaned, deduplicated, and structured into something the model can actually learn from.

Step 2: Clean the Chaos

Raw web data isn’t ready for training out of the box.
It needs to be heavily preprocessed before a model can learn from it.

Here’s what that cleaning typically involves:

URL filtering – Remove spammy, unsafe, or blacklisted domains (blocklists + heuristics)
Text extraction – Strip away HTML, scripts, boilerplate, and navigation junk
Language filtering – Detect and keep mostly-English pages (e.g. ≥65% English by content)
Deduplication – Use hashing techniques like MinHash to remove near-identical documents
PII removal – Automatically detect and scrub emails, addresses, and personal details
Content ranking – Weight sources like Wikipedia, books, and code repositories higher in the mix

Only after this multi-step scrubbing does the dataset become usable for training.

A good example? Hugging Face’s FineWeb—built on top of Common Crawl and C4, but curated with multiple filtering passes to create a clean, diverse corpus optimised for LLMs.

Step 3: Tokenization

Turning Text Into Numbers

Neural networks don’t read text like we do.
They work with numbers—vectors, matrices, probabilities.

So before training can begin, all that cleaned-up internet data needs to be tokenized—broken into smaller chunks called tokens, and mapped to unique numerical IDs.

But here’s the challenge:

You can’t just feed the raw characters (too long, too inefficient)
You can’t use whole words either (too many of them, and it's not agnostic to typos or new words)

Tokens hit the sweet spot—subword units that are small enough to be reusable, but large enough to be efficient.

For example, the word education might be split into:

["edu", "ca", "tion"] → [2451, 9123, 7812]

The most common technique? Byte Pair Encoding (BPE)—an algorithm that merges frequently seen letter pairs or subwords into new tokens to build a vocabulary.

Once tokenized, every document becomes a sequence of integers.

And that’s what the model trains on—a long list of numbers, learning to predict what comes next.

Not the next word.
Not the next sentence.
Just the next token ID.

Step 4: Training the Neural Network

Teaching the Model to Guess the Next Token

Now that we’ve turned the internet into token IDs, it’s time to teach the model how to predict what comes next.

At a high level, you can consider the LLM to be a giant black box—a deep neural network with billions (or hundreds of billions) of parameters.

Its job?

Take in a sequence of tokens and predict how likely each possible token in the vocabulary is, to occur next.

Let’s say we feed it the phrase:

“The cat sat on the”

The model processes this context and uses its parameters—spread across dozens (or hundreds) of neural layers—to generate a probability distribution over its entire vocabulary.

It might assign:

“mat” → 0.30
“roof” → 0.40
“idea” → 0.01
…and so on for every token it knows

The token with the highest probability is selected (or sampled from), and that becomes the next output.

That’s the entire game:
Take tokens in → guess the next one → repeat

So, how does it learn?

If the model predicts “roof” but the actual word was “mat,” it calculates a loss (usually cross-entropy), and uses backpropagation to nudge all its weights ever so slightly in the right direction.

This happens over and over—across billions of sequences.

Over time, it learns:

The structure of language
The relationships between concepts
And the common patterns behind how humans express thoughts

The result? A model that looks like it understands language—
when it’s really just getting extremely good at continuing sequences of tokens.

And somehow, from this repetitive statistical game…
emerges a system that can code, explain quantum physics, or write you a haiku.

A Peek Inside the Black Box: Transformers and Self-Attention

The Architecture That Made GPTs Possible

So far, we’ve been treating the model as a black box.
But what’s inside that black box?

It’s built using one of the most important breakthroughs in deep learning: the Transformer architecture, introduced in the 2017 paper “Attention is All You Need.”

What made it revolutionary?

It allowed models to attend to all parts of the input simultaneously, rather than sequentially like older RNNs or LSTMs. This gave them a much stronger sense of context—and made training massively parallelizable.

At the heart of this is a mechanism called self-attention.

Self-Attention: Why It Changed Everything

Take the sentence:

“The animal didn’t cross the street because it was tired.”

What does “it” refer to?

Humans instantly link “it” to “the animal.”
But traditional models struggled with this kind of long-range dependency.

Self-attention fixed that.

Self-attention allows every token in the sequence to look at every other token—
and decide how much it should “care” about them when forming its meaning.

So when processing the word “it”, the model can learn to pay more attention to “animal” than to “street.”

Under the hood, this is done by creating three vectors per token:

A query (what am I looking for?)
A key (what do I represent?)
A value (what information do I carry?)

The attention weight is computed using:

Attention = softmax(QKᵀ / √dₖ) · V

Where Q, K, and V are matrices of all query, key, and value vectors in the sequence.

This lets the model build rich representations of tokens in context, across the whole input.

But it doesn’t stop there.

LLMs use Multi-Head Attention—multiple attention mechanisms running in parallel, each learning to focus on different aspects: grammar, logic, meaning, etc.

Each “head” gets a different learned projection of Q, K, and V. The outputs are then concatenated and linearly projected again to form the final attention output.

This allows the model to attend to multiple types of relationships at once.

Why GPT Uses a Decoder-Only Transformer

The original Transformer has two components:

An encoder (to understand full input sequences)
A decoder (to generate sequences, one token at a time)

Models like BERT use the encoder.

But GPTs are decoder-only models, optimized for generation.

Why decoder-only? Because GPTs generate language one token at a time, without looking into the future.

To ensure this, they use masked self-attention, so that each token can only see previous tokens, never the ones ahead.

This is what makes GPTs autoregressive.

They take in a context, and generate the next token, then the next, and so on.

With this setup, the model becomes far more than a simple predictor.

It learns to encode structure, relationships, and meaning—all through attention.

And when you combine this with massive scale?

You get a model that doesn’t just finish your sentence—
Sometimes, it finishes your thought.

Inference: Using the Trained Model

Once training is complete, all the model’s internal weights—the billions of tiny knobs it used to learn patterns—are frozen.

This frozen model is what we now use for inference.

Inference is what happens when you type a prompt into ChatGPT and hit enter.

Behind the scenes, here’s what’s going on:

The model takes your input (already tokenized into numbers), feeds it into its deep neural layers, and computes a probability distribution over all possible next tokens—just like it did during training.

Except now, there’s no ground truth to compare against.
No loss to calculate.
No weights to update.

Inference is the model simply doing what it learned to do:
Predict one token at a time, over and over again, until it decides to stop.

How it chooses the next token depends on sampling strategies, like:

Greedy decoding – always pick the highest probability token (more predictable)
Top-k or nucleus sampling – sample from the top k likely tokens (more diverse or creative)
Temperature – controls randomness; lower = more focused, higher = more exploratory

That’s why the same prompt can sometimes give you different responses.

It’s still just playing autocomplete—
But now it’s fast, frozen, and focused entirely on generation.

Look, Ma, a Base Model!

A Raw, Unaligned Internet Simulator

Once pretraining is complete, you get what’s called the base model.

But let’s be clear upfront:

This is not the model you interact with on ChatGPT.

The base model hasn’t been fine-tuned to be helpful, polite, or even factually consistent.

What it is… is a wildly powerful token-level internet simulator.

Its only job is to predict the next token—based purely on the statistical patterns it learned from trillions of examples during training.

That’s it.

Ask it something like:

“What is 2 + 2?”

It might not say “4.”

Because it’s not doing math—it’s just trying to complete the sentence the way it saw humans do it online.

That continuation could be a quiz, a joke, or a rant about calculators.
It all depends on its training distribution.

Here are a few key mental models to keep in mind:

1. It’s stochastic, not deterministic.
Even with the same prompt, you might get different outputs.
Why? Because the model samples from a probability distribution over possible next tokens—not always picking the same one.

2. It doesn’t “know” facts—it compresses patterns.
The model doesn’t memorize the internet.
It stores a lossy, statistical abstraction of everything it’s seen inside its billions of parameters.

Think of it like:

“What’s the most probable way a human would continue this sentence, based on a blurry snapshot of the internet?”

3. It sometimes regurgitates exact data.
Certain sources—like Wikipedia, academic papers, or popular GitHub repos—are heavily represented in training.
So if you input the beginning of a famous article or block of code, the model might complete it verbatim.

This is called regurgitation—a byproduct of overfitting on specific examples.

4. It hallucinates—often.
If you ask about something obscure, ambiguous, or poorly represented in its training data…
It may confidently make things up.

Why?

Because it’s not pulling from a knowledge base.
It’s just guessing the next token based on patterns it has seen.

5. You can still prompt it cleverly.
Even in its raw form, you can get assistant-like behavior using techniques like few-shot prompting:

“Here’s how I want you to behave. Here are a few examples. Now your turn.”

It won’t be as consistent or safe as a fine-tuned model—but this is where prompt engineering begins.

So think of the base model as the brain:
Highly capable, unfiltered, and trained to mimic the internet’s statistical style of expression.

What it’s not yet… is an assistant.

For that, we need the next step: post-training.

That’s the Brain. Next Up: The Behaviour.

By now, you’ve seen what goes into building a base model—from crawling the web to teaching it how to predict tokens like a statistical wizard.

But a base model isn’t helpful. It’s not safe. And it definitely doesn’t know when to say, “I don’t know.”

To turn this raw brain into something you can actually talk to (like ChatGPT)…
We need to teach it how to behave.

That’s what we’ll explore in the next issue:

How supervised fine-tuning teaches the model to act like an assistant
Why hallucinations still happen
What makes LLMs refuse, reason, or stumble
And how reinforcement learning adds human preference—and shapes the model’s reasoning style.

Same deep-dive, same intuition-first style—see you next week for Part 2: Teaching the Model to Behave.

References & Further Reading

If you’re curious to explore the foundational material behind this issue, here are some excellent resources I’ve drawn from:

Welcome to NeoSage

Shivani Virdi — Sun, 06 Apr 2025 10:45:18 GMT

Subscribe now

Hey there 👋

Welcome to NeoSage—a technical deep-dive newsletter for engineers, solopreneurs, and AI builders who want to build in-depth intuition on all things AI—from how LLMs work to how to architect multi-agent systems.

This is where you’ll owl-ways get the insights to ride the AI wave skillfully 🦉

Why NeoSage

You’ve heard “AI will replace you.”
You tried vibe coding with your favorite LLM (rhymes with fraud), only to end up running in circles, with nothing even remotely shippable.
Now you’re wondering:
Is building consistent systems with AI even possible—or is it all hype?

That’s where NeoSage comes in:
The messiah.
The harbinger of clarity and systems thinking in a world of AI chaos.

Sure, AI is evolving at breakneck speed. But it’s not all rainbows and unicorns (though quite a few are being spun up because of it 😉).
The media hype? Overstated. The practical resources? Underwhelming.

To actually build with AI, you need to understand the what and how.
You need to think in systems.
You need to connect the dots between deterministic code and stochastic magic.
And trust me—
That prompt engineering course?
It’s not it 😅

NeoSage exists to bridge the gap between research papers and real-world engineering.

And now that it’s here—and you’re here—rest assured:
Every week, you’ll get deep breakdowns of architectures, tools, and ideas powering today's AI systems.

No fluff.
No AI hype.
Just lessons from the trenches of applied AI.

What You’re In For

As a NeoSage subscriber, you’ll get deeply technical insights delivered straight to your inbox every week.

Expect breakdowns on:

Technical Concepts (Foundational to Advanced):
AI/ML fundamentals, deep learning intuition, LLM mechanics—explained clearly.
AI Systems Architecture:
RAG, Agentic AI, building with LLMs, production-ready setups, and security considerations.
Model Deep Dives:
GPT-4o, Vision Models, DeepSeek, Gemini, and more
How-To Walkthroughs:
Tools, libraries, frameworks (MCP, LangChain, etc.)—explained with actual dev workflows.
Mini Projects + Tutorials:
Get your hands dirty with guided builds and real-world projects.

Meet Your Owl-thor

Shivani Virdi is a software engineer with over 5 years of experience building products and systems at Adobe, Amazon and now at Microsoft. Writes deep dive technical content breaking today’s AI landscape, tying research and systems together to a readership of 22K+ engineers on LinkedIn.

What the First Month Looks Like

Here’s what’s coming up in your first four issues of NeoSage:

LLMs: The What, How, Why (An engineer’s guide)
→ Understand large language models from scratch—no fluff, just real foundations that stick.
How to Use Modern-Day LLMs
→ Go beyond prompting—learn how to think with and think around LLMs to get actual results.
RAG for Noobs
→ A beginner-friendly breakdown of Retrieval-Augmented Generation, and how it powers smarter AI systems.
Build Your Second Brain with MCP (Model Context Protocol)
→ What is MCP? Where do you begin? How do you make the most of it? We’ll break it down and walk you through setting up a second brain with Claude and MCP.

Subscribe for free

Sit back and let NeoSage sharpen your AI engineering skills,
One Wednesday at a time, five minutes at a time.

Hit subscribe, share it with a friend, and let’s build the future of AI the way it was meant to be: Thoughtfully, skillfully, and at scale.

See you in your inbox,
Shivani