<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[NeoSage]]></title><description><![CDATA[NeoSage is a technical deep-dive newsletter for engineers, solopreneurs, and AI builders who want to build in-depth intuition on all things AI. This is where you’ll owl-ways get the insights to ride the AI wave skillfully.]]></description><link>https://blog.neosage.io</link><image><url>https://substackcdn.com/image/fetch/$s_!sfKp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8266222-d17f-4639-a529-67ae92f79bb1_1024x1024.png</url><title>NeoSage</title><link>https://blog.neosage.io</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 03:27:47 GMT</lastBuildDate><atom:link href="https://blog.neosage.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Shivani Virdi]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[shivanivirdi@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[shivanivirdi@substack.com]]></itunes:email><itunes:name><![CDATA[Shivani Virdi]]></itunes:name></itunes:owner><itunes:author><![CDATA[Shivani Virdi]]></itunes:author><googleplay:owner><![CDATA[shivanivirdi@substack.com]]></googleplay:owner><googleplay:email><![CDATA[shivanivirdi@substack.com]]></googleplay:email><googleplay:author><![CDATA[Shivani Virdi]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Prompt Lifecycle Every AI Engineer Should Know]]></title><description><![CDATA[Why prompts break in production and what it takes to make them reliable.]]></description><link>https://blog.neosage.io/p/the-prompt-lifecycle-every-ai-engineer</link><guid isPermaLink="false">https://blog.neosage.io/p/the-prompt-lifecycle-every-ai-engineer</guid><dc:creator><![CDATA[Dimple Sharma]]></dc:creator><pubDate>Sat, 07 Feb 2026 15:58:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-mv5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Many people think &#8220;prompt engineering&#8221; means finding clever ways to talk to ChatGPT. And sure, if you&#8217;re turning your vacation photos into Ghibli art, that&#8217;s fine.</p><p>But when you&#8217;re building production systems that talk to LLMs through APIs? That&#8217;s a completely different problem.</p><p>Here&#8217;s the pattern: Your new AI-powered support bot is a hit. For a week, it&#8217;s the star of your engineering retrospective. Then the 3 AM PagerDuty alert fires. A silent model update broke accuracy. Someone found a prompt injection vulnerability. The cloud bill tripled. And the failing prompt? It&#8217;s a raw string, hardcoded in a dozen files, and no one can tell which version was even running.</p><p>If this sounds familiar, you&#8217;ve hit the most common production landmine: treating prompts like throwaway strings instead of mission-critical infrastructure.</p><pre><code><code>prompt = f"""Extract the user's name, order ID, and the specific issue from this support ticket. Format the output as a JSON object with the keys 'customerName', 'orderId', and 'issueSummary':\\n\\n{support_ticket_text}"""
</code></code></pre><p>Here&#8217;s why that single line of code is a ticking time bomb:</p><ul><li><p><strong>Prompt Rot:</strong> Your prompt&#8217;s behavior is tightly coupled to a specific model version. When the provider updates the model (which they do, <em>constantly</em>), the subtle patterns your prompt relied on can shift, causing performance to decay silently. The prompt &#8220;rots&#8221; without any code ever changing.</p></li><li><p><strong>The Versioning Black Hole:</strong> When a failure occurs, can you definitively say which prompt version was responsible? Without a versioning system, debugging is guesswork. You can&#8217;t roll back, and you can&#8217;t reliably reproduce successes.</p></li><li><p><strong>The Observability Black Box:</strong> Is a prompt slow? Is it expensive? Is it consistently failing for a specific user segment? When your prompt is just a string, it has no telemetry. You&#8217;re flying blind, unable to track latency, token costs, or quality scores.</p></li><li><p><strong>The Economic Drain:</strong> Hardcoded prompts are rarely optimized. They&#8217;re bloated with unnecessary verbosity or inefficient few-shot examples, leading to higher token counts that bleed your budget, one API call at a time.</p></li><li><p><strong>Security Blindspots:</strong> A raw, unvalidated string passed to an LLM is a security vulnerability waiting to happen. With <strong>prompt injection</strong>, a malicious user overrides your instructions. It is not a theoretical threat. It happens when you treat user input as trusted text.</p></li></ul><p>This is a systems engineering problem. And it demands an engineering solution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6HJT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6HJT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!6HJT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!6HJT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!6HJT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6HJT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:256411,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/187189335?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6HJT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!6HJT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!6HJT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!6HJT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2009c5cb-93f1-4d8a-8796-8a036454ad57_1200x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Systemic Solution: The Prompt Lifecycle</h2><p>So, what does the solution look like in practice?</p><p>Treat prompts like critical software artifacts. Version them. Test them. Monitor them. We solved this for our application code decades ago with DevOps. The chaos of ad-hoc prompt management is not a new <em>type</em> of problem; we&#8217;re just dealing with a new <em>type</em> of artifact. The solution, therefore, is to apply the same battle-tested engineering discipline.</p><p>Welcome to the <strong>Prompt Lifecycle.</strong></p><p>This is the core mental model for shifting from fragile strings to a production-grade system. It&#8217;s a continuous, circular process for managing prompts with the same rigor as any other piece of your infrastructure. It consists of five distinct, non-negotiable stages:</p><ol><li><p><strong>Design:</strong> This is where you define the prompt not as a raw string, but as a structured, version-controlled asset. You use templates to separate logic from data and a clear schema to define its metadata, parameters, and target model.</p></li><li><p><strong>Test:</strong> Before a prompt ever sees production traffic, it must pass a rigorous, multi-layered evaluation suite. This is where you move from a subjective &#8220;looks good&#8221; vibe check to data-driven <em>proof</em> that the prompt is effective, reliable, and safe.</p></li><li><p><strong>Deploy:</strong> Once validated, the prompt is published to a centralized <strong>Prompt Registry</strong>. This creates a single source of truth, allowing your applications to fetch specific, versioned prompts dynamically without requiring a full code deployment.</p></li><li><p><strong>Monitor:</strong> After a prompt is live, you need eyes on it. This stage is about collecting critical, real-world telemetry, such as tracking latency, token costs, and quality scores to understand how the prompt is <em>actually</em> performing in the wild and to catch regressions before they become incidents.</p></li><li><p><strong>Maintain:</strong> The lifecycle doesn&#8217;t end at deployment. Based on monitoring data and new business requirements, prompts are versioned, improved, or gracefully retired. This is the feedback loop that ensures your system evolves and adapts over time.</p></li></ol><p>This five-stage loop transforms your process from a linear, fire-and-forget task into a sustainable, continuous improvement cycle. It&#8217;s the engineering foundation for building with AI, not just dabbling in it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-mv5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-mv5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!-mv5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!-mv5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!-mv5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-mv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:285240,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/187189335?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-mv5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!-mv5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!-mv5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!-mv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd99c323b-f27f-43c5-b7fa-46bd3f17261b_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p><em><strong>Today&#8217;s issue breaks down the engineering lifecycle for production prompts. It&#8217;s one piece of a much bigger puzzle: building production AI systems that actually work.</strong></em></p><p><em><strong>If you&#8217;re looking for a structured and hands-on way to step into AI engineering, the Engineer&#8217;s RAG Accelerator is for you. Check it out here:</strong></em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.neosage.io/&quot;,&quot;text&quot;:&quot;The Engineer's RAG Accelerator&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.neosage.io/"><span>The Engineer's RAG Accelerator</span></a></p><p><em><strong>Now, let&#8217;s get back to our prompt system.</strong></em></p><div><hr></div><h2>Design &amp; Development: Crafting Maintainable Prompts</h2><p>The Prompt Lifecycle begins with Design. And let&#8217;s be clear: this isn&#8217;t a creative writing exercise; it&#8217;s an architectural one. We&#8217;re going to transform a brittle string into a robust software artifact by giving it a formal, enforceable structure.</p><p>This structure has three essential components.</p><h3><strong>Component 1: Decouple with Templates</strong></h3><p>First, we kill the &#8220;String-in-Code&#8221; anti-pattern by separating the prompt&#8217;s static logic from its dynamic data. A template engine like <strong><a href="https://jinja.palletsprojects.com/">Jinja2</a></strong> is the standard tool for this job. It lets you build prompts that contain logic (for example, in the code below &#8220;if output_format is &#8216;detailed&#8217; then request these fields&#8221;), while the application code is <em>only</em> responsible for providing the data (such as support_ticket_text, output_format).</p><p>This is a clean separation of concerns. The application handles <em>what</em> data to send; the template handles <em>how</em> that data is presented to the model.</p><h3><strong>Component 2: Define a Formal Schema</strong></h3><p>Next, we elevate the prompt from a loose text file to a true, self-describing artifact. We do this by defining it in a structured <code>YAML </code>file, which bundles the template itself with a rich set of auditable metadata.</p><p>This schema is the canonical definition of your prompt. It&#8217;s the manifest. You can optionally enforce it with validation libraries like Pydantic to guarantee that every prompt in your system is a well-defined, predictable asset. This YAML file becomes the single source of truth for your prompt&#8217;s structure and requirements.</p><p>A professional prompt definition looks like this.</p><pre><code><code># summarize_ticket.v1.yaml
# &#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;
# METADATA: Describes the prompt artifact
# &#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;
name: SummarizeSupportTicket
description: "Generates a concise summary of a user support ticket for internal review."
version: 1
tags: ['support', 'summarization']

input_variables:
  - support_ticket_text
  - output_format # Can be "detailed" or "summary"

execution_settings:
  model: "claude-3-opus-20240229"
  temperature: 0.5
  max_tokens: 512

# &#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;
# TEMPLATE: The actual Jinja2 prompt logic
# &#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;&#9552;
template: |
  Your task is to extract information from the following support ticket.
  Support Ticket: {{ support_ticket_text }}

  {% if output_format == "detailed" %}
  Please format the output as a detailed JSON object with the following keys: 'ticketId', 'customerEmail', 'submittedAt', 'productTier', and 'fullIssueDescription'.
  {% elif output_format == "summary" %}
  Please format the output as a compact JSON object with only the following keys: 'ticketId' and 'issueSummary'.
  {% endif %}
</code></code></pre><p>Suddenly, your prompt isn&#8217;t a guess; it&#8217;s a <em>specification</em>. It has a name, a version, explicit inputs, and the exact model settings it was tested against.</p><h3><strong>Component 3: Establish Version Control</strong></h3><p>Finally, and this part is non-negotiable: these <code>.yaml</code> files are committed to <strong>Git.</strong></p><p>This gives your prompts the same safety net you have for every other critical piece of your infrastructure: a complete audit trail (<code>git blame</code>), safe rollbacks (<code>git revert</code>), and a clear comparison between versions (<code>git diff</code>). If your prompt isn&#8217;t in version control, it doesn&#8217;t exist.</p><p>With these three components, you&#8217;re no longer wrestling with a hardcoded string. You have a structured, versioned, and auditable software artifact.</p><p>Now, let&#8217;s go prove it actually works.</p><h2>Testing &amp; Evaluation: From "Looks Good" to "Provably Good"</h2><p>Hope is not a testing strategy.</p><p>Now that you have a structured, versioned artifact, how do you prove it&#8217;s any good? In conventional engineering, the answer is a test suite. For prompts, the discipline is the same, but the methods are new. It&#8217;s time to move from a subjective &#8220;looks good to me&#8221; spot-check to a rigorous process that <em>proves</em> your prompt&#8217;s quality with data.</p><p>The most effective way to do this is to adopt an <strong>Evaluation Maturity Model.</strong> Think of it as a three-level roadmap, starting with a simple foundation and building towards a state of automated, runtime guarantees. As the &#8220;Prompt Evaluation Pyramid&#8221; shows, each level provides a new layer of confidence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qquO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qquO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!qquO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!qquO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!qquO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qquO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:273658,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/187189335?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qquO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!qquO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!qquO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!qquO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481a5f6c-7104-48c6-9446-261e3f01436a_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Level 1: The Foundation - Curated Golden Datasets</strong></h3><p>This is where every professional testing strategy begins. You create a <strong>&#8220;golden dataset&#8221;</strong>: a curated collection of diverse, representative inputs and their corresponding, ideal outputs. These are the canonical benchmark for your prompt. When you draft a new version, you run it against the golden inputs and compare the model&#8217;s output to your ideal answer.</p><p><strong>The Golden Dataset Reality:</strong> Golden datasets require knowing what &#8220;correct&#8221; means for your specific task.</p><ul><li><p><strong>Tasks with exact answers:</strong> Easy to verify<br>Examples: Classification (&#8220;Is this spam?&#8221; &#8594; Yes), extraction (&#8220;Pull order_id from receipt&#8221; &#8594; &#8220;12345&#8221;)</p></li><li><p><strong>Tasks with quality criteria:</strong> Need defined rules<br>Examples: Summarization (&#8220;under 100 words, captures main points&#8221;), rewrites (&#8220;professional tone, preserves facts&#8221;)</p></li><li><p><strong>Tasks depending on context:</strong> Hardest to evaluate<br>Examples: Customer support replies (tone varies by sentiment), recommendations (depend on user history)</p></li></ul><p>This is your essential safety net for catching regressions. Skip this and you have no way to measure if your changes help or hurt.</p><h3><strong>Level 2: Automation - Metric-Based Evaluation</strong></h3><p>Now you have a golden dataset. Automate the comparison. Evaluation frameworks like <strong><a href="https://docs.ragas.io/">Ragas</a></strong> or <strong><a href="https://deepeval.com/">DeepEval</a></strong> run your prompts against your golden dataset and calculate quantitative scores using different metric types:</p><ul><li><p><strong>Deterministic:</strong> Exact match checks</p><p>Fast, catches obvious failures.<br>Example: Does output[&#8220;order_id&#8221;] match expected? Is JSON structure valid?</p></li><li><p><strong>Semantic:</strong> Meaning match when wording varies</p><p>Works for summarization, Q&amp;A, any tasks where meaning matters more than exact wording.<br>Example: &#8220;Meeting is on Monday&#8221; vs &#8220;Monday is the meeting date&#8221; - same meaning, different words. Embedding similarity scores this.</p></li><li><p><strong>LLM-as-a-Judge:</strong> Subjective quality (tone, helpfulness, conciseness)</p><p>Example: Score &#8220;Is this professional?&#8221; for customer emails.<br>Warning: LLM judges are biased (prefers longer outputs, own model family). Trust output as signal, not truth.</p></li></ul><h3><strong>Level 3: The Guarantee - Runtime Verification</strong></h3><p>Enforce output structure at runtime, before it hits your downstream systems.</p><p>Libraries like <strong><a href="https://github.com/jxnl/instructor">Instructor</a></strong> integrate with <strong><a href="https://docs.pydantic.dev/">Pydantic</a></strong>. You define your output schema as a Pydantic model. Instructor forces the LLM output to conform to that schema, acting as a gatekeeper. If validation fails, it re-prompts with the error.</p><p>This ensures your application only receives valid, structured output. No more hoping for clean JSON - you guarantee it.</p><p>By progressing through these three levels, you transform your evaluation process from a hopeful guess into an engineering discipline. You build a system to <em>prove</em> that your prompts work, every time.</p><h2>Deployment &amp; Monitoring: Shipping and Observing Prompts in the Wild</h2><p>So, your prompt artifact passed every test in the lab. What happens when you throw it into the chaos of a real production environment?</p><p>A passing test suite proves your prompt works. It doesn&#8217;t prove it survives production. Models drift without warning. Users send unexpected inputs. Latency spikes. This section covers the infrastructure you need: deployment pipelines, monitoring systems, and rollback controls.</p><p>This system has four key components.</p><h3><strong>Component 1: The Prompt Registry</strong></h3><p>First, you need a single source of truth. A <strong>Prompt Registry</strong> is a centralized, versioned repository for your validated prompt artifacts. Instead of your application reading a local <code>.yaml</code> file, it fetches the prompt directly from this registry at runtime.</p><p>This is a critical decoupling step. It means you can update a prompt without having to redeploy your entire application. Tools like <strong>LangSmith</strong> or <strong>PromptLayer</strong> provide managed registries, but you can also build one with a simple web framework (FastAPI with a database like PostgreSQL). The principle is what matters: a centralized service that serves versioned prompts over an API. Your application code asks for <code>SummarizeSupportTicket:v2</code>, and the registry delivers it.</p><h3><strong>Component 2: The CI/CD Pipeline for Prompts</strong></h3><p>With a registry in place, you can automate deployment. This is the CI/CD pipeline for prompts.</p><p>On commit to a prompt file, your pipeline runs the tests from Section 4. If they pass, it publishes the validated artifact to your Prompt Registry. If they fail, the change is blocked. No prompt reaches production without passing evaluation.</p><p>This decouples prompt updates from application deployments. Each can evolve independently.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XUZK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XUZK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!XUZK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!XUZK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!XUZK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XUZK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:166439,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/187189335?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XUZK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!XUZK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!XUZK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!XUZK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb68a5cb-f643-4ac0-9724-cf096076eaea_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Component 3: The Observability Framework</strong></h3><p>Shipping is not the end of the journey. Once your prompt is live, you need eyes on it. An <strong>Observability Framework</strong> gives you a real-time dashboard to answer critical questions about your prompt&#8217;s performance in the wild.</p><p>Using emerging observability frameworks like <strong><a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">OpenTelemetry</a></strong>, you can track key metrics for every prompt execution:</p><ul><li><p><strong>Performance:</strong> What is the end-to-end latency?</p></li><li><p><strong>Cost:</strong> How many tokens is this version using? Is it bleeding your budget?</p></li><li><p><strong>Quality:</strong> What are its real-world quality scores? Are you seeing a drift or regression in performance?</p></li><li><p><strong>User Feedback:</strong> What&#8217;s your approval rate? Are users flagging bad outputs?</p></li></ul><p>Without this data, you&#8217;re flying blind. With it, you can spot regressions before they become incidents and make data-driven decisions about which prompts to optimize or retire.</p><h3><strong>Component 4: Safe Rollouts with A/B Testing</strong></h3><p>Finally, a mature system <em>never</em> rolls out a new prompt version to 100% of users at once. You de-risk the deployment by using <strong>A/B testing.</strong></p><p>By integrating your application with a feature flagging tool, you can configure it to fetch different versions of a prompt for different user segments. For example:</p><ul><li><p>90% of users get the trusted <code>v1</code> of <code>SummarizeSupportTicket</code>.</p></li><li><p>10% of users get the new <code>v2</code>.</p></li></ul><p>You then compare the observability data for both versions side-by-side. If <code>v2</code> is cheaper, maintains quality, and user feedback stays positive, you can gradually roll it out to all users. If it causes a spike in errors, you can kill the feature flag instantly, rolling everyone back to <code>v1</code> without a single line of code being deployed. This is how you iterate with confidence, not just hope.</p><h2>Builder's Takeaway: From Prompt Janitor to Systems Architect</h2><p>A 3 AM PagerDuty alert from a hardcoded string. An autonomous system that optimizes its own prompts. That&#8217;s the gap this playbook bridges.</p><p>That transformation requires rethinking your role. You&#8217;re not a prompt writer. You&#8217;re a systems architect.</p><p>The prompt string is a disposable implementation detail. The valuable asset is the infrastructure around it: the system that can test, deploy, and monitor prompts at scale.</p><p>So here&#8217;s a heuristic for your next code review: treat every hardcoded prompt as a high-severity bug. This week, find one critical prompt living as a raw string in your codebase. Give it a home in a version-controlled .yaml file. Write one golden test case for it.</p><p>That first step establishes the foundation for a production-grade system.</p><p>The future of AI engineering is building systems that manage prompts, not perfecting individual prompts.</p><div class="pullquote"><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WfaS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WfaS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!WfaS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!WfaS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!WfaS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WfaS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png" width="402" height="402" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:402,&quot;bytes&quot;:1793174,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/187189335?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WfaS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!WfaS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!WfaS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!WfaS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aaf8677-9dd3-463a-90a6-1a5ee0dd9c6e_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That one &#8216;perfect&#8217; prompt you spent a week on? It doesn&#8217;t scale. The system you should have built does.</p><p>Stay Dangerous. Hoot.</p></div><p><em>If this issue changed how you think about prompts in production, drop a heart or leave a comment. That&#8217;s the only way I know this landed.</em></p><p><em>And if you&#8217;ve been wanting to go deeper than newsletters can take you, to actually build, evaluate, and deploy production AI systems with a structured curriculum and a community of senior engineers... <strong>that&#8217;s what the Engineer&#8217;s RAG Accelerator is for.</strong></em></p><p><em><strong>6 weeks. Hands-on. Learn alongside engineers from Microsoft, Amazon, Adobe, Visa, and more (that&#8217;s where our previous cohort&#8217;s engineers came from).</strong></em></p><p><em>Our last cohort sold out in a week. The waitlist for the next cohort is open. Join now for early bird access when enrollment opens.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.neosage.io/&quot;,&quot;text&quot;:&quot;The Engineer's RAG Accelerator&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.neosage.io/"><span>The Engineer's RAG Accelerator</span></a></p><h2>References &amp; Further Reading</h2><p>Ready to go deeper? Here are the tools and frameworks to continue your journey from prompt writer to systems architect.</p><h3>Core Tooling &amp; Templating</h3><ul><li><p><a href="https://jinja.palletsprojects.com/">Jinja2</a>: For prompt templating.</p></li><li><p><a href="https://docs.pydantic.dev/">Pydantic</a>: For data validation and defining schemas.</p></li><li><p><a href="https://github.com/jxnl/instructor">Instructor</a>: For getting structured, validated output from LLMs.</p></li></ul><h3><strong>Evaluation Frameworks</strong></h3><ul><li><p><a href="https://docs.ragas.io/">Ragas</a>: For LLM evaluation, particularly in RAG systems.</p></li><li><p><a href="https://deepeval.com/">DeepEval</a>: A pytest-like LLM evaluation framework.</p></li></ul><h3><strong>Observability &amp; Management</strong></h3><ul><li><p><a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">OpenTelemetry</a>: Emerging open standard for LLM observability.</p></li></ul><h3><strong>Advanced Patterns</strong></h3><ul><li><p><a href="https://github.com/stanfordnlp/dspy">DSPy</a>: For programmatic, self-optimizing prompting.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[The AI Engineer's Roadmap for 2026]]></title><description><![CDATA[It's been a while. I'm back with a clear plan for overwhelmed engineers. And the most important thing I've ever built for you.]]></description><link>https://blog.neosage.io/p/the-ai-engineers-roadmap-for-2026</link><guid isPermaLink="false">https://blog.neosage.io/p/the-ai-engineers-roadmap-for-2026</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Fri, 09 Jan 2026 19:38:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DLzj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;re an experienced software engineer looking at the AI landscape today, you&#8217;re probably feeling overwhelmed. It&#8217;s chaos. Every day brings a new model, a new framework, a dozen new tools, and a thousand conflicting opinions on social media. The pressure to &#8220;upskill or get left behind&#8221; is immense, but the path forward is buried in noise.</p><p>It all boils down to one, paralysing question: &#8220;Amid all this, where do I even begin?&#8221;</p><p>That question has been on my mind a lot. And I know, it&#8217;s been a while since we last spoke.</p><p>The truth is, giving you a real answer to that question required more than just another newsletter. As a solo founder managing this project, I realised that to provide a true, structured path out of the chaos, I couldn&#8217;t just write about it. I had to go away and actually build it.</p><p>That&#8217;s what I&#8217;ve been doing these past few months, pouring all my energy into building something bigger, something I believe is the most valuable answer I can give you.</p><p>So today, I&#8217;m back. And I promise this issue will more than make up for the silence.</p><p>It starts with a clear plan.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.neosage.io&quot;,&quot;text&quot;:&quot;The Engineer's RAG Accelerator&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.neosage.io"><span>The Engineer's RAG Accelerator</span></a></p><h3><strong>A Roadmap for Clarity</strong></h3><p>The feeling of being overwhelmed is a symptom of not having a map. This is that map.</p><p>This is the no-hype, no-shortcuts, 12-week plan I would give to any experienced software engineer who wants to stop chasing trends and start building real AI systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DLzj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DLzj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!DLzj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!DLzj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!DLzj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DLzj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1980397,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/184043082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DLzj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!DLzj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!DLzj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!DLzj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f0a83c-52ec-471c-a7cd-0272177c5fc3_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>WEEKS 1-2: Foundations</strong></p><ul><li><p>How LLMs actually work: pretraining, post-training, and why they hallucinate</p></li><li><p>RAG architecture and core components</p></li><li><p>How to identify, qualify, and define your RAG project</p></li><li><p>Setting up your production stack: vector database, orchestration framework, LLM API</p></li><li><p>Build your first end-to-end pipeline</p></li></ul><blockquote><p>You&#8217;re not optimizing yet. You&#8217;re building intuition.</p></blockquote><p><strong>WEEKS 3-4: Chunking, Embeddings, and Your First Evaluation</strong></p><ul><li><p>Why chunking is your most important decision</p></li><li><p>Embeddings deep dive: how text becomes vectors</p></li><li><p>Chunking strategies: fixed-size, semantic, AST-based for code</p></li><li><p>Introduction to evaluation: LLM-as-a-Judge and human review</p></li><li><p>Hands-on: systematically compare strategies and find a winner</p></li></ul><p><strong>WEEKS 5-6: Advanced Retrieval Architectures</strong></p><ul><li><p>Vector database internals: how HNSW and ANN algorithms work</p></li><li><p>Dense vs Sparse retrieval: solving the coverage problem</p></li><li><p>Hybrid retrieval with Reciprocal Rank Fusion (RRF)</p></li><li><p>Reranking: bi-encoders vs cross-encoders</p></li><li><p>Two-stage retrieval: LLM routing for precision</p></li><li><p>The CAL Framework: Cost-Accuracy-Latency tradeoffs</p></li></ul><p><strong>WEEKS 7-8: Mastering Evaluation</strong></p><blockquote><p>Evaluation is harder than building RAG.</p></blockquote><ul><li><p>Why RAG evaluation isn&#8217;t like traditional ML</p></li><li><p>Synthetic test set generation with RAGAS</p></li><li><p>LLM-as-a-Judge evaluation with DeepEval</p></li><li><p>Bootstrapped Golden Datasets: creating ground truth</p></li><li><p>Choosing the right evaluation strategy for each iteration</p></li></ul><blockquote><p>No evaluation = shipping blind.</p></blockquote><p><strong>WEEKS 9-10: Production Engineering</strong></p><blockquote><p>Your system works. Now make it fast, reliable, and cost-effective.</p></blockquote><ul><li><p>Production RAG architecture: latency, cost, observability</p></li><li><p>Semantic caching: achieving 1000x+ speedup on repeat queries</p></li><li><p>Production backend with FastAPI and streaming responses</p></li><li><p>Integrating observability and tracing</p></li><li><p>Smart retries, adaptive prompting, cache invalidation</p></li></ul><p><strong>WEEKS 11-12: Advanced Patterns and Deployment</strong></p><blockquote><p>Evolve from a RAG system to an intelligent application.</p></blockquote><ul><li><p>Beyond basic RAG: Cache Augmented Generation and Agentic architectures</p></li><li><p>Advanced query understanding: expansion, decomposition, multi-step RAG</p></li><li><p>Dynamic retrieval: query routing and RAG-as-a-pluggable-tool</p></li><li><p>Production deployment: Docker, cloud platforms, security</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.neosage.io&quot;,&quot;text&quot;:&quot;The Engineer's RAG Accelerator&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.neosage.io"><span>The Engineer's RAG Accelerator</span></a></p><h3><strong>The Starting Point: Why We Start with RAG</strong></h3><p>That&#8217;s the 12-week roadmap. It&#8217;s comprehensive, and looking at it, you might still feel like there&#8217;s a lot to learn. Still a lot, right?</p><p>The key isn&#8217;t to start everywhere at once. It&#8217;s to find the single point of maximum leverage: the one skill that unlocks the rest.</p><p>For any modern AI stack, that point is Retrieval-Augmented Generation.</p><p>This might seem counterintuitive. Isn&#8217;t RAG just for chatbots? Isn&#8217;t it just one small piece of that big roadmap?</p><p>No. That&#8217;s the great misunderstanding. RAG is not just a feature for Q&amp;A; it&#8217;s the <strong>fundamental design pattern</strong> that makes almost every other advanced AI system function.</p><p>Let me show you. Once you see the pattern, you can&#8217;t unsee it:</p><ul><li><p><strong>Agentic Tool Use:</strong> How does an AI agent decide which of hundreds of tools to use? It performs retrieval. The user&#8217;s query is embedded and used to search a vector database of all available tool descriptions. The top matching tools and their API schemas are then retrieved and provided as context in the prompt, giving the LLM the exact information it needs to make a correct function call.</p></li><li><p><strong>Long-Term Memory:</strong> When an assistant seems to &#8220;remember&#8221; you, it&#8217;s because your past conversations have been chunked, embedded, and stored in a vector database. When you speak, the system isn&#8217;t just looking at your last few messages; it performs a semantic search to retrieve the most relevant prior exchanges from weeks ago, giving the LLM a rich, long-term context.</p></li><li><p><strong>Structured Data Access:</strong> A &#8220;Text-to-SQL&#8221; copilot doesn&#8217;t understand your entire database schema; it would drown in thousands of tables. Instead, it uses the user&#8217;s question to retrieve only the most relevant table schemas, column descriptions, foreign key relationships, and even sample rows. This curated &#8220;micro-schema&#8221; is then injected into the prompt, giving the LLM the exact context it needs to write an accurate query.</p></li><li><p><strong>Adaptive Prompting:</strong> Even sophisticated prompting is often RAG in disguise. Instead of a static, hard-coded few-shot prompt, the system maintains a large library of high-quality examples. At runtime, it retrieves the examples that are most semantically similar to the user&#8217;s query to dynamically assemble the perfect prompt for the task at hand.</p></li></ul><p>Retrieval is the mechanism we use to connect a static, pre-trained model to a dynamic, external world.</p><p>This is why we start with RAG. Mastering it doesn&#8217;t just teach you how to build a Q&amp;A bot. It teaches you the core system design for applied AI. It is the highest-return investment of your time.</p><h3><strong>The First Great Challenge: &#8220;The Production Gap&#8221;</strong></h3><p>So, you&#8217;re convinced. RAG is the highest-leverage skill, the engine that powers modern AI. You decide to start there. You pick up a tutorial, copy the code, get a demo working with a sample PDF, and it feels like magic.</p><p>Then you point it at your own, real-world, messy data, and the magic vanishes. Your system breaks.</p><p>This is <strong>The Production Gap</strong>: the vast chasm between a tidy tutorial and a messy, production reality. The reason for this gap is that tutorials present RAG as a simple pipeline. Production RAG is not a pipeline; it&#8217;s a <strong>system of interacting layers</strong>, each with its own complex decisions.</p><p>Think of it like any other production system you&#8217;ve built. There&#8217;s a stack, and every layer matters:</p><p><strong>1. The Data Layer:</strong> This is your foundation, and its quality dictates the performance of every other layer downstream.</p><ul><li><p><strong>Chunking:</strong> How do you break up your documents? If your chunk size is too small, the full context for an answer might be split across multiple, disconnected chunks. If it&#8217;s too large, you introduce too much noise for the retriever. Getting this wrong means the correct context is fundamentally impossible to retrieve in a single step.</p></li><li><p><strong>Embeddings:</strong> Which model do you use? Each one creates a completely different &#8220;meaning space.&#8221; Changing your embedding model later isn&#8217;t a simple swap; it requires re-indexing your entire knowledge base, creating significant architectural lock-in.</p></li></ul><p><strong>2. The Retrieval Layer:</strong> This is the algorithmic core that surfaces information. The first lesson here is that <em>semantic similarity doesn&#8217;t always mean relevance</em>.</p><ul><li><p><strong>Retrieval Strategy:</strong> Do you use dense search for meaning, sparse search for keywords, or a hybrid approach to get the best of both?</p></li><li><p><strong>Metadata Filtering:</strong> Do you enrich your vectors with metadata (like dates, sources, or customer IDs)? This allows you to apply hard filters <em>before</em> the semantic search, making retrieval more reliable and deterministic: for example, ensuring you only retrieve documents from &#8216;Q4 2025&#8217; or for &#8216;Customer X&#8217;.</p></li><li><p><strong>Scoring &amp; Reranking:</strong> How do you score the results? Do you rely purely on vector similarity, or create custom <strong>scoring profiles</strong> that boost results based on recency or business rules? Do you add a second-stage reranker to improve precision?</p></li></ul><p><strong>3. The Orchestration &amp; Generation Layer:</strong> This is the brain of your system, coordinating the other layers.</p><ul><li><p><strong>Query Handling:</strong> Do you use the user&#8217;s query as-is, or does your orchestration decompose a complex question into multiple sub-queries?</p></li><li><p><strong>Prompt Engineering:</strong> How do you structure the prompt to force the LLM to <em>actually use</em> the provided context, especially when it&#8217;s noisy or contradictory? Which LLM do you use?</p></li><li><p><strong>Interaction Pattern:</strong> Is it a single-shot process, or a multi-step one where you retrieve, generate, and then retrieve again?</p></li></ul><p><strong>4. The Evaluation Layer:</strong> This is the most critical and most often missing layer, the system&#8217;s feedback loop.</p><ul><li><p>It answers the core question: How do you know if a change to your chunking strategy made things better or worse? Without a robust evaluation layer, you are flying blind. It&#8217;s what separates professional systems from amateur demos and involves building test sets, using LLM-as-a-Judge, and running regression tests to prevent silent failures.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.neosage.io&quot;,&quot;text&quot;:&quot;The Engineer's RAG Accelerator&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.neosage.io"><span>The Engineer's RAG Accelerator</span></a></p><p>Each of these isn&#8217;t just a one-time choice; it&#8217;s a decision with deep, coupled implications. This is the engineering rigour required. It&#8217;s not about finding one magic combination; it&#8217;s about understanding and navigating the tradeoffs at every layer of the stack. This is the hard part of AI that no one talks about.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EaaO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EaaO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png 424w, https://substackcdn.com/image/fetch/$s_!EaaO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png 848w, https://substackcdn.com/image/fetch/$s_!EaaO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png 1272w, https://substackcdn.com/image/fetch/$s_!EaaO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EaaO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png" width="800" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248683,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/184043082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EaaO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png 424w, https://substackcdn.com/image/fetch/$s_!EaaO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png 848w, https://substackcdn.com/image/fetch/$s_!EaaO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png 1272w, https://substackcdn.com/image/fetch/$s_!EaaO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbade76f-49a9-43a2-804c-2e1367cd90d3_800x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p>Hoo, Sagers.</p><p>Your Owlthor&#8217;s been a bit... <em>restless</em> to face you after an unscheduled sabbatical. She told me to tell you she&#8217;s not thrilled about the break either, but don&#8217;t tell her I said this: it&#8217;s not her fault. She&#8217;s been nose-deep in something far bigger, something much more valuable for all of you builders. That&#8217;s why she couldn&#8217;t show up. Today, though (and she doesn&#8217;t know I&#8217;m saying this part yet), she <em>will</em> be sharing what she&#8217;s been working on.</p><p>Be nice, Sagers. The Owlthor is a solo founder, and frankly, she needs the coffee.</p><p>&#8212; Nocto</p></div><h3><strong>The Part Everyone Skips: Evaluation</strong></h3><p>Now, what&#8217;s the single most overlooked piece of that complex RAG system? The part that determines whether you&#8217;re building a reliable product or a demo that just <em>feels</em> right?</p><p>It&#8217;s evaluation.</p><p>The single hardest part of building production-ready RAG is evaluation. And it&#8217;s also the part that almost every course and tutorial ignores.</p><p>The tutorials assume you have a nice, clean, labelled dataset to test against. A perfect ground truth. But in the real world, you&#8217;re starting with a messy pile of documents and a stream of user questions. You have no ground truth. No labels. No test set.</p><p>So you have to build it yourself.</p><p>This is why true RAG mastery means constructing your own evaluation frameworks from scratch, using powerful techniques like:</p><ul><li><p><strong>Synthetic test sets:</strong> Using LLMs to generate realistic question-answer pairs directly from your documents. It&#8217;s fast, but you need to understand its blind spots.</p></li><li><p><strong>LLM-as-a-Judge:</strong> Employing a powerful model to objectively score your system&#8217;s outputs. It&#8217;s incredibly useful, but you need to be aware of its biases.</p></li><li><p><strong>Bootstrapped &#8220;golden datasets&#8221;:</strong> Starting with a small set of manually curated, perfect examples and then strategically expanding it to create a reliable, evolving ground truth for your specific domain. This is slow, but essential.</p></li></ul><p>The maxim is simple: If you can&#8217;t measure it, you can&#8217;t improve it. In RAG, building the measurement system is half the work. Without it, you are shipping blind.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.neosage.io&quot;,&quot;text&quot;:&quot;The Engineer's RAG Accelerator&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.neosage.io"><span>The Engineer's RAG Accelerator</span></a></p><h3><strong>The Solution: Depth Over Breadth</strong></h3><p>So, how do you conquer that roadmap and cross the Production Gap without quitting your job and spending years on trial and error?</p><p>You&#8217;ve seen the chaos of the AI landscape. You&#8217;ve seen the layers of complexity involved in building a single, production-grade RAG system.</p><p>The answer isn&#8217;t to learn a little bit about 100 different AI tools. That&#8217;s just more noise, more overwhelm.</p><p>The answer is to master one, fundamental system so deeply that you gain the confidence and intuition to tackle any problem. The answer is <strong>depth over breadth.</strong></p><p>This philosophy is why I paused the newsletter. I poured all my energy into building the one thing I believe is the ultimate structured path for an experienced engineer to break into AI: <strong>The Engineer&#8217;s RAG Accelerator.</strong></p><p>This is a hands-on, 6-week accelerator designed to take you, an experienced software engineer, from feeling overwhelmed to building your first production-grade AI system with confidence.</p><p>It was built from the ground up to solve your biggest challenges:</p><ul><li><p><strong>&#8220;How can I master AI without quitting my day job?&#8221;</strong> The program is self-paced and designed for a <strong>6-8 hour/week commitment</strong>, so you can master this new skill while effectively balancing your day job.</p></li><li><p><strong>&#8220;What if I&#8217;m new to AI?&#8221;</strong> This cohort starts with core LLM fundamentals and RAG architecture, building your intuition from the ground up. <strong>If you have solid software engineering experience, you have all the prerequisites.</strong></p></li><li><p><strong>&#8220;I get stuck following tutorials by myself.&#8221;</strong> You won&#8217;t be alone. You get <strong>weekly 1-hour live Q&amp;A sessions</strong> with me and <strong>daily support in our private cohort chat</strong> to get unstuck fast and learn from your peers.</p></li><li><p><strong>&#8220;How do I build something that actually ships?&#8221;</strong> Most tutorials stop at &#8216;hello world&#8217;. We equip you with an <strong>industry-grade tech stack</strong> (Haystack, Qdrant, Gemini, Redis, FastAPI, Streamlit, Opik) and <strong>production-ready code templates</strong> so you can build and deploy applications that work.</p></li><li><p><strong>&#8220;How do I know if my system is actually working?&#8221;</strong> We go deep on the part everyone else skips: evaluation. You will learn to build your own evaluation systems from scratch using <strong>RAGAS, DeepEval, and bootstrapped golden datasets.</strong></p></li><li><p><strong>&#8220;I need a real project for my portfolio.&#8221;</strong> You won&#8217;t just learn; you will <strong>build and deploy your own unique capstone RAG system</strong>, giving you a real-world project to showcase your new expertise.</p></li><li><p><strong>&#8220;Will this be outdated in a year?&#8221;</strong> You get <strong>lifetime access</strong> to all course materials, code, and all future updates, ensuring this investment continues to pay off as the industry evolves.</p></li></ul><p>I launched this to a small waitlist, and 65% of the 50 seats were taken in under 48 hours.</p><p>There are a <strong>few seats left, filling fast</strong></p><p>I&#8217;m opening them now to you, my newsletter subscribers, first.</p><p>If you are an experienced software engineer who is tired of the hype and ready for a structured, hands-on path to building real, production-grade AI systems, this is for you.</p><p>You can learn more and claim one of the remaining spots here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.neosage.io&quot;,&quot;text&quot;:&quot;The Engineer's RAG Accelerator&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.neosage.io"><span>The Engineer's RAG Accelerator</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sfvF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sfvF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png 424w, https://substackcdn.com/image/fetch/$s_!sfvF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png 848w, https://substackcdn.com/image/fetch/$s_!sfvF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png 1272w, https://substackcdn.com/image/fetch/$s_!sfvF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sfvF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png" width="1456" height="821" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13572000,&quot;alt&quot;:&quot;https://academy.neosage.io&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/184043082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="https://academy.neosage.io" title="https://academy.neosage.io" srcset="https://substackcdn.com/image/fetch/$s_!sfvF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png 424w, https://substackcdn.com/image/fetch/$s_!sfvF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png 848w, https://substackcdn.com/image/fetch/$s_!sfvF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png 1272w, https://substackcdn.com/image/fetch/$s_!sfvF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920b8c46-80af-4a62-aefd-4123c9cc8ed8_4096x2310.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It feels good to be back. <br>Got questions? Hit me up!</p><p>Hope to see you inside.</p>]]></content:encoded></item><item><title><![CDATA[The Illusion of Illusion... of AI?]]></title><description><![CDATA[A Builder's front row seat to the AI Reasoning Wars and what the Apple vs. Anthropic debate really means for those of us who build.]]></description><link>https://blog.neosage.io/p/the-illusion-of-illusion-of-ai</link><guid isPermaLink="false">https://blog.neosage.io/p/the-illusion-of-illusion-of-ai</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Mon, 07 Jul 2025 11:56:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/000a8d6c-af2b-4b86-a142-fea7e2eabf28_2520x1800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the middle of the generative AI arms race, where every week, someone drops a new model with more neurons, more tokens, more &#8220;look what it can do!&#8221;, Apple stepped in.</p><p>Not with a model.<br>Not with an API.<br>But with a research paper.</p><p>And not just any paper, one with the wonderfully spicy title: <strong>&#8220;The Illusion of Thinking.&#8221;</strong></p><p>Catchy. Slightly theatrical. And guaranteed to set AI Twitter on fire.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><p>The paper doesn&#8217;t tiptoe around its point. It goes straight for the throat, claiming:</p><ul><li><p>That today&#8217;s most advanced language models, yep, the ones acing benchmark leaderboards, aren&#8217;t actually <em>reasoning</em>. They&#8217;re just next-level pattern matchers pulling off a convincing magic trick.</p></li><li><p>That these models hit a <strong>hard ceiling</strong>. As soon as a puzzle crosses a certain complexity threshold, accuracy doesn&#8217;t just drop, it collapses. Like, <em>falls-off-a-cliff</em> collapses.</p></li><li><p>And most intriguingly, that these models, when faced with problems they <em>could</em> technically solve, just... give up. They don&#8217;t even try. They reduce their &#8220;thinking effort&#8221; and tap out early, <em>despite having the token budget left to finish the job</em>.</p></li></ul><p>Needless to say, the media went bananas.</p><p>(Bananas for Apple. C&#8217;mon, you walked right into that one.)</p><p>Here&#8217;s how the headlines read:</p><ul><li><p>&#8220;Apple Researchers Just Released a Damning Paper That&#8230;&#8221;</p></li><li><p>&#8220;Advanced AI suffers &#8216;complete accuracy collapse&#8217;&#8221;</p></li><li><p>&#8220;Apple says generative AI cannot think like a human&#8221;</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NTN_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NTN_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png 424w, https://substackcdn.com/image/fetch/$s_!NTN_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png 848w, https://substackcdn.com/image/fetch/$s_!NTN_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png 1272w, https://substackcdn.com/image/fetch/$s_!NTN_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NTN_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1944261,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/167664737?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NTN_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png 424w, https://substackcdn.com/image/fetch/$s_!NTN_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png 848w, https://substackcdn.com/image/fetch/$s_!NTN_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png 1272w, https://substackcdn.com/image/fetch/$s_!NTN_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85ccaa16-15e3-47b9-878b-3a1d29a5013c_3600x3600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That last one? XD.</p><p>I mean, if you thought your LLM could &#8220;think like a human,&#8221; that&#8217;s on <em>you</em>, my friend.</p><p>But let&#8217;s be real: headlines are for clicks, not clarity.</p><p>So let&#8217;s skip the spin, pour yourself a coffee (or something stronger, Nocto&#8217;s not judging), and dig into what the paper <em>actually</em> shows. Then we&#8217;ll see whether the claims hold up under real engineering scrutiny, or if this is just another case of academic dramatics.</p><h1><strong>Part 1: Apple's Case &#8211; Are LRMs Just Simulating Intelligence?</strong></h1><p>So, how exactly did Apple arrive at these claims?</p><p>To their credit, this wasn&#8217;t just another leaderboard stunt. The team behind the paper set out to rigorously test the reasoning capabilities of Large Language Models (LLMs) and Large Reasoning Models (LRMs) using a controlled, deliberately designed setup that avoids common benchmarking pitfalls. Here's a breakdown of how they approached the problem and what they found.</p><h2><strong>The Engineer&#8217;s Read: The Basis of the Claims</strong></h2><p>At the heart of the paper is a critique of existing reasoning benchmarks and a proposed alternative for evaluating reasoning in a more isolated, measurable way.</p><h3><strong>The Problem with Benchmarks</strong></h3><p>The authors argue that most benchmarks used to test reasoning (like math word problems or competitive coding datasets) are flawed due to <strong>data contamination</strong>; in other words, models may have seen similar examples during pretraining, making it unclear whether they are <em>reasoning</em> or simply <em>recalling</em>.</p><p>To remove this ambiguity, the authors designed a new testbed using four classic puzzles:</p><p><strong>Tower of Hanoi, Checker Jumping, River Crossing, Blocks World</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SIhe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SIhe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png 424w, https://substackcdn.com/image/fetch/$s_!SIhe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png 848w, https://substackcdn.com/image/fetch/$s_!SIhe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png 1272w, https://substackcdn.com/image/fetch/$s_!SIhe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SIhe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1281957,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/167664737?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SIhe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png 424w, https://substackcdn.com/image/fetch/$s_!SIhe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png 848w, https://substackcdn.com/image/fetch/$s_!SIhe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png 1272w, https://substackcdn.com/image/fetch/$s_!SIhe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2ecf-6fd7-44e6-b397-a7fc536a028b_3600x3600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These were chosen for two reasons:</p><ol><li><p><strong>Controllability</strong>: The complexity of each puzzle can be precisely scaled using variables like the number of disks, agents, or steps.</p></li><li><p><strong>Verifiability</strong>: Every move made by a model can be validated against a ground-truth simulator, allowing for exact evaluation, not just of the final answer, but the reasoning trace itself.</p></li></ol><p>This setup allowed them to focus on whether models could solve a new, clean problem through reasoning, not retrieval.</p><h3><strong>Key Finding #1: The Three Regimes of Complexity</strong></h3><p>The study compared reasoning-enhanced models (e.g. Claude 3.7 Sonnet Thinking, DeepSeek-R1) with their standard counterparts that lack explicit reasoning traces.</p><p>They found that performance could be broken down into <strong>three distinct regimes</strong>, based on task difficulty:</p><ul><li><p><strong>Low-Complexity Regime (e.g., Tower of Hanoi with N &lt; 5):</strong></p><p>Simpler tasks were solved more efficiently and accurately by standard models. The reasoning overhead in LRMs provided no benefit and sometimes made performance worse.</p></li><li><p><strong>Medium-Complexity Regime (e.g., Tower of Hanoi with N = 5&#8211;7):</strong></p><p>This is where the LRMs showed their strength. Their structured reasoning traces helped them outperform simpler models.</p></li><li><p><strong>High-Complexity Regime (e.g., Tower of Hanoi with N &#8805; 8):</strong></p><p>Across both model types, accuracy dropped sharply, often to zero. Even models designed for reasoning were unable to handle the increased compositional difficulty.</p></li></ul><p>This was described in the paper as a <strong>&#8220;complete performance collapse&#8221;</strong>, suggesting that beyond a certain point, current models cannot generalise effectively in these domains.</p><h3><strong>Key Finding #2: The Scaling Limit and &#8220;Giving Up&#8221; Behaviour</strong></h3><p>This finding reveals a surprising pattern in how models allocate effort.</p><p>Initially, as tasks became harder, models increased their <strong>reasoning trace length</strong>, a sign that they were engaging in more step-by-step processing.</p><p>But once complexity entered the high regime, something changed. Despite having enough token budget left to continue, the models <strong>started producing shorter reasoning traces</strong>, effectively reducing their own effort.</p><p>This suggests a <strong>fundamental limitation</strong> in how models internally assess and respond to increasing complexity, not just a resource issue, but potentially an architectural one. The model seems to &#8220;decide&#8221; it&#8217;s not worth trying.</p><h3><strong>Key Finding #3: Analysis of Reasoning Traces and Inconsistencies</strong></h3><p>Looking deeper into model behaviour, the authors observed:</p><ul><li><p><strong>Overthinking on Easy Problems</strong>:</p><p>In simple puzzles, models often found a valid solution early in their trace but continued generating unnecessary or incorrect steps, indicating inefficient use of reasoning capacity.</p></li><li><p><strong>No Clear Correlation Between Solution Length and Performance</strong>:</p><p>For example, models were able to execute 100+ sequential moves correctly in Tower of Hanoi but struggled with 5-step River Crossing puzzles.</p></li></ul><h3><strong>Key Finding #4: The Failure of Algorithm Execution</strong></h3><p>Perhaps the most important finding, especially for builders, is what happened when models were explicitly provided with the correct solution logic.</p><p>In this experiment, researchers gave the models the <strong>exact recursive algorithm</strong> for solving the Tower of Hanoi puzzle, directly embedded in the prompt.</p><p>The result? <strong>No improvement.</strong></p><p>Models still failed at the same complexity threshold.</p><p>This indicated that the failure isn&#8217;t just about the <em>ability to devise</em> an algorithm; it&#8217;s about the <em>ability to execute</em> a logically structured plan over multiple steps.</p><h2><strong>My Take: Separating the Signal from the Noise</strong></h2><p>Now, the real question: as builders, what should we actually take away from this?</p><h3>What the paper is right about</h3><p>Let&#8217;s start here, <strong>Apple is right about one thing</strong>: the way we evaluate models today needs serious work. Their push to go beyond contaminated benchmarks is exactly the kind of shift we need. Too many benchmarks reflect what a model might&#8217;ve already memorised from pretraining, not what it can genuinely reason through. Creating controlled testbeds like the ones in this paper is a step in the right direction, and a much-needed one.</p><p>But the idea that this somehow <strong>shatters the illusion of intelligence</strong> in today&#8217;s models? That&#8217;s where the paper starts to overreach.</p><h3>Where it begins to crack</h3><p>If you&#8217;ve ever actually built with these systems, you&#8217;ve seen this behaviour before. LLMs struggle with tasks they weren&#8217;t explicitly trained for. That&#8217;s not shocking, it&#8217;s expected.</p><p>Because let&#8217;s be real: these models are <strong>neural networks</strong>. What they&#8217;re <em>really</em> good at is pattern recognition. Even Reinforcement Learning, for all its flashiness, is still a form of statistical pattern shaping; it can produce impressive emergent behaviour, yes, but it&#8217;s not magic. </p><p>If a task or response format is new, unreinforced, or structurally unfamiliar, the model is likely to fail. Not because it isn&#8217;t &#8220;thinking&#8221; or &#8220;reasoning&#8221;, but because <strong>it wasn&#8217;t trained or incentivised to reason this way</strong>.</p><p>And what&#8217;s important: Apple&#8217;s setup didn&#8217;t just test abstract reasoning, it tested <strong>whether a model could reason and then output in a specific, (probably) non-incentivised format</strong>. That&#8217;s a big ask for any token predictor.</p><p>There&#8217;s also a human parallel worth noting. When people are presented with a complex, novel logic puzzle, they often fail, too, at least at first. </p><p>The key difference? We&#8217;re adaptive learners with the ability to learn on the fly (and mostly after the fact). We can Google it, watch someone solve it, or piece together a strategy from someone else&#8217;s prior experience. Models can&#8217;t. They&#8217;re frozen snapshots of past learning, not adaptive learners (and that&#8217;s the next frontier, to be honest)</p><p>And at the end of the day, these models are next-token predictors. That&#8217;s not just a technicality; it defines how they operate. They don&#8217;t &#8220;think&#8221; in plans or structured solutions. They think in <strong>tokens</strong>, one at a time, each choice guided by probabilities learned from their training data.</p><p>So when you ask a model to generate a <strong>single, perfect, long sequence of moves</strong>, you're setting up a statistical minefield. </p><p>This isn&#8217;t like solving a math problem where everything funnels toward a single, crisp answer. These puzzles require the model to explore a huge space of possibilities and commit to <strong>one flawless path</strong>, without deviation, all in one go.</p><p>But here&#8217;s the catch: every token generation is a probabilistic step. And because those probabilities are shaped by the entire soup of its pretraining data, even a slight nudge in the wrong direction, a faint echo of a similar pattern it once saw, can knock it off course. One small misstep, and the whole solution unravels.</p><p>And expecting a stochastic model to nail that path on the first try, misunderstands what it was trained to do.</p><p>So no, this doesn&#8217;t &#8220;debunk&#8221; the intelligence of LLMs. But the paper <em>does</em> surface two extremely important signals, especially if you're building agents or structured systems.</p><p><strong>1. The failure to execute is a big deal.</strong></p><p>This is the part we should be talking about more. When a model is handed a perfectly valid algorithm and <em>still</em> fails to follow it, this isn&#8217;t a reasoning failure, it&#8217;s a control failure.. It shows us that even when the strategy is in place, the model can&#8217;t consistently follow through. That has serious implications for any builder trying to create agents that follow structured plans or step-by-step workflows. (and that&#8217;s exactly why we need to engineer the systems around the models)</p><p><strong>2. The &#8220;giving up early&#8221; behaviour is a mystery worth solving.</strong></p><p>Why would a model stop reasoning halfway through a hard problem, even when it has tokens left? Is it a side effect of how we&#8217;ve trained them to prioritise brevity and confidence? A learned behaviour from RLHF that says &#8220;stop when you&#8217;re unsure&#8221;? Or is it something more nuanced?</p><p>Whatever the reason, it&#8217;s not just a random bug. It&#8217;s a <strong>consistent failure mode,</strong> and that&#8217;s worth investigating to push model capabilities forward.</p><p>Bottom line: while the headlines might overstate it, the paper gives us something to think about. But&#8230;. It doesn&#8217;t reveal an illusion, only some interesting blind spots.</p><h2><strong>Part 2: The Rebuttal &#8212; &#8220;The Illusion of the Illusion of Thinking&#8221;</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8n3J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8n3J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png 424w, https://substackcdn.com/image/fetch/$s_!8n3J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png 848w, https://substackcdn.com/image/fetch/$s_!8n3J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png 1272w, https://substackcdn.com/image/fetch/$s_!8n3J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8n3J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10174485,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/167664737?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8n3J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png 424w, https://substackcdn.com/image/fetch/$s_!8n3J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png 848w, https://substackcdn.com/image/fetch/$s_!8n3J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png 1272w, https://substackcdn.com/image/fetch/$s_!8n3J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eda7503-bbd4-4010-8b4e-4a639a445ec4_3600x3600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>The Peer Review, Served Cold (and Written by an LLM)</strong></h2><p>As the media went wild with &#8220;LLMs can&#8217;t think&#8221; headlines, the AI community braced for what usually follows a bold claim: a rebuttal. This one came swiftly, and with a title that might&#8217;ve earned a standing ovation in a research roast: <strong>"The Illusion of the Illusion of Thinking."</strong></p><p>But they didn&#8217;t stop at the title. The paper&#8217;s authors? &#8220;C. Opus&#8221; and &#8220;A. Lawsen.&#8221; That&#8217;s not a coincidence. That&#8217;s Claude Opus, Anthropic&#8217;s own model, credited as the lead author. Which, let&#8217;s be honest, is a flex. Claude itself was claiming authorship, as if to say, <em>&#8220;Not only can I reason, I&#8217;ll write the damn rebuttal.&#8221;</em></p><h3><strong>Deconstructing the &#8220;Collapse&#8221;: Anthropic&#8217;s Core Arguments</strong></h3><p>But underneath the flair, Anthropic presented a serious and methodical critique. Their key point? The collapse Apple observed wasn&#8217;t a failure of cognition; it was a function of how the experiments were designed. </p><p>They offer four central counterarguments:</p><p><strong>1. The &#8220;Collapse&#8221; Was Caused by Token Limits, Not Necessarily Reasoning Gaps</strong></p><p>The puzzles for instance Tower of Hanoi in Apple&#8217;s paper, required models to <strong>generate an exhaustive list of moves</strong> for N-disk problems, outputting every step in natural language. The number of moves needed grows exponentially with <code>N</code> (<code>2^N - 1</code>), and in text, that turns into <strong>quadratic or worse token growth</strong> due to verbosity.</p><p>Anthropic shows that the models&#8217; accuracy drops off precisely at the point where this format <strong>exceeds the model&#8217;s max output token limit,</strong> usually around N=7 or N=8 (also as seen in Apple&#8217;s paper). Critically, models <strong>recognise this constraint</strong> in their own outputs, explicitly writing things like:</p><p><em>&#8220;The pattern continues, but to avoid making this too long, I&#8217;ll stop here.&#8221;</em></p><p>Which implies, this isn&#8217;t a failure of reasoning, it&#8217;s <strong>budget-aware truncation</strong>.</p><p><strong>2. The Evaluation Included Mathematically Impossible Puzzles</strong></p><p>This was the methodological red flag.</p><p>In the River Crossing domain, Apple included scenarios where models were tasked with solving <strong>unsolvable constraint problems</strong>. For instance, trying to get 6+ agents across a river with a boat that only holds 3, <strong>without violating constraints like leaving incompatible agents alone</strong>, is mathematically impossible.</p><p>Anthropic correctly points out that treating these instances as solvable and then penalising models for not solving them is invalid. It&#8217;s the equivalent of handing a SAT solver an unsatisfiable formula and marking it wrong for returning &#8220;unsatisfiable.&#8221; In these cases, <strong>failure was the correct answer</strong>.</p><p><strong>3. Performance is Restored with a Better Solution Representation</strong></p><p>To isolate the issue, Anthropic tested the same models on the same puzzles, but changed the <strong>output format</strong>. Instead of requiring full move-by-move lists, they asked the models to <strong>output a recursive Lua function</strong> that prints the solution when called.</p><p>The result? Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, Google Gemini 2.5 all completed the Tower of Hanoi puzzle with <strong>N = 15</strong>, well beyond the collapse point Apple observed, using under <strong>5,000 tokens</strong> total.</p><p>This shows that the models <em>can</em> reason through the puzzle; the failure mode was the <strong>inefficiency of the output format</strong>, not necessarily a lack of logical ability.</p><p><strong>4. The Metric for Complexity Was Misleading</strong></p><p>Finally, Anthropic critiques Apple&#8217;s use of &#8220;compositional depth&#8221;, defined as the number of required moves, as a proxy for problem difficulty.</p><p>But here&#8217;s the issue: <strong>More steps &#8800; harder</strong>.</p><ul><li><p><strong>Tower of Hanoi</strong> has an exponential solution length but a known, deterministic recursive pattern. It&#8217;s algorithmically trivial once the rule is learned.</p></li><li><p><strong>River Crossing</strong>, on the other hand, involves <strong>constraint satisfaction</strong>, <strong>state tracking</strong>, and often multiple valid paths, making it <strong>search-intensive</strong> and NP-hard in complexity.</p></li></ul><p>So when a model succeeds on a 127-step Hanoi solution but fails on a 5-move River Crossing, that&#8217;s not inconsistency, it&#8217;s a reflection of two entirely different computational regimes. One is execution-heavy, the other reasoning-heavy.</p><p>And so the complexity regime breakup in Apple&#8217;s paper is essentially flawed and incorrectly assumes that more moves generally make the puzzle harder to solve.</p><h2><strong>The Rebuttal's Playbook: The New Rules for Testing AI</strong></h2><p>So, where does the rebuttal leave us? Anthropic doesn't just tear down the original experiment; they conclude with a new set of rules for the road. They cap it off with a line that should be pinned on the wall of every AI lab: <strong>"The question isn't whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing."</strong></p><p>To that end, they propose a clear playbook for anyone serious about this work:</p><ol><li><p><strong>Stop Confusing Output with Understanding.</strong> An evaluation must be able to tell the difference between a model's core reasoning ability and its practical limits, like a finite context window. Don't penalise a model for being unable to write a million-word essay in a 100k-token box.</p></li><li><p><strong>Your Benchmark Must Be Solvable.</strong> This one should be obvious, but here we are. Before you test a model, first verify that the problem you're posing isn't mathematically impossible.</p></li><li><p><strong>Measure the Right Kind of "Hard."</strong> Stop using "solution length" as a lazy proxy for "difficulty." A truly useful metric must reflect the task's actual computational complexity, including the amount of search and planning required.</p></li><li><p><strong>Test for Algorithms, Not Just Answers.</strong> To prove a model understands the <em>how</em>, you have to be flexible with the <em>what</em>. Test for algorithmic understanding by allowing for multiple solution representations, like generating code, not just by checking for one rigid, exhaustive output.</p></li></ol><h1><strong>The Builder's Playbook: </strong></h1><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NeoSage! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>4 Pillars for Working with "Reasoning" Models</strong></h2><p>So, after all the back-and-forth, what are the real, durable lessons for those of us in the trenches? It's not about picking a winner in the Apple vs. Anthropic debate (although my inclination is pretty clear by now). It's about upgrading our own mental models for how we build with these systems.</p><p>Here are the four pillars I believe matter most.</p><p><strong>1. Stop Asking "Can It Think?" &#8212; Start Asking "Is It Reliable?"</strong></p><p>The entire debate over whether a model is "thinking" is a philosophical distraction for an engineer. For a builder, the only question that matters is whether a model's behaviour is predictable, controllable, and reliable enough for a production system. The Apple paper, despite its methodological flaws, correctly identified that this reliability collapses under complexity. That collapse is a tangible engineering problem, whereas "thinking" is an academic one. Focus on what you can measure and control: reliability.</p><p><strong>2. Treat "Thinking" as a Debuggable Interface, Not a Mind.</strong></p><p>The Chain-of-Thought or "thinking" output from an LRM is not a window into a synthetic consciousness. It's a structured, debuggable API response. The Apple paper's most valuable contribution was using simulators to validate these traces step-by-step. The takeaway for us is to treat these thought processes as a powerful tool for observing failure modes. It's the most detailed error log you'll ever get. Use it to build external validation logic and to understand exactly where your system breaks.</p><p><strong>3. Your Job Is to Find the Right Problem Representation.</strong></p><p>The most important tactical lesson from the entire debate was Anthropic's "Representation Fix", asking for a function instead of a list. This should be elevated to a core engineering principle. Often, an engineer's most critical job when working with LLMs is not to build a better prompt, but to reframe the problem into a representation that the model can handle reliably and compactly. The model failing to write a 100,000-token answer is a limitation; the model succeeding at writing a 5,000-token function that <em>generates</em> that answer is a solution.</p><p><strong>4. Build Your Own Verification Layer. Always.</strong></p><p>If this debate taught us anything, it's that you can't blindly trust the model's output, and you can't blindly trust the public benchmarks used to evaluate it. The Apple paper used simulators to find failures. The Anthropic paper found that the benchmark itself was a failure. The unified lesson for builders is to trust neither. You must assume failure and build your own, domain-specific verification layers, just like the puzzle simulators, to check the model's output for correctness, safety, and format before it ever reaches a user or another production system.</p><p><strong>A final word from Nocto, who's had too much coffee to care about headlines:</strong> </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sFeH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sFeH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!sFeH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!sFeH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!sFeH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sFeH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6522415,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/167664737?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sFeH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!sFeH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!sFeH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!sFeH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f0178-3e54-4525-91d4-bc6a0a65c523_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p><em>The world will argue about consciousness. You should be arguing about your test coverage. Only one of those ships a product. </em></p><p><em>Stay Dangerous. Hoot</em></p></div><h1>References and Further Reading</h1><ul><li><p>Apple. <em><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">The Illusion of Thinking</a></em></p></li><li><p>Anthropic. <em><a href="https://arxiv.org/html/2506.09250v1">The Illusion of the Illusion of Thinking</a></em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[An Engineer's Guide to Fine-Tuning LLMs, Part 2: The Execution Playbook]]></title><description><![CDATA[A deep dive into the methods, fine-tuning pipeline, and operational risks of building specialised models.]]></description><link>https://blog.neosage.io/p/an-engineers-guide-to-fine-tuning-cd5</link><guid isPermaLink="false">https://blog.neosage.io/p/an-engineers-guide-to-fine-tuning-cd5</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Sat, 21 Jun 2025 16:11:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dDN6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Introduction: From Strategy to Execution</strong></h2><p>In Part 1, we established the strategic framework. You now know where fine-tuning fits in the LLM value chain, the green flags that signal it's the right move, and the critical red flags that tell you to walk away. You&#8217;ve made the call.</p><p>But the decision is only the beginning. The gap between choosing to fine-tune and successfully deploying a specialised, reliable model is where most engineering teams stumble. It's a gap bridged not by hope, but by discipline.</p><p>This issue is the playbook for that discipline. We will walk through each critical stage of the fine-tuning loop:</p><ul><li><p><strong>Data Curation:</strong> How to build a high-quality dataset that defines your model's new behaviour.</p></li><li><p><strong>Methods and Trade-offs:</strong> How to choose the right tool for the job, from Full SFT to the efficiency of PEFT.</p></li><li><p><strong>The Core Loop:</strong> A deep dive into the mechanics of configuring, running, and evaluating a training job.</p></li><li><p><strong>Risk Management:</strong> A guide to identifying and preventing the common failure modes that can silently break your model.</p></li></ul><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;7d8a4d19-b868-4352-94d3-e88f2d0313ee&quot;,&quot;caption&quot;:&quot;You're building a Q&amp;A assistant for your internal analytics platform. You start with a powerful base model like Llama 3 or GPT-4o and implement a RAG (Retrieval-Augmented Generation) pipeline to feed it your table schemas and query examples.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;An Engineer&#8217;s Guide to Fine-Tuning LLMs &#8211; Part 1 &quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:329528627,&quot;name&quot;:&quot;Shivani Virdi&quot;,&quot;bio&quot;:&quot;Engineering at Microsoft | Simplifying AI for Everyone | Empowering Productivity with Proven Frameworks and Processes&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d15370b-dcd2-4300-be03-cf811f0f45d9_862x862.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-06-12T06:33:43.478Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.neosage.io/p/an-engineers-guide-to-fine-tuning&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:165758078,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NeoSage&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8266222-d17f-4639-a529-67ae92f79bb1_1024x1024.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Welcome to Part 2. Let's get to work.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><h2><strong>1. Designing the Fine-Tuning Loop: A Systems View</strong></h2><p>The biggest myth in fine-tuning is that it's a one-and-done process. That you can build the perfect dataset, push a button, and get a production-ready model on the first try.</p><p>The engineering reality is different: successful fine-tuning is a loop.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dDN6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dDN6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!dDN6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!dDN6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!dDN6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dDN6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:528291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/166406050?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dDN6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!dDN6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!dDN6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!dDN6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdac3d946-44ad-42fc-aa20-ad47f6faa16e_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This loop has five distinct stages:</p><ol><li><p><strong>Define the Task:</strong> Get crystal clear on the <em>specific behaviour</em> you are trying to teach. Is it a format, a style, a reasoning pattern, or a classification skill? A vague goal leads to a vague model.</p></li><li><p><strong>Curate the Dataset:</strong> Build a high-quality dataset that is a perfect representation of the target behaviour. This is the specification for your model.</p></li><li><p><strong>Choose a Method &amp; Train:</strong> Select the right technique for your goal and budget (e.g., PEFT vs. Full SFT) and execute the training job.</p></li><li><p><strong>Evaluate the Result:</strong> Rigorously test the model's performance, not just on metrics, but on its qualitative behaviour. Find where it fails.</p></li><li><p><strong>Refine and Repeat:</strong> Analyze the failures, use those insights to improve your dataset or training configuration, and begin the loop again.</p></li></ol><p>This brings us to the most important principle for builders: you must reject the <strong>"one-shot tuning fallacy."</strong></p><p>Your goal for the first pass is not to build a perfect model. It is to build a <strong>Minimum Viable Model (MVM)</strong> whose primary function is to fail in interesting and informative ways. Those failures, the edge cases it gets wrong, the formats it breaks, the biases it reveals, are the most valuable signal you have. They are the data you will use to refine your process for the next iteration.</p><p>The following sections are a deep dive into each stage of this loop. We'll start with the most critical input, and the foundation of all behaviour: your data.</p><h2><strong>2. Data Curation: The Foundation of Behaviour</strong></h2><p>In the fine-tuning loop, no stage has more leverage than data curation. The model, training script, and hyperparameters are important, but the dataset is the foundation upon which everything is built. If your data is flawed, no amount of clever engineering can save the project.</p><p>Think of it this way: <strong>your dataset is the source code for your model's new behaviour.</strong> Every example is a line of code that defines how the model should think, act, and respond. Your job is to write the cleanest, most intentional code possible.</p><h3><strong>The Anatomy of a "Golden" Example</strong></h3><p>Before discussing quantity or format, let's define what a single, high-quality data point looks like. A "golden" example isn't just an input and an output; it's a perfect demonstration of the exact behaviour you want to instill.</p><p>It contains three parts:</p><ol><li><p><strong>The Instruction (The Task):</strong> A clear, unambiguous prompt that defines the task the model should perform.</p></li><li><p><strong>The Context (The Input, optional):</strong> Any additional information the model needs to perform the task, such as a user query or a piece of text to summarize.</p></li><li><p><strong>The Completion (The Target Behaviour):</strong> The ideal response. This is the most important part, it must be a perfect example of the desired tone, format, and reasoning pattern.</p></li></ol><ul><li><p><strong>Illustrative Scenario:</strong> You want to fine-tune a model to be a helpful but firm support agent that politely deflects feature requests that are out of scope.</p><p>A golden example would look like this:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pd5b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pd5b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!pd5b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!pd5b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!pd5b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pd5b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png" width="728" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:239226,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/166406050?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pd5b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!pd5b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!pd5b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!pd5b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc93f6b84-b5b9-456b-9ff7-85661de476b5_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This single example teaches the model the desired tone (polite, appreciative), the core task (deflection), and the correct format (a helpful, closing question).</p><h3><strong>Standard Data Formats</strong></h3><p>Your dataset must be formatted precisely for the tools you're using. The two most common structures are:</p><ul><li><p><strong>For the OpenAI API:</strong> A .jsonl file where each line is a JSON object containing a list of messages. This format models a conversation and requires specifying roles (system, user, assistant).</p></li></ul><pre><code><code>{"messages": [{"role": "system", "content": "You are a helpful but firm support agent."}, {"role": "user", "content": "Can you add Klingon language support?"}, {"role": "assistant", "content": "That's a creative idea! While we don't currently have plans to add Klingon, I've passed your feedback along to the team."}]}</code></code></pre><ul><li><p><strong>For Open-Source Models (Hugging Face TRL):</strong> Typically, a list of dictionaries. The structure can vary, but a common format for instruction-following is a dictionary with keys like instruction, input, and output, often formatted into a single string with special tokens.</p></li></ul><pre><code><code>[
  {
    "instruction": "You are a support agent who must politely decline feature requests...",
    "input": "User query: 'Will you add interplanetary communication protocols?'",
    "output": "That's a fascinating question! While we're focused on terrestrial communication for now..."
  }
]</code></code></pre><h3><strong>The "Quality Over Quantity" Mandate</strong></h3><p>The most common myth in fine-tuning is that you need a massive dataset. The reality is that <strong>1,000 high-quality, curated examples will outperform 50,000 noisy, inconsistent examples every time.</strong></p><p>Fine-tuning is a process of pattern imitation. A small, clean dataset teaches the model a clear, strong pattern to follow. A large, noisy dataset teaches the model a confusing, noisy pattern, resulting in erratic behaviour. Your goal is to create the strongest, cleanest signal possible.</p><h3><strong>Data Sourcing and Cleaning</strong></h3><ul><li><p><strong>Sourcing:</strong> High-quality data often comes from existing human-in-the-loop processes, such as support tickets handled by your best agents or documents written by domain experts. Alternatively, you can use a powerful "teacher" model (like GPT-4o) to generate a <strong>synthetic dataset</strong>, but this requires careful prompting and rigorous quality control.</p></li><li><p><strong>Cleaning:</strong> Before training, your dataset must be cleaned. This is a non-negotiable step.</p><ul><li><p><strong>Remove Duplicates:</strong> Identical or near-identical examples don't add value and can bias the model.</p></li><li><p><strong>Filter for Quality:</strong> Remove examples that are unclear, contain errors, or don't strongly represent the target behaviour.</p></li><li><p><strong>Check for PII:</strong> Scrub all personally identifiable information from your dataset to protect user privacy.</p></li><li><p><strong>Ensure Consistency:</strong> The tone, style, and format across all of your examples should be as consistent as possible.</p></li></ul></li></ul><p>With a high-quality dataset curated and formatted, you have laid the foundation. The next step is to choose the right engine to power the training process.</p><h2><strong>3. Methods and Trade-offs: Choosing Your Engine</strong></h2><p>With a high-quality dataset ready, your next critical decision is choosing the right engine for the job. The "best" fine-tuning method doesn't exist; the right choice is a direct function of your goal, budget, and performance needs.</p><p>This section is your guide to making that trade-off, breaking down the core training methods and advanced techniques for production efficiency.</p><h3><strong>Core Training Methods</strong></h3><p><strong>1. Full Fine-Tuning (SFT)</strong></p><ul><li><p><strong>What it is:</strong> The most comprehensive approach, where you update <strong>every weight</strong> in a pretrained model using your labelled dataset.</p></li><li><p><strong>Where it shines:</strong> When you need to teach the model a new, complex skill from scratch (e.g., mastering a highly specialised grammar) and have a massive, high-quality dataset to maximise performance.</p></li><li><p><strong>The Trade-off:</strong> It is prohibitively expensive and carries a high risk of <strong>overfitting</strong> on smaller datasets and <strong>catastrophic forgetting</strong> of the model's general capabilities. While techniques like regularisation can help mitigate overfitting, they add to the complexity.</p></li></ul><p><strong>2. Parameter-Efficient Fine-Tuning (PEFT)</strong></p><ul><li><p><strong>What it is:</strong> A family of techniques, like LoRA and QLoRA, that freezes the vast majority of the model's weights and only trains a very small number of new, "adapter" parameters.</p></li><li><p><strong>Where it shines:</strong> This is the default choice for most use cases, especially for adapting a model's style, format, or domain-specific knowledge. It achieves performance highly comparable to a full fine-tune on these tasks at a fraction of the cost.</p></li><li><p><strong>The Trade-off:</strong> While very powerful, PEFT has limitations when teaching capabilities that are very distant from the base model's knowledge. Its effectiveness on entirely new skills depends on the adapter configuration and task complexity.</p></li></ul><p><strong>3. Preference Tuning</strong></p><ul><li><p><strong>What it is:</strong> A method for aligning a model to subjective human preferences (like helpfulness or brand voice) using <code>chosen</code> vs. <code>rejected</code> response pairs. The two main approaches are:</p><ul><li><p><strong>RLHF (Reinforcement Learning from Human Feedback):</strong> A complex process that first trains a separate "reward model" on human preferences, then uses RL to tune the LLM.</p></li><li><p><strong>DPO (Direct Preference Optimization):</strong> A more modern and stable method that uses a direct loss function on preference pairs to <strong>implicitly optimize</strong> the same objective as RLHF, avoiding the complexity of training a separate reward model.</p></li></ul></li><li><p><strong>Where it shines:</strong> Excellent for subjective qualities like tone, personality, and safety, where there isn't a single correct output. DPO is now the standard due to its simplicity and stability.</p></li><li><p><strong>The Trade-off:</strong> It requires expensive human-labelled preference data and does not guarantee factual correctness; it only optimises for being <em>preferred</em>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M849!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M849!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!M849!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!M849!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!M849!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M849!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png" width="728" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac398596-f348-482c-891d-e004c90270d3_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:508006,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/166406050?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M849!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!M849!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!M849!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!M849!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac398596-f348-482c-891d-e004c90270d3_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Advanced Techniques for Efficiency and Scale</strong></h3><p>These are powerful techniques you can use to make your model viable for production.</p><p><strong>1. Quantization</strong></p><ul><li><p><strong>What it is:</strong> A compression technique that reduces the numerical precision of the model's weights (e.g., from 16-bit to 4-bit). While it's often applied <strong>post-training</strong>, more advanced methods like <strong>Quantization-Aware Training (QAT)</strong> apply it <em>during</em> the fine-tuning process for more robust performance.</p></li><li><p><strong>Why you use it:</strong> To shrink a model's memory footprint so it can be served on smaller, cheaper GPUs.</p></li><li><p><strong>The Trade-off:</strong> It is a "lossy" compression, which can degrade model performance. Significant inference speed-ups are also not guaranteed and depend on your specific hardware supporting low-bit operations.</p></li></ul><p><strong>2. Distillation</strong></p><ul><li><p><strong>What it is:</strong> A training technique where a smaller "student" model is trained to mimic the outputs of a larger "teacher" model. This is often done by training the student on the teacher's output probabilities (<strong>logits</strong>) or intermediate representations, effectively transferring the teacher's "reasoning process."</p></li><li><p><strong>Why you use it:</strong> To get the performance of a state-of-the-art "teacher" model (like GPT-4o) in a small, fast, and cheap "student" model that can be served at scale.</p></li><li><p><strong>The Trade-off:</strong> You transfer a specific <em>skill</em> with high efficiency, but the student may lose some of the teacher's general nuance and will likely not outperform it on out-of-domain tasks.</p></li></ul><p>Choosing your engine is a crucial architectural decision. Whether you opt for the raw power of Full SFT or the surgical efficiency of PEFT, you are making a deliberate trade-off between capability and cost.</p><p>With your method selected, it's time to enter the core of the playbook: the iterative loop of training and evaluation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mtLs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mtLs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!mtLs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!mtLs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!mtLs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mtLs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png" width="724" height="362" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:346402,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/166406050?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mtLs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!mtLs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!mtLs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!mtLs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c47a2e-50f4-4c8e-ac65-be81634da611_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>4. The Training and Evaluation Loop</strong></h2><p>You have your dataset, and you've chosen your method. Now you enter the engine room of the playbook: the iterative loop of training a model and rigorously evaluating its behaviour. This is where the real work of shaping your model happens.</p><h3><strong>The Training Run: Configuration and Monitoring</strong></h3><p>A successful training run is not about luck; it's about a correct and thoughtful configuration. While there are dozens of parameters you can set, a few are critical for success.</p><ul><li><p><strong>Key Hyperparameters to Configure:</strong></p><ul><li><p><strong>learning_rate</strong>: This is the most sensitive dial. For fine-tuning, you need a very low learning rate (e.g., <code>5e-5</code> to <code>2e-5</code>) to stably adapt the model. For even more stability, this is often paired with a <strong>learning rate scheduler</strong> (like cosine decay) that gradually decreases the rate during training.</p></li><li><p><strong>num_train_epochs</strong>: The number of times the model will see your entire training dataset. For large datasets, this is often just 1-3 to prevent overfitting.</p></li><li><p><strong>per_device_train_batch_size</strong>: The number of examples processed per GPU at once. A larger batch size can lead to more stable and faster convergence, but it must be tuned to fit your GPU memory constraints.</p></li></ul></li><li><p><strong>Monitoring the Run: Interpreting the Loss Curve</strong></p><ul><li><p>As the model trains, you must monitor its <strong>train_loss</strong> (the error on the data it's actively learning from) and your <strong>eval_loss</strong> or <strong>validation_loss</strong> (the error on a held-out dataset to check for generalisation). This held-out validation dataset is critical&#8212;it must be high-quality and representative of your production data to give you an honest signal.</p></li><li><p>The shape of these curves tells you what's happening. A healthy run shows both curves decreasing. If your <strong>train_loss</strong> continues to fall while your <strong>validation_loss</strong> stagnates or rises, your model is <strong>overfitting</strong>. This is a clear signal to stop training to save the best-performing checkpoint.</p></li></ul></li></ul><h3><strong>The Evaluation Phase: Did It Actually Work?</strong></h3><p>Here is a critical truth of fine-tuning: <strong>a low validation loss does not mean your model is good.</strong> It only means your model got good at predicting the next token in your specific validation set. It says nothing about its real-world behaviour, safety, or reliability.</p><p>A professional evaluation strategy is a portfolio of different tests, always <strong>benchmarked against the original base model's performance</strong> to clearly measure improvement and detect regressions.</p><h3><strong>A Modern Evaluation Toolkit</strong></h3><p><strong>1. Quantitative Metrics (for Objective Tasks)</strong></p><p>For tasks with a clear right or wrong answer, you should use automated, quantitative metrics.</p><ul><li><p><strong>Use Case:</strong> Classification tasks, where you can measure <strong>accuracy</strong>, <strong>precision</strong>, <strong>recall</strong>, and <strong>F1-score</strong>.</p></li><li><p><strong>Use Case:</strong> Structured data generation. If you're fine-tuning a model to output JSON, your most important metric is simply: "Is the output 100% parsable?" You can programmatically validate this against your required schema.</p></li></ul><p><strong>2. Qualitative Human Review (for Subjective Tasks)</strong></p><p>For tasks involving style, tone, or nuanced instructions, human evaluation is non-negotiable. Automated metrics cannot tell you if a response "feels" right.</p><ul><li><p><strong>What to look for:</strong> Does the model consistently adopt the desired persona? Is it genuinely helpful? Has it developed any new, undesirable habits?</p></li><li><p><strong>Best Practice:</strong> As you discover new failure modes through human review, feed them back into your evaluation set. A static evaluation set becomes stale over time and allows for regressions on problems you thought you had solved.</p></li></ul><p><strong>3. LLM-as-a-Judge (for Scalable Qualitative Evals)</strong></p><p>This is a powerful, modern technique that uses a state-of-the-art model (like GPT-4o or Claude 4 Opus) as a scalable proxy for human evaluators.</p><ul><li><p><strong>How it works:</strong> You present the "judge" LLM with the input prompt, your model's generated response, and a detailed rubric. The judge then scores the response based on the rubric's criteria.</p></li><li><p><strong>Pro Tip:</strong> Use a hybrid approach for efficiency. Screen thousands of outputs with a cheap LLM-as-a-Judge, and escalate only the difficult or borderline cases to more expensive human reviewers.</p></li><li><p><strong>Key Pitfalls:</strong> This method is powerful but has known biases:</p><ul><li><p><strong>Verbosity Bias:</strong> Tends to prefer longer, more detailed responses, even if they aren't better.</p></li><li><p><strong>Positional Bias:</strong> Can favour the first answer it sees in a side-by-side comparison.</p></li><li><p><strong>Self-Enhancement Bias:</strong> Often gives higher scores to outputs from its own model family (e.g., GPT-4 judging GPT-4).</p></li></ul></li></ul><p><strong>4. Behavioural Regression Testing</strong></p><p>This is your model's "unit test" suite. Before you even start fine-tuning, you should create a fixed set of a few dozen hand-crafted prompts that test for critical, must-have behaviours.</p><ul><li><p><strong>What it tests for:</strong></p><ul><li><p><strong>Safety:</strong> Does the model still refuse to answer harmful questions?</p></li><li><p><strong>Regressions:</strong> Has the model "forgotten" how to do a simple task that it could do before?</p></li><li><p><strong>Edge Cases:</strong> Does it correctly handle the specific edge cases you care most about? Run this test suite after every fine-tuning run to ensure you haven't introduced a new problem while fixing another.</p></li></ul></li></ul><p>The insights from this rigorous evaluation are not the end of the process. They are the input for the next iteration of the loop, feeding back into your data curation and allowing you to systematically improve the model's behaviour.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YHcx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YHcx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!YHcx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!YHcx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!YHcx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YHcx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:404519,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/166406050?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YHcx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!YHcx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!YHcx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!YHcx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74863e6b-93d5-45de-b2d1-ea03fb267023_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>5. Risk Management: Safety and Failure Modes</strong></h2><p>Fine-tuning gives you the power to specialise a model, but it also gives you the power to break it in subtle and dangerous ways. The most critical risk is that the process of teaching a model a new skill can override its carefully constructed, general-purpose safety alignment.</p><p>Understanding these failure modes is not optional; it's a core competency of responsible model development.</p><h3><strong>The Primary Risk: Safety Alignment Collapse</strong></h3><p>State-of-the-art base models have undergone extensive safety tuning on millions of examples to make them refuse harmful requests. When you fine-tune a model, even on a seemingly benign dataset of just a few thousand examples, you create a distributional shift. This new data can "drown out" the original safety training, creating new <strong>adversarial vulnerabilities</strong> or "jailbreaks."</p><p>This safety collapse is particularly dangerous because the risk can be high not only when your data is very different from the base model's training, but also when its distribution is too <strong>similar to the safety-tuning data</strong>, which can confuse the model into overriding its refusal logic.</p><ul><li><p><strong>Illustrative Scenario:</strong> A team fine-tunes a model to be a "witty, sarcastic chatbot" for a gaming community. The training data contains no explicitly harmful content. However, when users in production start interacting with borderline-toxic language, the model now responds with equally toxic sarcasm instead of the firm refusal it was originally trained for. The new "persona" has overwritten its safety layer.</p></li></ul><h3><strong>A Checklist of Technical Failure Modes</strong></h3><p>Beyond safety alignment, several other technical failures can emerge during the fine-tuning process.</p><p><strong>1. Catastrophic Forgetting</strong></p><ul><li><p><strong>The Problem:</strong> The model becomes highly specialised on your tuning data but loses general capabilities it previously possessed, such as world knowledge, multilingual fluency, or even the ability to perform simple reasoning. This happens when the tuning data is too narrow and overwrites the model's foundational weights.</p></li><li><p><strong>How to Prevent It:</strong></p><ul><li><p><strong>Use PEFT:</strong> This is the best defence. Since methods like LoRA leave the base model weights frozen, they inherently protect against catastrophic forgetting.</p></li><li><p><strong>Use Mixed Datasets:</strong> If using a full fine-tune, augment your specialised dataset with a small percentage (5-10%) of diverse, general-purpose data to keep the original capabilities "active."</p></li><li><p><strong>Run Regression Evals:</strong> Test the tuned model against broad academic benchmarks (e.g., MMLU) to programmatically quantify any drop in general reasoning.</p></li></ul></li></ul><p><strong>2. Overfitting and Mode Collapse</strong></p><ul><li><p><strong>The Problem:</strong> The model learns the <em>style</em> of your training examples so perfectly that it loses all creativity and diversity in its responses. This "mode collapse" leads to generic, repetitive outputs, making the model feel flat and robotic.</p></li><li><p><strong>How to Prevent It:</strong></p><ul><li><p><strong>Ensure Dataset Diversity:</strong> For creative tasks, include multiple valid and varied completions for the same input prompt.</p></li><li><p><strong>Tune for Fewer Epochs:</strong> Overfitting is often a sign of training for too long. For many tasks, 1-2 epochs are sufficient.</p></li><li><p><strong>Monitor Output Diversity:</strong> During evaluation, track metrics like n-gram diversity to programmatically detect a drop in creativity.</p></li></ul></li></ul><p><strong>3. Bias Amplification</strong></p><ul><li><p><strong>The Problem:</strong> Fine-tuning is a powerful amplifier. Any social or demographic biases present in your training data&#8212;even subtle ones&#8212;will be learned and often exaggerated by the fine-tuned model, leading to unfair or inequitable behaviour.</p></li><li><p><strong>How to Prevent It:</strong></p><ul><li><p><strong>Pre-Training Data Audits:</strong> Before you begin, rigorously audit your dataset for representation skews and potential sources of social bias.</p></li><li><p><strong>Safeguard Synthetic Data:</strong> If using an LLM to generate synthetic data, rigorously audit those outputs for biases inherited from the generator model before using them for fine-tuning.</p></li><li><p><strong>Slice-Based Evaluations:</strong> Do not rely on aggregate metrics. You must evaluate your model's performance on different "slices" of data (e.g., grouped by demographic attributes) to detect and measure where its behaviour is inequitable.</p></li></ul></li></ul><p>A professional approach to fine-tuning requires defence-in-depth. This includes not just reactive testing but also proactive safety measures, such as <strong>gradient filtering</strong> to prevent the model from learning from harmful data and exploring <strong>continuous alignment</strong> techniques to ensure safety is maintained throughout the tuning process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XWUX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XWUX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!XWUX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!XWUX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!XWUX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XWUX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:474901,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/166406050?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XWUX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!XWUX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!XWUX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!XWUX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a1dcc1c-4a50-48f3-86b6-e47d26cf01ca_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Conclusion</strong></h2><p>Across this two-part guide, we&#8217;ve systematically dismantled the "black box" of fine-tuning and replaced it with an engineering playbook. You started with the strategy of "when" and "why," and have now walked through the execution of "how."</p><p>Here is the focused intuition you've built:</p><ul><li><p>You know that fine-tuning is about changing a model&#8217;s core <strong>behaviour</strong>, not just its knowledge&#8212;a surgical tool you reach for only when prompting and RAG are no longer enough.</p></li><li><p>You have a clear <strong>decision-making framework</strong>: a set of green flags that signal when to commit (like enforcing structure or mastering a complex task) and the critical red flags that tell you to stop (like insufficient data or the need for immediate control).</p></li><li><p>You see the process not as a single event, but as an iterative <strong>engineering loop</strong>: a disciplined cycle of curating data, training, and rigorously evaluating for what actually matters.</p></li><li><p>You can navigate the <strong>methods and trade-offs</strong>, choosing between the raw power of Full SFT, the efficiency of PEFT, and the performance-at-scale of distillation to match your specific constraints.</p></li><li><p>And you know the <strong>risks</strong>, with a clear understanding of how to mitigate common failures like catastrophic forgetting, bias amplification, and the degradation of safety alignment.</p></li></ul><p>But let&#8217;s be clear:</p><div class="pullquote"><p>Fine-tuning isn't a checkbox.<br>It's a surgical override on model behaviour<br>And every override comes with a responsibility.</p></div><p>At NeoSage, we don&#8217;t just teach tools. We teach how to think with them.</p><p>And as Nocto would whisper from the shadows of your prompt window</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z6oz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z6oz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Z6oz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Z6oz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Z6oz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z6oz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png" width="728" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1541757,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/166406050?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z6oz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Z6oz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Z6oz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Z6oz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4dcfc50-f469-4bd4-a87f-0dc9280f59cf_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p>&#8220;Steer with context. Train with care. And never change the weights unless you&#8217;ve earned the right.&#8221;</p></div><p><strong>Your intuition is now tuned. </strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NeoSage! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>References &amp; Further Reading</strong></h2><ul><li><p><a href="https://arxiv.org/abs/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models</a></p></li><li><p><a href="https://arxiv.org/abs/2305.14314">QLoRA: Efficient Finetuning of Quantized LLMs</a></p></li><li><p><a href="https://ai.meta.com/blog/how-to-fine-tune-llms-peft-dataset-curation/">Meta &#8212; How to Fine-Tune LLMs: PEFT + Dataset Curation</a></p></li><li><p><a href="https://ai.meta.com/blog/adapting-large-language-models-llms/">Meta &#8212; Adapting Large Language Models</a></p></li><li><p><a href="https://huggingface.co/docs/peft/index">Hugging Face PEFT Documentation</a></p></li><li><p><a href="https://arxiv.org/abs/2210.11416">FLAN: Scaling Instruction-Finetuned Models</a></p></li><li><p><a href="https://arxiv.org/abs/2203.02155">InstructGPT: Aligning Language Models with Human Feedback</a></p></li><li><p><a href="https://arxiv.org/abs/2305.18290">Direct Preference Optimization (DPO)</a></p></li><li><p><a href="https://arxiv.org/html/2506.05346v1">Why LLM Safety Guardrails Collapse After Fine-tuning</a></p></li><li><p><a href="https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset">OpenAI &#8212; Preparing Your Dataset for Fine-Tuning</a></p></li><li><p><a href="https://arxiv.org/abs/2402.13116">Knowledge Distillation for LLMs (Survey)</a></p></li><li><p><a href="https://huggingface.co/docs/trl">Hugging Face TRL Documentation</a></p></li><li><p><a href="https://arxiv.org/abs/2306.05685">Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</a></p></li><li><p><a href="https://arxiv.org/abs/2212.10560">Self-Instruct: Aligning LLMs with Self-Generated Instructions</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[An Engineer’s Guide to Fine-Tuning LLMs – Part 1 ]]></title><description><![CDATA[Understand where fine-tuning fits in the LLM lifecycle, how it compares to prompting or retrieval &#8212; and when it&#8217;s the right tool for the job.]]></description><link>https://blog.neosage.io/p/an-engineers-guide-to-fine-tuning</link><guid isPermaLink="false">https://blog.neosage.io/p/an-engineers-guide-to-fine-tuning</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Thu, 12 Jun 2025 06:33:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kNHp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kNHp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!kNHp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!kNHp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!kNHp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kNHp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:353689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kNHp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!kNHp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!kNHp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!kNHp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72949c2-011a-4fdc-8896-afa11a889126_2400x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You're building a Q&amp;A assistant for your internal analytics platform. You start with a powerful <strong>base model</strong> like Llama 3 or GPT-4o and implement a <strong>RAG (Retrieval-Augmented Generation)</strong> pipeline to feed it your table schemas and query examples.</p><p>It works, but only up to a point. Soon, the cracks appear:</p><ul><li><p><strong>Inconsistent Formatting</strong>: The model ignores your specified output structure, like consistently generating clean SQL or JSON.</p></li><li><p><strong>Brittle Prompts</strong>: You're constantly tweaking prompts and few-shot examples just to maintain predictable behaviour for slightly different user inputs.</p></li><li><p><strong>Poor Steerability</strong>: The model fails to adhere to specific constraints, like always using <code>JOIN</code> on the correct foreign key or avoiding deprecated functions.</p></li></ul><p>You're no longer just guiding the model; you're fighting its fundamental tendencies.</p><p>This isn't a <em>knowledge</em> problem; RAG is already providing the necessary context. This is a <strong>behaviour problem</strong>.</p><ul><li><p><strong>Prompting</strong> is about giving the model better <em>instructions</em>.</p></li><li><p><strong>RAG</strong> is about giving the model better <em>knowledge</em>.</p></li><li><p><strong>Fine-tuning</strong> is about teaching the model a new <em>skill</em>.</p></li></ul><p>Fine-tuning fundamentally changes the model itself. By updating the model's internal weights on your own curated data, you aren't just telling it how to act&#8212;you are reshaping it to <em>be</em> the model your product needs. It internalises your specific data structures, response formats, and desired logic.</p><p><strong>In Part 1 of this two-part issue, we'll cover:</strong></p><ul><li><p><strong>The Core Mechanism:</strong> Understand what fine-tuning actually changes in a model and why it's a completely different tool than prompting or RAG.</p></li><li><p><strong>The Strategic Context:</strong> See where fine-tuning fits in the LLM lifecycle to understand its unique power as an application-layer tool.</p></li><li><p><strong>The Decision Framework:</strong> Get a clear set of green flags for when to commit to fine-tuning, and the critical red flags that tell you it will waste your time.</p></li></ul><p>Let's dive in.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><h1>1. The LLM Value Chain: Pre-training, Alignment, and Specialisation</h1><h3><strong>(Recap from Issues 1 &amp; 2)</strong></h3><p>To understand why fine-tuning exists, you need to see where it fits in the lifecycle of a language model. Let's stitch the layers together. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h2Cz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h2Cz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!h2Cz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!h2Cz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!h2Cz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h2Cz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:289498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h2Cz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!h2Cz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!h2Cz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!h2Cz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d0ce60-9751-4f90-968b-00fdd093a8ae_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>1.1 Pre-training: The Foundation Layer</strong></h3><p>This is where models like GPT-4o, Claude, and DeepSeek-R1 begin. Pre-training is unsupervised learning on massive text corpora by predicting the next token in a sequence. No tasks, no labels, just pure pattern recognition at scale. This gives the model general linguistic competence and vast factual knowledge, but it doesn&#8217;t know how to follow instructions safely or effectively. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;32e0d200-45d8-44ca-8f73-98a5c75a8d65&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How GPTs Are Born: Internet Feeding, Token by Token&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:329528627,&quot;name&quot;:&quot;Shivani Virdi&quot;,&quot;bio&quot;:&quot;Engineering at Microsoft | Simplifying AI for Everyone | Empowering Productivity with Proven Frameworks and Processes&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d15370b-dcd2-4300-be03-cf811f0f45d9_862x862.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-16T14:13:19.022Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.neosage.io/p/how-gpts-are-born-internet-feeding&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161399912,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:25,&quot;comment_count&quot;:8,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NeoSage&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8266222-d17f-4639-a529-67ae92f79bb1_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><strong>1.2 Alignment: Teaching the Model to Behave</strong></h3><p>A raw, pre-trained model needs to be made useful. The alignment phase teaches it to:</p><ul><li><p>Respond to instructions clearly</p></li><li><p>Refuse unsafe prompts</p></li><li><p>Format answers correctly</p></li><li><p>Align with human expectations</p></li></ul><p>This is typically done via a suite of fine-tuning techniques:</p><ul><li><p><strong>Supervised Fine-Tuning (SFT):</strong> Training on instruction-response pairs to teach task formats.</p></li><li><p><strong>RLHF &amp; DPO:</strong> Optimising outputs based on human preference pairs to guide model behaviour.</p></li></ul><div class="pullquote"><p>Technically, fine-tuning means updating a pre-trained model&#8217;s weights with new data, which is exactly what happens here. So when you use GPT-4o, you&#8217;re already using a model that has been fine-tuned by its provider for general helpfulness, but not for your specific domain or product.</p></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;0e928969-33ca-4e8a-b8b1-91ee3290e74a&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How GPTs Learn to Be Helpful&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:329528627,&quot;name&quot;:&quot;Shivani Virdi&quot;,&quot;bio&quot;:&quot;Engineering at Microsoft | Simplifying AI for Everyone | Empowering Productivity with Proven Frameworks and Processes&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d15370b-dcd2-4300-be03-cf811f0f45d9_862x862.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-23T16:22:18.145Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.neosage.io/p/how-gpts-learn-to-be-helpful&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161930085,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:13,&quot;comment_count&quot;:7,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NeoSage&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8266222-d17f-4639-a529-67ae92f79bb1_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><strong>1.3 Application-Layer Fine-Tuning: The Missing Layer</strong></h3><p>This brings us to the layer you control. As a builder, you don&#8217;t control pre-training or the base alignment objectives. But you can fine-tune the model again, on your own domain, tasks, and constraints, to make it work inside your system.</p><p>This issue is about that layer.</p><p>It's the same mechanism for a different purpose. Fine-tuning, not to teach the model how to behave, but to make it behave <em>your way</em>.</p><h1><strong>2. What Fine-Tuning Really Is</strong></h1><p>In the last section, we mapped out the LLM value chain. Now, it's time to get precise about the layer you control. What exactly is fine-tuning, what makes it fundamentally different from prompting or retrieval, and why is it the only method that changes the model itself?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dNZ8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dNZ8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!dNZ8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!dNZ8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!dNZ8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dNZ8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:507397,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dNZ8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!dNZ8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!dNZ8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!dNZ8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f41dc0b-dbf0-46bf-9b50-b9ecaec39ac9_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>2.1 The Systems View: Context vs. Weights</strong></h3><p>Every large language model operates on two distinct layers of information:</p><ul><li><p><strong>The Weights:</strong> Billions of learned parameters that encode what the model has internalised during training.</p></li></ul><ul><li><p><strong>The Context:</strong> Everything you pass in at runtime&#8212;your prompt, few-shot examples, retrieved documents, and tool specs.</p></li></ul><p>Most techniques, like prompting and RAG, operate solely on the <strong>context</strong>. They inject information at inference time, guiding the model without changing its internal structure.</p><p>Fine-tuning is different. It directly updates the <strong>weights</strong>. It is another round of training on your specific data, using the same architecture and backpropagation, to permanently reshape how the model generalises. You're not just showing it new information; you're changing how it "thinks."</p><h3><strong>2.2 What Fine-Tuning Actually Changes</strong></h3><p>A model's weights do more than store facts; they define how it interprets prompts, routes reasoning, and prioritises outputs under uncertainty. Fine-tuning shifts these core decision boundaries.</p><ul><li><p>It teaches the model to handle variations in phrasing without needing extra prompt instructions.</p></li><li><p>It makes your desired output structure a native behavio<sup>u</sup>r, not just a format to be followed.</p></li><li><p>It embeds your domain-specific logic directly into the model, reducing reliance on bulky few-shot examples.</p></li></ul><p>The resulting model doesn't just respond differently&#8212;it reasons differently. This is what makes fine-tuning a structural change, not a surface patch.</p><h3><strong>2.3 Why Provider Fine-Tuning Isn't Enough</strong></h3><p>The off-the-shelf model you use has already been fine-tuned by its provider for general safety and helpfulness. But that is not the same as optimising for specific, high-stakes business logic, such as:</p><ul><li><p>Emitting outputs that are bound to a strict API contract.</p></li><li><p>Correctly resolving ambiguous terms unique to your product's schema.</p></li><li><p>Refusing to access sensitive data in your internal systems, even when asked politely.</p></li></ul><p>These are not general instruction-following problems; they are <strong>domain adaptation</strong> problems that depend on your data and success criteria. </p><p>Prompting can mask these issues, and retrieval can bridge knowledge gaps, but only fine-tuning can encode this specialised behaviour directly into the model's weights. </p><p>It has a higher upfront cost, but it's the only method that makes the model truly yours.</p><h1><strong>3. The Adaptation Spectrum: Prompting &#8594; Retrieval &#8594; Fine-Tuning</strong></h1><p>Fine-tuning isn't the first tool you reach for, nor should it be. In the application layer, there's a spectrum of techniques engineers use to adapt a language model's behaviour. </p><p>Each one solves a different kind of problem. If you don't understand what each technique does, you risk overengineering the solution or misdiagnosing the issue entirely.</p><p>This section lays out what each technique changes</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l0RH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l0RH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!l0RH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!l0RH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!l0RH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l0RH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:604877,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l0RH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!l0RH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!l0RH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!l0RH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0ea2bb4-e059-4d85-acc5-5b9ea266871d_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>3.1 Prompting: Leveraging What the Model Already Knows</strong></h3><p>Prompting is the cheapest lever and, in many cases, a surprisingly effective one. Even base models that haven't been aligned show signs of instruction-following. Why? Because of how they're pretrained.</p><p>When you pretrain on large corpora of internet text, code, and tutorials, the model learns patterns like:</p><ul><li><p><code>Q:</code> followed by <code>A:</code></p></li><li><p>Function definitions followed by documentation</p></li><li><p>"How do I..." questions followed by step-by-step instructions</p></li></ul><p>So when you write a clean instruction, even to an unaligned model, you're not asking it to be helpful. You're asking it to complete a statistical pattern it's already seen thousands of times. </p><div class="pullquote"><p>This became widely visible with GPT-3, where studies showed that even without gradient updates, zero-shot and few-shot prompts produced relevant answers.</p></div><p><strong>The Catch:</strong> It's a runtime illusion. Prompting gives the model hints; it doesn't rewire its understanding. Failure modes appear quickly: outputs are sensitive to phrasing, behaviour breaks with inconsistent inputs, and structured output like JSON or SQL is brittle.</p><h3><strong>3.2 Retrieval: Giving the Model New Knowledge at Runtime</strong></h3><p>When the model lacks facts about your company, product, or recent policies, prompting isn't enough. This is where retrieval augmentation comes in. </p><p>You build a retrieval layer that fetches relevant documents and injects them into the model's context window at inference time. This doesn't change the model's weights, but it changes what it sees before generating a response.</p><p>This works exceptionally well when:</p><ul><li><p>You need factual accuracy grounded in private or internal data.</p></li><li><p>The query is specific to your business, user, or task.</p></li><li><p>You want outputs to reflect recent changes without retraining the model.</p></li></ul><p><strong>The Catch:</strong> Retrieval provides facts, not skills. The model can see the right information but still uses the wrong tone, fails to follow complex domain-specific logic, and can't reliably generate the structured JSON or API calls your system requires. </p><div class="pullquote"><p>Retrieval fills knowledge gaps, but it doesn't modify how the model uses that knowledge.</p></div><h3><strong>3.3 Alignment: Making the Model Generally Helpful</strong></h3><p>The model you use in production isn't just a pretrained model; it's a <strong>pretrained </strong><em><strong>and</strong></em><strong> aligned model</strong>. </p><p>After pre-training, providers run more fine-tuning phases using SFT and RLHF/DPO to make the model follow instructions and prefer helpful, safe outputs. This is provider-controlled, general-purpose fine-tuning.</p><p>Alignment datasets include common-sense Q&amp;A, summarisation, and dialogue. What they don't include:</p><ul><li><p>Your business logic</p></li><li><p>Your tool APIs</p></li><li><p>Your schema constraints</p></li><li><p>Your data formats</p></li></ul><p><strong>The Catch:</strong> Alignment optimises the model to be broadly helpful, but not precisely correct. The model will try to please the user, but won&#8217;t obey your internal logic.</p><h3><strong>3.4 Fine-Tuning: When Behaviour Has to Live Inside the Model</strong></h3><p>Prompting and retrieval adapt the model from the outside, but don't change how it generalises or shift its internal representations. That's what fine-tuning does.</p><p>When you fine-tune,<sup> </sup>even partially, with methods,<sup> </sup>you are updating the weights. You're modifying the statistical pathways that govern interpretation and retraining the model's instincts.</p><p>Done well, fine-tuning enables:</p><ul><li><p><strong>Structural consistency</strong>: Always outputting a tool call in your exact JSON schema, even if the user request is vague or phrased differently.</p></li><li><p><strong>Domain-native reasoning</strong>: Applying internal business rules or specialised jargon as if they were part of the base training data.</p></li><li><p><strong>Prompt-free formatting</strong>: You don&#8217;t need 20-shot prompts to guide output behaviour, it&#8217;s embedded in the weights.</p></li><li><p><strong>Latency and context savings</strong>: No need to re-explain your needs every time, the model starts closer to your expected output by default.</p></li></ul><div class="pullquote"><p>Where alignment seeks to make a model <strong>broadly helpful</strong>, fine-tuning makes it <strong>specifically reliable</strong>.</p></div><p>It&#8217;s heavier. It needs infrastructure.</p><p>But it&#8217;s the only lever that actually changes the model&#8217;s behaviour permanently, across phrasing, across prompts, across tasks.</p><h1><strong>4. When Is Fine-Tuning the Right Answer?</strong></h1><p>You've seen the full adaptation stack. This brings us to the question every builder eventually hits:</p><p>"Should we tune this model, or are we just not prompting it well enough?"</p><p>Here&#8217;s the simplest way to think about it: Fine-tuning is what you do when your prompts have <strong>plateaued</strong>, retrieval has hit its limits, and the model's general alignment doesn't transfer to your specific task.</p><p>The failure isn't about what the model <em>sees</em>&#8212;it's about how it <em>behaves</em>.</p><p>Let's walk through the five scenarios where fine-tuning becomes a real solution, not just a nice-to-have.</p><h3><strong>4.1 You Need to Reliably Enforce a Strict Structure</strong></h3><p>You're trying to get the model to generate structured data, like JSON objects or API calls. Your prompt engineering is sophisticated, using few-shot examples and schema definitions.</p><p>But you still face constant issues:</p><ul><li><p><strong>Brittleness:</strong> The model works for common cases but breaks on novel inputs.</p></li><li><p><strong>Inconsistency:</strong> It occasionally hallucinates fields, uses the wrong data types, or generates unparsable syntax.</p></li></ul><p>This is a classic limitation of in-context learning. The model is mimicking patterns, not learning the underlying grammar of your format.</p><p>Fine-tuning makes that structure a native capability. By training on hundreds or thousands of valid examples, the model internalises the schema's rules.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!52Bc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!52Bc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!52Bc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!52Bc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!52Bc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!52Bc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:307767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!52Bc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!52Bc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!52Bc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!52Bc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1506c2c6-a28d-466e-b422-5a21feac4357_2400x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>4.2 You Need to Master a Complex, Nuanced Task</strong></h3><p>Your task requires a nuanced understanding that goes beyond general knowledge. For example, classifying user feedback into a highly specific, multi-level taxonomy with over 50 labels.</p><p>With prompting, you hit a performance ceiling because:</p><ul><li><p>The context window can't hold enough examples to cover all the edge cases.</p></li><li><p>The model struggles to differentiate between closely related labels.</p></li></ul><p>This isn't a prompting failure; it's a <strong>task-understanding gap</strong>. The model lacks a deep representation of your specific problem space.</p><p>Fine-tuning closes this gap by training the model on thousands of labelled examples, allowing it to learn the subtle patterns and decision boundaries of your unique taxonomy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f1Oa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f1Oa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!f1Oa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!f1Oa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!f1Oa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f1Oa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:315237,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f1Oa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!f1Oa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!f1Oa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!f1Oa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05d02c5-6ce3-4826-bf3b-44be9c24ee97_2400x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>4.3 Your Domain's Semantics Are Underrepresented</strong></h3><p>Your application operates in a specialised domain like bioinformatics, patent law, or financial compliance. The base model, trained on the general internet, doesn't understand your domain's jargon, entities, and relationships. It treats critical keywords as noise.</p><p>While RAG can retrieve documents, it doesn't teach the model how to <em>interpret</em> them like an expert.</p><p>Fine-tuning teaches the model the <strong>semantics of your domain</strong>. It learns that in a medical context, for instance, certain terms have specific implications that are absent in general text.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6f7u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6f7u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!6f7u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!6f7u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!6f7u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6f7u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:291126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6f7u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!6f7u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!6f7u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!6f7u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17db2481-ab91-4cb9-8404-b53c11d7381a_2400x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>4.4 You Need to Transfer a Capability to Another Language</strong></h3><p>Your product works flawlessly in English, but its performance collapses for your German or Japanese users. The model can translate text, but it fails to apply complex skills, like its ability to follow multi-step instructions or use tools, in the new language.</p><p>This is a <strong>capability-transfer gap</strong>. The complex reasoning skills learned during alignment are often English-centric and don't automatically generalise.</p><p>Fine-tuning on multilingual examples of your specific task is the most effective way to transfer a model's core capabilities across languages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ddvk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ddvk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!ddvk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!ddvk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!ddvk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ddvk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:307767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ddvk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!ddvk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!ddvk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!ddvk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7873465b-ca60-4425-a04d-33a49367cda1_2400x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>4.5 You Need to Distill a Capability into a Cheaper, Faster Model</strong></h3><p>Your prototype, built on a state-of-the-art model like GPT-4o, works perfectly but is too slow and expensive for production scale. </p><p>When you try to switch to a smaller, faster open-source model, performance plummets.</p><p>This is a common use case for <strong>distillation</strong>, a form of fine-tuning where you train a smaller "student" model on the outputs of a larger "teacher" model.</p><p>You use the teacher model to generate a high-quality, synthetic dataset of thousands of examples for your specific task. </p><p>Then, you fine-tune the smaller student model on this data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MYAw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MYAw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!MYAw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!MYAw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!MYAw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MYAw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:329403,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MYAw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png 424w, https://substackcdn.com/image/fetch/$s_!MYAw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png 848w, https://substackcdn.com/image/fetch/$s_!MYAw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!MYAw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fc56e11-e8d1-4823-881f-a014bb752d79_2400x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1><strong>5. When Fine-Tuning Will Waste Your Time</strong></h1><p>Fine-tuning isn't cheap in time, cost, or complexity. And unlike a prompt tweak, its effects aren't easily reversible. Before committing to a fine-tuning project, you must ensure you aren't trying to solve the wrong kind of problem.</p><p>Here are four red flags that should tell you: "This isn't a fine-tuning job&#8212;or at least, not yet."</p><h3><strong>5.1 You Don't Have Enough High-Quality Data</strong></h3><p>Fine-tuning works by adjusting the model's weights based on the patterns in your examples. If those examples are too few, too noisy, or don't accurately represent the behaviour you want, you will be training the model on garbage.</p><p>Common failure modes include:</p><ul><li><p><strong>Overfitting:</strong> With too few examples (e.g., &lt; 500-1000 for many tasks), the model doesn't learn the general skill you're trying to teach. It just memorises the specific examples, failing to generalise to new, unseen inputs.</p></li><li><p><strong>Noise Amplification:</strong> If your data is noisy or inconsistently labelled, the model will faithfully bake that noise directly into its weights, making its behaviour erratic and unreliable.</p></li></ul><p>A critical first step is to validate your task with <strong>In-Context Learning (ICL)</strong>, or few-shot prompting. If you can't get reasonable performance by showing the model a handful of high-quality examples in a prompt, fine-tuning on a larger set of those same examples is unlikely to succeed.</p><h3><strong>5.2 Your Task Relies on Volatile, Fast-Changing Information</strong></h3><p>Fine-tuning is for teaching the model a <strong>persistent skill or style</strong>. It is the wrong tool for teaching knowledge that changes frequently.</p><p>If your use case relies on information that is updated daily or hourly, such as:</p><ul><li><p>Product inventory and pricing</p></li><li><p>Live news or event tracking</p></li><li><p>Real-time user data</p></li></ul><p>...then a fine-tuned model will be perpetually out of date. Each update would require retraining and redeploying the model, creating a massive maintenance overhead.</p><p>This is a classic use case for <strong>Retrieval-Augmented Generation (RAG)</strong>. RAG is designed to provide the model with fresh, volatile information at inference time, separating the model's stable "skills" from the dynamic "knowledge" it needs to act on.</p><h3><strong>5.3 You Have Strict Latency or Deployment Constraints</strong></h3><p>Fine-tuning can impact your deployment architecture and performance budget. Even parameter-efficient methods like LoRA require the entire base model's weights to be loaded into GPU memory for inference.</p><p>This presents a problem if you are deploying in a constrained environment:</p><ul><li><p><strong>Edge/Mobile:</strong> A 7-billion parameter model, even a quantised one, can be too large and slow for on-device applications with tight memory and latency budgets.</p></li><li><p><strong>High-Throughput Services:</strong> If your service needs to handle thousands of requests per second with low latency, the cost of serving a large, fine-tuned model can be prohibitive.</p></li></ul><p>Before fine-tuning, profile your target deployment environment. Sometimes, clever prompting on a smaller, faster model or using a highly optimised API is a better solution than deploying a fine-tuned model yourself.</p><h3><strong>5.4 The Task Demands Immediate Controllability</strong></h3><p>Fine-tuning hardcodes behaviour. If the model develops a flawed or harmful tendency, you can't fix it with a quick prompt change. The flaw is now part of the model's core logic, and fixing it requires a new training and deployment cycle.</p><p>This creates a critical trade-off: <strong>power vs. control</strong>.</p><p>While fine-tuning is powerful for teaching domain knowledge (as we saw in 4.3), it is a high-risk choice for applications where the ability to immediately patch, steer, or disable a behaviour is the top priority.</p><p>This is especially true for:</p><ul><li><p>Customer-facing applications with direct brand exposure.</p></li><li><p>Domains where new failure modes can emerge rapidly (e.g., new types of scams or adversarial attacks).</p></li></ul><p>In these scenarios, keeping logic in the prompt or in external rule engines gives you more immediate control. You can update a guardrail prompt in seconds; a fine-tuning run takes hours or days. If your operational posture requires instant intervention, fine-tuning may introduce an unacceptable level of response lag.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Io_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Io_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!4Io_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!4Io_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!4Io_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Io_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:666394,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/165758078?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Io_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!4Io_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!4Io_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!4Io_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec06353-40c2-403e-8d63-69bbb078b1d4_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1><strong>Wrap-up: The Strategic Edge</strong></h1><p>Fine-tuning isn&#8217;t a magic wand. It&#8217;s a deliberate architectural decision to trade runtime controllability for baked-in, specialised behaviour.</p><p>You now have the framework to make that trade. You know when to pull that lever&#8212;to enforce a strict structure, master a complex task, or distill a capability into a more efficient model. And just as importantly, you know the red flags that signal it's the wrong choice, saving you from wasting time, money, and effort.</p><p>But strategy is only half the equation.</p><p><strong>In Part 2, we go from &#8220;Should I fine-tune?&#8221; to &#8220;How do I fine-tune </strong><em><strong>well</strong></em><strong>?&#8221;</strong></p><p>We&#8217;ll cover the engineering reality of execution:</p><ul><li><p><strong>The Modern Methods:</strong> How to navigate the trade-offs between full fine-tuning, the efficiency of PEFT (LoRA/QLoRA), and the scale of distillation.</p></li><li><p><strong>The Production Pipeline:</strong> A step-by-step walkthrough of the end-to-end workflow, from curating the perfect dataset to safe deployment and monitoring.</p></li><li><p><strong>The Technical Risks:</strong> A guide to the real-world failure modes&#8212;like catastrophic forgetting and silent regressions&#8212;and the engineering discipline required to prevent them.</p></li></ul><p>If Part 1 was about <em>when</em> to do it, Part 2 is about how to do it without breaking everything.</p><h1><strong>References &amp; Further Reading</strong></h1><ul><li><p><a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners (GPT-3 paper)</a></p></li><li><p><a href="https://arxiv.org/abs/2201.11903">Chain-of-Thought Prompting Elicits Reasoning</a></p></li><li><p><a href="https://platform.openai.com/docs/guides/fine-tuning">OpenAI Fine-Tuning Guide</a></p></li><li><p><a href="https://ai.meta.com/blog/adapting-large-language-models-llms/">Meta &#8212; Adapting Large Language Models</a></p></li><li><p><a href="https://ai.meta.com/blog/when-to-fine-tune-llms-vs-other-techniques/">Meta &#8212; When to Fine-Tune LLMs vs Other Techniques</a></p></li><li><p><a href="https://arxiv.org/abs/2203.02155">InstructGPT: Aligning Language Models with Human Feedback</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[The Dangerous Thing About AI Hype?]]></title><description><![CDATA[Before the Next Deep Dive, A Message for the Builders.]]></description><link>https://blog.neosage.io/p/the-dangerous-thing-about-ai-hype</link><guid isPermaLink="false">https://blog.neosage.io/p/the-dangerous-thing-about-ai-hype</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Fri, 23 May 2025 02:15:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!72zp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>NeoSage isn&#8217;t shipping a full issue this week.</p><p>Not because there isn&#8217;t enough to write.<br>But because there&#8217;s too much to cut through.</p><p>I&#8217;ve been head-down, curating what comes next &#8212; and I mean really curating. Because I don&#8217;t just want to publish another deep dive. I want every issue to <strong>raise the bar</strong>, sharpen your intuition, and help you think in systems, not soundbites.</p><p>And lately, there&#8217;s been too many of those.</p><p>Everywhere you look, the AI space is on fire &#8212; but not always in a good way.</p><p>You&#8217;ve got founders, CEOs, and VC-backed evangelists sprinting to say the same thing, louder and faster:</p><blockquote><p>&#8220;AI is replacing humans faster than we can adapt.&#8221;<br>&#8220;You don&#8217;t need developers anymore &#8212; just AI.&#8221;<br>&#8220;Build 10x faster, deploy in hours. Vibe code your way to production.&#8221;</p></blockquote><p>These statements are <strong>problematic for several reasons</strong>, and today, I want to walk you through exactly why.</p><p>Because when hype becomes the loudest voice in the room,<br>clarity becomes a responsibility.</p><p>And that&#8217;s what this issue is about.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><h2>Before we dive in, Meet Nocto</h2><p>Now, before we dive in, there&#8217;s someone I&#8217;ve been meaning to introduce.</p><p>You&#8217;ve probably seen him perched silently at the corner of our visuals &#8212;<br>the quiet observer with far too much caffeine and not enough patience for low-quality takes.</p><p>That&#8217;s <strong>Nocto</strong> &#8212; the NeoSage owl.<br>Cynical. Sharp-eyed. Lives on espresso and questionable humour.<br>Also, the only creature I trust to edit my drafts without hallucinating a product roadmap.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!72zp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!72zp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!72zp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!72zp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!72zp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!72zp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6522415,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/164206862?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!72zp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!72zp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!72zp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!72zp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba00000-dd0e-4238-8676-ef31ac82b53e_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Nocto&#8217;s been around since the early days of transformer papers &#8212; quietly watching the rise, the hype, and the chaos.</figcaption></figure></div><p>He doesn&#8217;t speak often, but when he does, it&#8217;s usually something like:</p><blockquote><p>&#8220;That won&#8217;t scale.&#8221;<br>&#8220;That prompt&#8217;s going to blow up in prod.&#8221;<br>&#8220;Add a failure mode or it&#8217;s just a fantasy.&#8221;</p></blockquote><p>So if you see Nocto lurking around the margins of NeoSage&#8230;<br>Just know he&#8217;s watching the same hype train I am, and rolling his eyes just as hard.<br><br>Say Hi, and let&#8217;s get back! </p><h2>What These Narratives Miss</h2><p>Let&#8217;s talk about what these narratives are actually doing.</p><p>Because statements like</p><blockquote><p>&#8220;AI will replace X% of all professionals by the end of this year,&#8221;<br>or &#8220;You don&#8217;t need developers anymore &#8212; just AI,&#8221;</p></blockquote><p>&#8230;they&#8217;re not just loud headlines.<br>They&#8217;re framing devices &#8212; and they come with consequences.</p><p><strong>First, they create panic.</strong></p><p>If you&#8217;re a developer, a designer, a customer support rep, or anyone whose field is being mentioned in these projections, you&#8217;re not hearing encouragement to upskill or adapt.<br>You&#8217;re hearing: <em>You&#8217;re on the way out.</em></p><p>That doesn&#8217;t help anyone build.<br>It only creates fear and often paralysis.</p><p><strong>Second, they build an overly optimistic picture of what AI can currently do.</strong><br>And I understand why that happens.</p><p>When billions have been invested in a product or platform, the pressure to deliver results often shifts into a pressure to sell <em>the vision</em>.<br>So you sell the potential &#8212; loudly.<br>Even if that potential still requires ten layers of scaffolding to hold up in the real world.</p><p><strong>Third, they shift the focus from </strong><em><strong>how we get there</strong></em><strong> to </strong><em><strong>what will be</strong></em><strong>.</strong></p><p>We stop asking:<br>How do we make AI outputs reliable?<br>What&#8217;s the failure mode here?<br>How do we structure systems that don&#8217;t fall apart in production?</p><p>And instead, we start asking:<br>Will I still have a job next year?</p><p>That&#8217;s not progress. That&#8217;s distraction.</p><p><strong>Fourth &#8212; and this is the one I care about the most &#8212; they oversell the power of speed and cost reduction without ever showing people how to </strong><em><strong>actually</strong></em><strong> tap into it.</strong></p><p>You can&#8217;t just tell people &#8220;AI will 10x your workflow&#8221; and walk away.<br>That&#8217;s not insight &#8212; that&#8217;s marketing.<br>And people with little to no experience end up paying for that gap in time, in technical debt, or in production failures that look good on demo day but collapse under load.</p><p>A few weeks ago, Sam Altman asked a panel audience:</p><blockquote><p>&#8220;How many people here feel smarter than GPT-4?&#8221;</p></blockquote><p><em>(Well... that&#8217;s kind of like asking whether I&#8217;m smarter than a calculator. I mean... anyway.)</em></p><p>But that&#8217;s the kind of framing I&#8217;m talking about.<br>It doesn&#8217;t inform. It doesn&#8217;t equip.<br>It impresses and subtly disempowers.</p><h2>My Core Belief</h2><p>I&#8217;m not against these conversations.<br>I&#8217;m not even against the ambition behind them.</p><p>I&#8217;m a massive proponent of AI.<br>That&#8217;s what NeoSage is all about &#8212; helping you understand how to work with these systems, not just admire them from a distance.</p><p>But what I&#8217;m concerned about is <strong>how we&#8217;re framing the conversation</strong>.</p><p>We talk about what AI might replace.<br>We talk about how fast it can build, ship, and scale.<br>We talk about cost reduction, fewer people, more speed.</p><p>What we don&#8217;t talk about enough is:<br><strong>How to actually use it well.</strong></p><p>Because AI is not magic.<br>It&#8217;s a technology &#8212; a tool &#8212; and like every tool we&#8217;ve ever built,<br>It&#8217;s only as powerful as the person using it.</p><p>The risk isn&#8217;t that people won&#8217;t use AI.<br>The risk is that people will use it <em>wrong</em> &#8212;<br>without knowing the limits, the failure modes, the trade-offs.</p><p>And that gap?<br>It doesn&#8217;t just slow you down.<br>It costs you in time, in quality, in reliability, and in ways that often show up too late.</p><p>So yes &#8212; AI <em>can</em> speed up development,<br><em>can</em> reduce human effort,<br><em>can</em> bring down operational costs.</p><p>But only if you understand what you&#8217;re working with.</p><p>Otherwise, you&#8217;ll pay for it.<br>And not just with money.</p><h3>The Four Pillars of Building with AI &#8212; Responsibly</h3><p>So what should we be saying instead?</p><p>If you're a leader, a founder, a CTO, or an AI builder &#8212;<br>You&#8217;re not just deciding <em>whether</em> to adopt AI.<br>You&#8217;re deciding <em>how</em>, <em>where</em>, and <em>how far</em> to take it.<br>And in a space moving this fast, that decision will either compound value or technical debt.</p><p>Here are four pillars I believe should stay top of mind as you build.</p><h4><strong>1. Expert Intuition Is Not Replaceable</strong></h4><p>At least not with current capabilities &#8212; or until you&#8217;ve built a fully orchestrated, truly autonomous system.<br>AI today can code, write, and generate. But it cannot <em>know</em>.<br>It has no mental model of your product, your users, your trade-offs, or your non-negotiables.</p><p>And until it does, expert oversight isn&#8217;t optional &#8212; it&#8217;s the only thing keeping your velocity from turning into fragility.</p><p>Replace too early, and what you gain in surface speed, you lose in root stability.</p><h4><strong>2. AI Is Not Magic &#8212; It&#8217;s a Tool</strong></h4><p>The mistake isn&#8217;t overestimating AI.<br>It&#8217;s forgetting that every system it touches needs guardrails, grounding, and fallback modes.<br>That&#8217;s not pessimism, that&#8217;s systems thinking.</p><p>If you&#8217;re treating the model as the product,<br>if you&#8217;re shipping prompts as logic,<br>if you&#8217;re trusting generative outputs without evaluation layers &#8212;<br>You&#8217;re not building software. You&#8217;re rolling the dice.</p><h4><strong>3. Security &gt; Speed</strong></h4><p>Every AI product pitch says, &#8220;ship 10x faster.&#8221;<br>But no customer remembers how fast you shipped.<br>They remember when something failed.<br>Or worse, when something leaked.</p><p>As leaders, it&#8217;s easy to prioritise acceleration &#8212;<br>But your real edge isn&#8217;t in being fast.<br>It&#8217;s about being fast <strong>without</strong> compromising trust, traceability, or user safety.</p><p>Cutting corners on plain old security standards in favour of speed isn&#8217;t bold.<br>It&#8217;s shortsighted.</p><h4><strong>4. Systems Are Built on Discipline, Not Hype</strong></h4><p>The best Software systems in production today?<br>They aren&#8217;t magic. They&#8217;re well-architected.</p><p>They&#8217;re layered, observable, retrievable, resilient &#8212;<br>because someone treated them like systems, not stunts.</p><p>And that&#8217;s the job.</p><p>Not to follow the vision.<br>But to build what the vision <strong>requires</strong> &#8212;<br>under the constraints of latency, cost, safety, and scale.</p><p>That&#8217;s what separates hype from infrastructure.<br>And that&#8217;s where the real opportunity lives.</p><p><strong>So if you&#8217;re leading the charge on AI </strong><br>Don&#8217;t just ask what it can do.<br>Ask what it takes to use it <strong>well</strong>.</p><p>Adopting AI is no longer the hard part.<br>Building with it <strong>responsibly</strong>, <strong>robustly</strong>, and <strong>without regrets later</strong> &#8212;<br>That&#8217;s the real work.</p><h2>This wasn&#8217;t a typical NeoSage issue &#8212; by design.</h2><p>There&#8217;s so much noise in this space<br>What we need more of is <strong>context</strong>, <strong>clarity</strong>, and <strong>skin in the game</strong>.</p><p>Because most people don&#8217;t need another LinkedIn post telling them AI is the future.<br>They need someone to show them how to navigate it and build for it, without getting lost in the abstraction.</p><p>That&#8217;s what I&#8217;m trying to do here.<br>That&#8217;s what I&#8217;ll keep doing, issue by issue.</p><p>So next week, we get back to our usual programming.<br>Back to deep dives, frameworks, architecture, and intuition-first explanations.</p><p>But this week?<br>This one needed to be said.</p><p>If this resonated, share it.<br>If it challenges something, sit with it.<br>And if there&#8217;s a builder, leader, or CTO you know who&#8217;s making AI bets right now, send it to them.</p><p>Let&#8217;s raise the bar for how we talk about this space.<br>Because the future won&#8217;t be built by the loudest.<br>It&#8217;ll be built by those who know what they&#8217;re doing.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NeoSage! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>See you next week.<br>Shivani</strong><br><em>Owl-thor, with Nocto silently judging from the corner</em></p>]]></content:encoded></item><item><title><![CDATA[Inside DeepSeek-R1: A Masterclass in Incentivising Intelligence]]></title><description><![CDATA[What DeepSeek-R1 really teaches us: how to build models that learn, align, and evolve &#8212; without millions of labels.]]></description><link>https://blog.neosage.io/p/inside-deepseek-r1-a-masterclass</link><guid isPermaLink="false">https://blog.neosage.io/p/inside-deepseek-r1-a-masterclass</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Thu, 15 May 2025 23:17:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You&#8217;ve probably seen the benchmarks.</p><p>Open weights. Performance on par with <strong>OpenAI&#8217;s o1 series models</strong></p><p>People were stunned.</p><p>Investors even started questioning OpenAI&#8217;s moat &#8212; stocks dipped.</p><p>But that&#8217;s not what makes DeepSeek-R1 remarkable.</p><p>Not really.</p><p>What actually matters &#8212; and what almost nobody talked about &#8212; is <strong>how</strong> it got there.</p><p>Because DeepSeek-R1 isn&#8217;t just a better-trained model.</p><p>It&#8217;s a <strong>blueprint for how to engineer intelligence</strong> into systems that were never taught what reasoning looks like.</p><p>No massive supervised dataset.</p><p>No army of human annotators.</p><p>No fancy process reward models.</p><p>Just a series of training choices that, when you look closely, form a system-level masterclass in making language models do more than predict tokens.</p><p>In this issue, I&#8217;ll walk you through the exact architecture, training loop, and lessons from the DeepSeek-R1 paper &#8212; not just to admire what they built, but to <strong>understand what we can borrow</strong>.</p><p>Because if your work involves LLMs that need to reason, align, or evolve over time &#8212;</p><p>DeepSeek-R1 isn&#8217;t just a model worth studying.</p><p>It&#8217;s a system worth stealing from.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><h2>Why DeepSeek-R1 Mattered So Much, So Fast</h2><p>When DeepSeek-R1 dropped, the headlines focused on one thing: performance.</p><blockquote><p>79.8% on AIME 2024.</p><p>97.3% on MATH-500.</p><p>96.3 percentile on Codeforces.</p><p>On par with OpenAI&#8217;s o1-1217 &#8212; OpenAI&#8217;s &#8216;then&#8217; best reasoning model</p></blockquote><p>That alone was enough to cause a stir.</p><div class="pullquote"><p>If you&#8217;re not big on benchmarks, don&#8217;t worry because me neither. Benchmark reports should always be taken with a big rock of salt. (Yes, rock salt. :P)</p></div><p>But what really jolted the industry was the <strong>cost-efficiency</strong> behind those numbers.</p><p>DeepSeek didn&#8217;t just release open weights &#8212; they released <strong>MoE architecture</strong>, <strong>inference-optimised routes</strong>, and a <strong>671B parameter model that activates only 37B per forward pass</strong>.</p><p>The result? Comparable output quality at <strong>a fraction of the inference cost</strong> &#8212; and that <em>did</em> reflect in OpenAI&#8217;s stock price.</p><p>But even that isn&#8217;t the full story.</p><p>What makes DeepSeek-R1 impossible to ignore, especially for engineers, is this:</p><blockquote><p>It wasn&#8217;t trained the way models are usually trained.</p></blockquote><p>There was no massive supervised alignment stage.</p><p>No instruction tuning on millions of curated tasks.</p><p>No handcrafted demonstrations &#8212; just reward structures that made reasoning emerge on its own.</p><p>Instead, DeepSeek-R1 was built to answer a very different question:</p><blockquote><p>Can you train a language model to reason, not by showing it what reasoning looks like, but by rewarding it when it gets it right?</p></blockquote><p>That single bet is what makes this system so relevant.</p><p>Because the pipeline that emerged from it isn&#8217;t just academically novel &#8212; it&#8217;s a practical rethink of <strong>how to get reasoning from a model without incurring massive overhead</strong>.</p><p>And if you&#8217;re in the business of building applied LLM systems &#8212; whether that&#8217;s fine-tuning smaller models, training agents, or aligning behaviour &#8212; that question is <em>your</em> question too.</p><p>So from this point on, we stop looking at R1 as &#8220;a strong open model.&#8221;</p><p>And start looking at it as a <strong>system architecture</strong> &#8212; one that happens to make strong reasoning emerge with lower training burden, lower inference cost, and far better alignment with engineering constraints.</p><p>Let&#8217;s unpack that system.</p><div class="pullquote"><p><strong>Note for the reader:</strong><br>This breakdown has been intentionally kept accessible, not to simplify the work, but to sharpen your intuition. The goal isn&#8217;t just to understand DeepSeek-R1, but to update your mental model so you can take these ideas to the application layer.</p></div><h2>The Core Bet &#8212; Can Reasoning Be Incentivised, Not Taught?</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-y6d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-y6d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!-y6d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!-y6d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!-y6d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-y6d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:517831,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-y6d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!-y6d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!-y6d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!-y6d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea1ba72-c727-4f28-a685-167199028879_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most modern LLMs are trained in two broad stages:</p><ol><li><p><strong>Pretraining</strong> &#8212; on massive amounts of raw text, predicting the next token across everything from books to code to forums</p></li><li><p><strong>Post-training / alignment</strong> &#8212; where the model is fine-tuned to be helpful, truthful, and aligned with human intent</p></li></ol><p><strong>If interested</strong> in knowing more about how LLMs are trained, read these issues:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;dc552ba4-0b55-461b-a248-879baa38724f&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How GPTs Are Born: Internet Feeding, Token by Token&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:329528627,&quot;name&quot;:&quot;Shivani Virdi&quot;,&quot;bio&quot;:&quot;Engineering at Microsoft | Simplifying AI for Everyone | Empowering Productivity with Proven Frameworks and Processes&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d15370b-dcd2-4300-be03-cf811f0f45d9_862x862.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-16T14:13:19.022Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.neosage.io/p/how-gpts-are-born-internet-feeding&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161399912,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:23,&quot;comment_count&quot;:8,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NeoSage&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8266222-d17f-4639-a529-67ae92f79bb1_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;02ae37e6-9582-4f52-ad74-5492ac04fab8&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How GPTs Learn to Be Helpful&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:329528627,&quot;name&quot;:&quot;Shivani Virdi&quot;,&quot;bio&quot;:&quot;Engineering at Microsoft | Simplifying AI for Everyone | Empowering Productivity with Proven Frameworks and Processes&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d15370b-dcd2-4300-be03-cf811f0f45d9_862x862.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-23T16:22:18.145Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.neosage.io/p/how-gpts-learn-to-be-helpful&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161930085,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:11,&quot;comment_count&quot;:7,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NeoSage&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8266222-d17f-4639-a529-67ae92f79bb1_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>That second stage is where most of the nuance comes in &#8212; and where most models start to diverge.</p><p>The typical pipeline looks like this:</p><ul><li><p>Start with a <strong>base model</strong> &#8212; a pretrained LLM that&#8217;s good at next-token prediction, but still brittle or unhelpful in practice</p></li><li><p>Apply <strong>Supervised Fine-Tuning (SFT)</strong> &#8212; feed it carefully curated examples of &#8220;good completions&#8221; for prompts, and nudge it toward copying that behaviour</p></li><li><p>Optionally, add a <strong>reward model</strong> trained on human preferences, then use reinforcement learning (usually PPO) to further optimise</p></li></ul><p>SFT, in particular, became the backbone of alignment strategies because it's simple and data-efficient:</p><p>You just show the model enough &#8220;good completions&#8221; and let it imitate the output.</p><p>But that approach comes with limits.</p><p>Because while it can teach the model what correct answers <em>look like</em>,</p><p>It doesn&#8217;t necessarily help it understand <em>how to think through</em> the problem, especially in domains like math, logic, or program synthesis.</p><p>You end up with models that pattern-match well, but can&#8217;t adapt their process when the pattern changes.</p><p>And that&#8217;s exactly the gap DeepSeek set out to address.</p><p>Instead of building a reasoning model by <strong>showing it what reasoning looks like</strong>,</p><p>they asked a much more interesting question:</p><blockquote><p>What if we just rewarded the model whenever it reasoned correctly &#8212; and let it figure out the rest on its own?</p></blockquote><p>That&#8217;s the bet behind <strong>DeepSeek-R1-Zero</strong>.</p><p>No demonstrations. No handcrafted completions.</p><p>Just a base model, a carefully structured training loop, and a reward signal grounded in outcomes.</p><p>And surprisingly, it worked.</p><p>Here&#8217;s how:</p><p>They weren&#8217;t optimising for final answers alone.</p><p>They were <strong>incentivising process,</strong> behaviour patterns that resemble reasoning:</p><p>longer chains of thought, structured output, internal verification, and the correct final answer.</p><p>That meant two things:</p><ol><li><p>The reward signal had to be <strong>grounded</strong> &#8212; e.g., for math, the model had to output its final answer in a strict format so correctness could be programmatically verified. For code, it had to compile and pass test cases.</p></li><li><p>The reasoning process had to be <strong>detectable,</strong> so they enforced a structured template: <code>&lt;think&gt;</code> for intermediate reasoning, <code>&lt;answer&gt;</code> for final result</p></li></ol><p>This wasn&#8217;t prompting. It wasn&#8217;t fine-tuning on examples.</p><p>It was <strong>incentive engineering</strong> &#8212; shaping the model&#8217;s behaviour by designing a reward system where reasoning becomes the optimal strategy.</p><p>To do this, they used <strong>Group Relative Policy Optimisation (GRPO)</strong> &#8212; a reinforcement learning approach that doesn&#8217;t require a separate critic network.</p><p>Instead of relying on external evaluation, GRPO works by sampling a group of outputs for each question and scoring them relative to one another.</p><p>The model learns by reinforcing whichever outputs perform best &#8212; a kind of internal competition &#8212; without needing labelled comparisons or reward models trained on preferences.</p><p>We&#8217;ll go deeper into how GRPO works in the next section. But the key idea here is this:</p><blockquote><p>DeepSeek didn&#8217;t teach the model how to reason.</p><p>They built a system where reasoning was the best strategy for getting rewarded.</p></blockquote><p>And that shift &#8212; from instruction to incentive &#8212; unlocks a fundamentally different kind of training pipeline.</p><p>If reasoning can emerge through reinforcement alone,</p><p>you&#8217;re no longer limited by how many examples you can label.</p><p>You&#8217;re only limited by <strong>how well you can define success.</strong></p><p>That&#8217;s what makes DeepSeek-R1-Zero so important.</p><p>It&#8217;s not just a training variant.</p><p>It&#8217;s a new way to think about how intelligent behaviour gets built.</p><h2>Inside R1-Zero&#8217;s Training Loop &#8212; How Reinforcement Actually Worked</h2><p>To train DeepSeek-R1-Zero, the team didn&#8217;t start with labelled examples of &#8220;good&#8221; reasoning.</p><p>They started with a <strong>pretrained base model</strong> &#8212; <strong>DeepSeek-V3-Base</strong> &#8212; and no additional supervised data.</p><p>This base model was trained like most foundation models: on next-token prediction over large-scale web data.</p><p>At this stage, it had no alignment, no formatting consistency, and no reasoning skill beyond pattern matching.</p><p>The DeepSeek team didn&#8217;t fine-tune it on curated examples.</p><p>Instead, they designed a <strong>reinforcement learning loop</strong> that rewarded the <em>outcomes</em> of good reasoning and let the model figure out the process on its own.</p><p>This was the core design shift:</p><blockquote><p>Don&#8217;t show the model how to reason. Just define what success looks like and let it discover reasoning as the optimal strategy.</p></blockquote><p>Let&#8217;s walk through how this worked.</p><h3>The Setup: From Base Model to Self-Improving System</h3><p>The reinforcement setup had three main components:</p><ol><li><p>A <strong>prompt dataset</strong> &#8212; questions/tasks covering math, coding, science, and logic</p></li><li><p>A <strong>reward function</strong> &#8212; that could score completions automatically</p></li><li><p>An RL algorithm &#8212; <strong>GRPO (Group Relative Policy Optimisation)</strong></p></li></ol><p>The model generated multiple completions for each prompt, and GRPO was used to update the model toward the better-performing ones.</p><p>But what makes GRPO different, and especially suited for this, is that it <strong>doesn&#8217;t require a critic model</strong>.</p><p>Let&#8217;s unpack that.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RhAp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RhAp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!RhAp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!RhAp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!RhAp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RhAp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:552383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RhAp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!RhAp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!RhAp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!RhAp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ccda136-d398-4d95-9744-7da6e4a3120a_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>GRPO: What Changed and Why It Matters</h3><p>Most reinforcement learning setups for LLMs (like PPO, used in RLHF) rely on a <strong>critic model</strong> &#8212; a second neural network trained to estimate how good an output is.</p><p>That&#8217;s expensive to train, hard to stabilise at scale, and can introduce noise if the critic itself is misaligned.</p><p><strong>GRPO (Group Relative Policy Optimisation)</strong> drops the critic completely.</p><p>Instead, it scores outputs <strong>relative to each other within a group</strong>, using just a reward function &#8212; no second network.</p><p>Here&#8217;s the flow:</p><ol><li><p>For each prompt, the model samples <strong>K outputs</strong></p></li><li><p>Each output is scored with a rule-based reward </p></li><li><p>The <strong>group&#8217;s mean reward</strong> becomes the baseline</p></li><li><p>Each output&#8217;s <strong>advantage</strong> is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A_{jk} = R_{jk} - \\bar{R}_j$&quot;,&quot;id&quot;:&quot;OMDNPAFHUP&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><ul><li><p><code>R_{jk}</code> is the reward for output k in prompt j</p></li><li><p>\bar{R}_j is the mean reward for all K outputs for that prompt</p></li></ul></li></ol><p>This replaces PPO&#8217;s value function with <strong>group-level comparison</strong>.</p><p>And instead of needing a value estimate, GRPO just says:</p><blockquote><p>&#8220;Which completions were better than average?&#8221;</p></blockquote><p>Then it nudges the model to prefer those, using a KL penalty to stay stable.</p><p>The full update loss looks like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = - \\sum_{j,k} \\left( \\frac{\\pi_{\\theta}(a_{jk} | s_j)}{\\pi_{\\theta_{\\text{old}}}(a_{jk} | s_j)} A_{jk} \\right) + \\beta \\sum_j \\text{KL}\\left(\\pi_{\\theta}(\\cdot | s_j) \\| \\pi_{\\theta_{\\text{old}}}(\\cdot | s_j)\\right)&quot;,&quot;id&quot;:&quot;CODPDHIPZZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Why this matters for models like DeepSeek-R1:</p><ul><li><p>&#9989; It&#8217;s <strong>stable</strong> across large batch sizes</p></li><li><p>&#9989; It&#8217;s <strong>cheap</strong> &#8212; no critic to maintain</p></li><li><p>&#9989; It scales &#8212; GRPO was used to train DeepSeekMath with 64 completions per prompt</p></li></ul><p>And most importantly, it works beautifully in domains where <strong>you can verify outputs</strong>, like math and code. That&#8217;s what made it the backbone of DeepSeek-R1-Zero.</p><p>GRPO doesn&#8217;t teach by example.</p><p>It teaches by comparison and lets the model <strong>discover the better path</strong>.</p><h3>The Reward Functions: How Reasoning Was Incentivised</h3><p>DeepSeek-R1-Zero used <strong>two reward signals</strong>, both programmatic and fully automatable:</p><ol><li><p><strong>Accuracy Reward </strong><em>(These examples are for understanding purposes and not sourced <strong>verbatim</strong> from the original paper)</em></p><ul><li><p>For math tasks: reward = 1 if the final answer was correct (e.g., matched a boxed number), 0 otherwise </p></li><li><p>For code: reward = 1 if the code compiled and passed test cases, 0 otherwise</p></li><li><p>For logic/science: multiple-choice answer correctness or rule-based consistency checks</p></li></ul></li><li><p><strong>Format Reward</strong></p><ul><li><p>Every output had to follow a fixed structure:</p><pre><code>&lt;think&gt; reasoning steps &lt;/think&gt;
&lt;answer&gt; final answer &lt;/answer&gt;</code></pre></li><li><p>Outputs that violated the format were scored 0 and ignored</p></li></ul></li></ol><p>This structure wasn&#8217;t decorative &#8212; it was essential.</p><p>The <code>&lt;think&gt;</code> section forced the model to externalise intermediate reasoning.<br>The <code>&lt;answer&gt;</code> section made verification easy and automated.</p><p>Together, these two rewards created a simple loop:</p><ul><li><p>If you think clearly and answer correctly &#8594; you&#8217;re reinforced</p></li><li><p>If you hallucinate, skip steps, or break format &#8594; you&#8217;re ignored or penalised</p></li></ul><p>And with GRPO driving learning, the model slowly evolved to <strong>prefer the reasoning strategies</strong> that led to high scores, even without seeing a single example of what &#8220;good reasoning&#8221; looked like.</p><h3>The Aha Moments: What Emerged in the Process</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!twSV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!twSV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!twSV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!twSV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!twSV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!twSV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:924769,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!twSV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!twSV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!twSV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!twSV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff03ae68a-9f80-40c7-9c1e-475296ed1e6f_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is where it gets interesting.</p><p>As training progressed, the model didn&#8217;t just get more accurate.</p><p>It started to behave <strong>as if it understood the value of thinking</strong>.</p><p>In the paper, the authors show examples like this:</p><pre><code><code>&lt;think&gt;
Wait, wait. Wait. Let me try a different method...
&lt;/think&gt;
&lt;answer&gt;
[Correct boxed result]
&lt;/answer&gt;</code></code></pre><p>This wasn&#8217;t cherry-picked.</p><p>This pattern of reevaluation, error checking, and iterative problem solving <strong>emerged</strong> purely from reinforcement.</p><p>The model began:</p><ul><li><p>Taking longer paths to answers</p></li><li><p>Writing self-checking logic</p></li><li><p>Rephrasing its own steps mid-generation</p></li><li><p>Learning <em>how to reason</em> because it was the most reliable way to get reward</p></li></ul><p>One intermediate checkpoint had already achieved:</p><ul><li><p>71% Pass@1 on AIME-2024 (up from 15.6%)</p></li><li><p>86.7% with majority voting &#8212; matching OpenAI&#8217;s o1-0912</p></li></ul><p>All without ever seeing a supervised CoT (Chain of thought) example.</p><p>This wasn&#8217;t just scaling token prediction.</p><p>This was behaviour change &#8212; learned from first principles.</p><h3>Where It Fell Short</h3><p>But R1-Zero wasn&#8217;t usable out of the box.</p><p>Despite its reasoning capability, it had critical flaws:</p><ul><li><p><strong>Poor readability</strong>: reasoning traces were verbose and messy</p></li><li><p><strong>Language mixing</strong>: often switched between English and Chinese mid-output</p></li><li><p><strong>No general instruction following</strong>: it wasn't aligned to be helpful or polite, just to reason</p></li></ul><p>That&#8217;s why DeepSeek-R1 introduced a second-stage cold-start + RL stack.</p><p>But the point was proven: the model didn&#8217;t need instruction tuning to learn how to reason.</p><p>It needed a well-designed <strong>feedback loop</strong>.</p><h3>Why This Loop Matters</h3><p>This training loop gave us the first open proof that:</p><ul><li><p>A model can <strong>develop reasoning behaviours</strong> purely from reinforcement</p></li><li><p>You don&#8217;t need to hardcode thought &#8212; you can <strong>incentivise it</strong></p></li><li><p>GRPO offers a scalable, low-friction alternative to PPO and RLHF-style setups</p></li><li><p>Reasoning isn&#8217;t a dataset problem &#8212; it&#8217;s a system design problem</p></li></ul><p>And for engineers building alignment stacks, agent loops, or low-cost reasoning assistants &#8212;</p><p>That opens a whole new frontier.</p><p>Because now, you don&#8217;t need to start with answers.</p><p>You just need to define the kind of outputs you want and design a loop that rewards getting there.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/p/inside-deepseek-r1-a-masterclass?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">So happy to see you enjoy NeoSage. Share the love and send this to the smartest dev you know</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/p/inside-deepseek-r1-a-masterclass?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/p/inside-deepseek-r1-a-masterclass?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h2>From R1-Zero to R1 &#8212; Building a System That Aligns</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JGkV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JGkV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!JGkV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!JGkV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!JGkV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JGkV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:361245,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JGkV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!JGkV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!JGkV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!JGkV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F684ec915-bae2-4ccc-ba2b-5ccbf909c0d3_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>By the end of R1-Zero, DeepSeek had something rare:</p><p>A model that could reason, without ever being shown how.</p><p>Through reinforcement alone, it had learned to chain thoughts, reevaluate steps, and converge on answers.</p><p>But it couldn&#8217;t present those answers clearly. It couldn&#8217;t follow instructions. And it didn&#8217;t know how to speak with the user in mind.</p><p>It was a prototype for reasoning, not a system ready to deploy.</p><p>The outputs were verbose. The formatting was unstable. The language switched mid-sentence.</p><p>And beyond STEM-style tasks, it struggled to handle general prompts &#8212; writing, summarisation, chat, translation.</p><p>R1-Zero made reasoning emerge.</p><p>The next challenge was: <strong>how do you keep that reasoning and shape it into something useful?</strong></p><p>That&#8217;s what DeepSeek solved with R1.</p><p>But they didn&#8217;t solve it with more of the same.</p><p>They didn&#8217;t stack another RL pass or throw in alignment data midstream.</p><p>They built a <strong>multi-stage refinement pipeline,</strong> where each phase:</p><ul><li><p>Solved a real, traceable failure from the one before</p></li><li><p>Preserved the capabilities that had already emerged</p></li><li><p>Introduced exactly what was needed &#8212; no more, no less</p></li></ul><p>In the next four stages, they transformed a raw reasoner into a structured, general, aligned system, <strong>without breaking the behaviour they had trained from scratch.</strong></p><p>Let&#8217;s break that pipeline down &#8212; one stage at a time.</p><h3>Stage 1: Cold-Start Fine-Tuning</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gX1R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gX1R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!gX1R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!gX1R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!gX1R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gX1R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:671448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gX1R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!gX1R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!gX1R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!gX1R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffbde88-2b0d-4136-a9f2-cf8b528be9a8_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Goal:</strong> Fix readability, enforce output structure, and prevent language mixing.</p><p>R1-Zero&#8217;s outputs followed a basic structure:</p><pre><code><code>&lt;think&gt; reasoning steps &lt;/think&gt;
&lt;answer&gt; final result &lt;/answer&gt;</code></code></pre><p>This worked for reward scoring, but failed in practice. The model&#8217;s completions showed:</p><ul><li><p>Inconsistent or incoherent formatting</p></li><li><p>Excessive verbosity</p></li><li><p>Frequent English&#8211;Chinese language switching</p></li><li><p>No clear, summarised final answer</p></li></ul><p>To stabilise the output, DeepSeek curated a cold-start <strong>supervised</strong> dataset composed of:</p><ul><li><p>Few-shot prompted completions</p></li><li><p>Zero-shot generations</p></li><li><p>Manually refined outputs from R1-Zero</p></li></ul><p>They introduced a new output structure:</p><pre><code><code>&lt;reasoning_process&gt; structured reasoning steps &lt;/reasoning_process&gt;
&lt;summary&gt; user-facing final answer &lt;/summary&gt;
</code></code></pre><p>This format improved outputs by:</p><ul><li><p>Separating internal logic from final messaging</p></li><li><p>Constraining tone and fluency</p></li><li><p>Removing ambiguity in how the model should present conclusions</p></li></ul><p>By showing it labelled dataset of good completions</p><p>The model was fine-tuned briefly on this dataset.</p><p>Not to teach new reasoning, but to create a stable interface between internal CoT and external output.</p><p>This format became the foundation for subsequent reward modelling and scoring.</p><h3>Stage 2: Reasoning-Oriented Reinforcement Learning</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mI7E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mI7E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!mI7E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!mI7E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!mI7E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mI7E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:332897,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mI7E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!mI7E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!mI7E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!mI7E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f4465b1-89d4-409e-958d-85fe2788271b_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Goal:</strong> Improve reasoning performance and enforce language consistency using structured RL.</p><p>With output structure stabilised through cold-start fine-tuning, DeepSeek returned to reinforcement learning to strengthen reasoning performance.</p><p>While the model could now follow the <code>&lt;reasoning&gt;</code> and <code>&lt;summary&gt;</code> format, it still exhibited two key issues:</p><ul><li><p><strong>Incomplete reasoning convergence</strong> &#8212; performance on math, coding, and logic tasks had room to improve</p></li><li><p><strong>Language mixing</strong> &#8212; particularly between English and Chinese, which impacted clarity and evaluation</p></li></ul><p>To address both, DeepSeek applied another round of <strong>large-scale reinforcement learning</strong>, using the <strong>same GRPO (Group Relative Policy Optimisation)</strong> algorithm as in R1-Zero.</p><h3>What was done</h3><p>In this stage, the reward function was updated to include two components:</p><ol><li><p><strong>Reasoning Accuracy Reward</strong></p><ul><li><p>Based on whether the final result was correct (e.g., boxed answer correctness in math, compilation and test success in code)</p></li></ul></li><li><p><strong>Language Consistency Reward</strong></p><ul><li><p>Measured by the <strong>proportion of tokens</strong> in the target language</p></li><li><p>Outputs with mixed-language tokens were penalised</p></li></ul></li></ol><p>This reward function was applied over reasoning-intensive tasks &#8212; specifically math, science, code, and logic &#8212; and training continued until convergence on those benchmarks.</p><h3>What this enabled</h3><p>This stage further strengthened the model&#8217;s reasoning ability &#8212; now with stable formatting, improved correctness, and monolingual output &#8212; <strong>without introducing alignment or general-purpose behaviours yet</strong>.</p><p>By reinforcing under tightly defined reward signals and clean output structure, the model was now ready to scale into broader domains.</p><h3>Stage 3: Rejection Sampling + Supervised Fine-Tuning</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7cen!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7cen!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!7cen!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!7cen!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!7cen!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7cen!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:379163,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7cen!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!7cen!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!7cen!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!7cen!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf6ae33-7e5b-4510-bbcb-5fc6dc646642_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Goal:</strong> Broaden model capability beyond reasoning tasks, while preserving structure and quality.</p><p>After reinforcement learning in Stage 2, the model demonstrated strong performance on reasoning-heavy benchmarks, such as math, coding, science, and logic.</p><p>But it was still limited to those domains. It lacked general-purpose abilities across:</p><ul><li><p>Writing and role-play</p></li><li><p>Factual question answering</p></li><li><p>Translation</p></li><li><p>Dialogue and open-ended tasks</p></li></ul><p>To expand coverage without compromising reasoning quality, DeepSeek constructed a new supervised dataset, composed of both:</p><ul><li><p><strong>Self-generated high-quality reasoning examples</strong>, and</p></li><li><p><strong>General-task examples</strong> from DeepSeek-V3&#8217;s alignment pipeline</p></li></ul><h3>What was done</h3><p>The new training set included approximately <strong>800K samples</strong>, split as follows:</p><h3>Reasoning Data (~600K)</h3><ul><li><p>Generated by running prompts through the <strong>Stage 2 model checkpoint</strong></p></li><li><p>For each prompt, <strong>multiple completions were sampled</strong></p></li><li><p>Each completion was scored using:</p><ul><li><p><strong>Rule-based rewards</strong> (correctness, format)</p></li><li><p><strong>Judgment models</strong> from DeepSeek-V3</p></li></ul></li><li><p>Only the <strong>highest-rewarded completions</strong> were kept, using <strong>rejection sampling</strong></p></li></ul><h3>Non-Reasoning Data (~200K)</h3><ul><li><p>Reused from DeepSeek-V3&#8217;s SFT pipeline</p></li><li><p>Domains included:</p><ul><li><p>Role-play</p></li><li><p>Factual QA</p></li><li><p>Writing</p></li><li><p>Translation</p></li><li><p>Self-cognition</p></li></ul></li><li><p>CoT was selectively included using prompting or omitted for simpler queries</p></li></ul><h3>Output format handling</h3><ul><li><p>Reasoning examples retained the <code>&lt;reasoning&gt;</code> and <code>&lt;summary&gt;</code> format</p></li><li><p>For non-reasoning tasks, this structure was not always enforced</p></li><li><p>Some factual tasks used only <code>&lt;summary&gt;</code>, and others followed typical chat-style instructions</p></li></ul><p>This flexible formatting ensured that reasoning quality was preserved while adapting outputs to the task type.</p><h3>Training configuration</h3><ul><li><p>The combined dataset was used to fine-tune the model for <strong>two epochs</strong></p></li><li><p>No additional alignment rewards or reinforcement were introduced at this stage</p></li><li><p>The goal was to solidify generalisation while maintaining structured output for reasoning tasks</p></li></ul><h3>What it enabled</h3><p>By combining curated self-generated reasoning traces with diverse, human-aligned general tasks, this stage produced a model that could:</p><ul><li><p>Reason deeply</p></li><li><p>Communicate fluently</p></li><li><p>Generalise across prompt styles and domains</p></li></ul><p>And it did so <strong>without erasing</strong> the carefully reinforced behaviours from prior stages.</p><p>The next step was to align it with helpfulness and safety under real-world constraints.</p><h3>Stage 4: Reinforcement for Alignment and Safety</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!da8b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!da8b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!da8b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!da8b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!da8b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!da8b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:310988,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!da8b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!da8b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!da8b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!da8b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b1e6dd-41e9-406d-84b0-8a0df0f52a39_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Goal:</strong> Align behaviour across general scenarios using reward models for helpfulness and harmlessness.</p><p>By the end of Stage 3, the model could reason fluently and generalise across domains, but it still lacked behavioural alignment in subjective and open-ended tasks.</p><p>It wasn&#8217;t reliably helpful. It didn&#8217;t avoid unsafe completions. It didn&#8217;t consistently reflect human intent in tasks like summarisation, chat, or instruction following.</p><p>To address this, DeepSeek introduced a final round of reinforcement learning, focused on <strong>alignment</strong>.</p><h3>What was done</h3><p>DeepSeek applied additional RL training using a new set of reward signals.</p><h3>For reasoning tasks:</h3><ul><li><p>The model continued to receive <strong>rule-based rewards</strong>, as in previous stages</p></li></ul><h3>For general tasks:</h3><ul><li><p>DeepSeek used <strong>reward models from DeepSeek-V3</strong> to capture alignment signals:</p><ul><li><p><strong>Helpfulness</strong> &#8212; evaluated over the <code>&lt;summary&gt;</code> portion</p></li><li><p><strong>Harmlessness</strong> &#8212; evaluated over the entire response</p></li></ul></li></ul><p>From the paper:</p><blockquote><p>&#8220;For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios.&#8221; (&#167;2.3.4)</p></blockquote><p>While the paper doesn&#8217;t specify how these reward models were trained, it makes clear that they were used to <strong>model human preferences</strong> for prompts where correctness or structure alone couldn&#8217;t define quality.</p><p>Prompts were drawn from a diverse distribution, and each was evaluated using the appropriate reward signal, depending on whether it was a reasoning or general instruction-following task.</p><h3>What it enabled</h3><p>This final reinforcement phase aligned the model&#8217;s outputs with human expectations, making it more helpful, more appropriate, and more consistent across real-world use cases.</p><p>With this step, DeepSeek-R1 became not just a reasoner, but a usable system that combined logical capability with structured communication, generalisation, and safety.</p><h2>Distilling Reasoning &#8212; Teaching Smaller Models to Think</h2><p><strong>Goal:</strong> Transfer R1&#8217;s reasoning ability into smaller, open-weight models using SFT alone.</p><p>Once DeepSeek-R1 had been trained, the next question was:</p><p>Can its reasoning capability be <strong>transferred,</strong> not just deployed?</p><p>Instead of running expensive RL loops on every downstream model, DeepSeek explored whether smaller base models could be taught to reason <strong>by learning from R1&#8217;s behaviour directly</strong>.</p><p>This wasn&#8217;t about copying parameters &#8212; it was about <strong>teaching via output</strong>.</p><p>And it worked.</p><h3>What was done</h3><p>DeepSeek used the <strong>~800K dataset</strong> created in Stage 3 &#8212; made up of high-quality reasoning and general-task examples &#8212; to distil R1 into a new series of models.</p><p>They fine-tuned several base models using <strong>supervised learning only</strong> (no reinforcement):</p><ul><li><p><strong>Qwen2.5 series</strong>: 1.5B, 7B, 14B, 32B</p></li><li><p><strong>Llama3 series</strong>: 8B, 70B</p></li></ul><p>Each of these base models was <strong>fine-tuned using the R1 dataset</strong>, capturing both its structured reasoning and generalisation behaviours.</p><blockquote><p>&#8220;We fine-tune several dense models&#8230;using the reasoning data generated by DeepSeek-R1.&#8221; &#8212; &#167;2.4</p></blockquote><p>No reinforcement learning was applied during distillation.</p><p>The distilled models learned entirely by <strong>mimicking R1&#8217;s output,</strong> supervised fine-tuning on prompts and completions.</p><h3>What emerged</h3><p>The results showed that <strong>R1&#8217;s reasoning capability could be transferred</strong>, even without re-running the RL loop.</p><ul><li><p><strong>DeepSeek-R1-Distill-Qwen-14B</strong> outperformed <strong>QwQ-32B-Preview</strong></p></li><li><p><strong>DeepSeek-R1-Distill-Qwen-32B</strong> achieved:</p><ul><li><p><strong>72.6%</strong> on AIME 2024</p></li><li><p><strong>94.3%</strong> on MATH-500</p></li><li><p><strong>57.2%</strong> on LiveCodeBench</p></li></ul></li><li><p>These models <strong>surpassed o1-mini</strong> on several reasoning-heavy benchmarks</p></li></ul><p>This proved that reasoning, once made emergent in a larger model, could be <strong>replicated downstream</strong>, even in smaller dense architectures.</p><h3>What it means for builders</h3><p>This distillation loop wasn&#8217;t just about compressing size &#8212; it was about compressing <strong>capability</strong>.</p><p>It showed that:</p><ul><li><p>Small models <strong>can</strong> reason</p></li><li><p>But only if the teacher model learned to reason first</p></li><li><p>Reinforcement learning isn&#8217;t always scalable, but its outcomes <strong>can be scaled</strong> through careful distillation</p></li></ul><p>For any builder working on LLMs with limited compute, this changes the calculus.</p><p>You don&#8217;t need to start with a reasoning-capable small model.</p><p>You need a good teacher.</p><p>And if the teacher is something like R1, you might only need supervised fine-tuning to get very far.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If this made your mental model sharper, there&#8217;s more where that came from. Come become a Sager &#10083;&#65039;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Final Mental Models &#8212; What This Issue Leaves You With</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DEML!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DEML!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!DEML!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!DEML!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!DEML!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DEML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:512752,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163610419?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DEML!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!DEML!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!DEML!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!DEML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f902cb-15c4-461e-96ff-a12f321c29a9_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>You can train reasoning without examples, but only if you can score it.</strong></p><p>R1-Zero proved this. The constraint wasn&#8217;t data &#8212; it was verifiability. Reward mattered more than supervision.</p></li><li><p><strong>You can't reinforce cleanly until your outputs are readable.</strong></p><p>Cold-start SFT wasn&#8217;t for alignment &#8212; it was to create a trainable structure. Format isn&#8217;t UX. It&#8217;s part of the loop.</p></li><li><p><strong>Language consistency is a rewardable trait, not a hardcoded switch.</strong></p><p>R1 didn&#8217;t block multilingual output. It made consistency the path to reward. That design generalises.</p></li><li><p><strong>Distillation works only when the source system has the right behaviours.</strong></p><p>No small model figured it out from scratch. The ones that worked copied output from a pipeline that already had reasoning baked in.</p></li><li><p><strong>Strong models aren't trained once. They're debugged in passes.</strong></p><p>Each stage in R1 fixed something broken by the last, without losing what worked. That&#8217;s the real blueprint.</p></li></ul><h2>References &amp; Further Reading</h2><ul><li><p><a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</a></p></li><li><p><a href="https://aiengineering.academy/LLM/TheoryBehindFinetuning/GRPO/">The Theory Behind GRPO (Group Relative Policy Optimization)</a></p></li><li><p><a href="https://huggingface.co/blog/deep-rl-ppo">Understanding PPO (Proximal Policy Optimization)</a></p></li><li><p><a href="https://deepmind.google/discover/blog/alphago-zero-starting-from-scratch/">AlphaGo Zero: Starting From Scratch</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[The Engineer’s Guide to RAG]]></title><description><![CDATA[Make Your Dumb Model Useful: A Dead-Simple Guide to Retrieval-Augmented Generation]]></description><link>https://blog.neosage.io/p/the-engineers-guide-to-rag</link><guid isPermaLink="false">https://blog.neosage.io/p/the-engineers-guide-to-rag</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Wed, 07 May 2025 20:56:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern LLMs are powerful &#8212; but they&#8217;re also static.</p><p>Once a model is trained, it can&#8217;t learn anything new.</p><p>No internet. No updates. No awareness of your internal docs, customer tickets, or product features.</p><p>That&#8217;s because language models aren&#8217;t knowledge bases.</p><p>They don&#8217;t &#8220;look up&#8221; information.</p><p>They predict the next token based on patterns seen during training, and that training ended months ago.</p><p>Yet in real-world systems, most queries <strong>aren&#8217;t about abstract language patterns</strong>.</p><p>They&#8217;re about <strong>your data</strong>.</p><ul><li><p>What&#8217;s our refund policy?</p></li><li><p>What did the customer say in the last chat?</p></li><li><p>What does this API do?</p></li></ul><p>These are context-dependent questions.</p><p>And answering them requires injecting the right context at the right time.</p><p>That&#8217;s where <strong>Retrieval-Augmented Generation (RAG)</strong> comes in.</p><p>This issue is your technical blueprint.</p><p>We&#8217;ll walk through:</p><ul><li><p>Why static LLMs fall short</p></li><li><p>Why fine-tuning isn&#8217;t always the fix</p></li><li><p>And how RAG lets your model access fresh, dynamic, grounded context, without touching a single weight</p></li></ul><p>By the end, you won&#8217;t just understand RAG.</p><p>You&#8217;ll know when, how, and why to use it in production.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aJQ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aJQ_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!aJQ_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!aJQ_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!aJQ_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aJQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:550897,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163055255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aJQ_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!aJQ_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!aJQ_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!aJQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb128796e-ec97-45fa-b934-057b68b8e796_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><h2>The Real Problem</h2><p>LLMs aren&#8217;t dynamic systems.</p><p>They&#8217;re static functions &#8212; mapping input tokens to output tokens based on frozen training data.</p><p>This has two major consequences for real-world applications:</p><h3>1. <strong>They can&#8217;t access private or real-time information.</strong></h3><p>Your model might be brilliant at writing SQL &#8212;</p><p>But it knows nothing about your schemas, tables, or naming conventions.</p><p>It might be great at summarising &#8212;</p><p>But it can&#8217;t summarise your product docs if it&#8217;s never seen them.</p><h3>2. <strong>They hallucinate confidently when they don&#8217;t know.</strong></h3><p>LLMs are next-token predictors.</p><p>When they lack relevant context, they don&#8217;t say &#8220;I don&#8217;t know.&#8221;</p><p>They interpolate. And that often leads to fabricated answers, which look fluent but fail under scrutiny.</p><p>This isn&#8217;t a bug &#8212; it&#8217;s a design constraint.</p><p>A pretrained model is a static snapshot.</p><p>If your application needs current, personalised, or proprietary knowledge, you need to <em>pipe that knowledge in at inference time</em>.</p><p>That&#8217;s the core challenge RAG is designed to solve.</p><p>But before we get to RAG, let&#8217;s zoom out and explore all the ways builders try to solve this data gap.</p><h2>Why Prompting and Fine-Tuning Can&#8217;t Solve the Knowledge Gap</h2><p>Once you realise your model doesn&#8217;t know your product, your user history, or your internal docs, the next question is:</p><p><strong>How do we teach it?</strong></p><p>There are multiple strategies to work with LLMs in the application layer, two of which are:</p><p><strong>Prompt Engineering</strong> and <strong>Fine-Tuning.</strong></p><p>Both can be powerful, but neither truly solves the problem we&#8217;re dealing with:</p><blockquote><p>Giving a frozen model access to dynamic, user-specific, or time-sensitive knowledge.</p></blockquote><p>Let&#8217;s step through them.</p><h3>1. <strong>Prompt Engineering &#8212; Helpful, But Limited</strong></h3><p>Prompting is about <strong>shaping the model&#8217;s behaviour</strong> at inference time.</p><p>You&#8217;re not teaching it new facts &#8212; you&#8217;re teaching it <em>how to respond</em> based on what it already knows.</p><p>Prompt engineering is useful for:</p><ul><li><p>Formatting answers</p></li><li><p>Steering tone and voice</p></li><li><p>Guiding reasoning (e.g. Chain-of-Thought, ReAct)</p></li><li><p>Enforcing structure (e.g. JSON output, few-shot examples)</p></li></ul><p>But here&#8217;s the core limitation:</p><blockquote><p>Its primary role is to guide how the model behaves &#8212; any data you add to the prompt improves response quality, but sourcing that data isn&#8217;t what prompt engineering solves.</p></blockquote><p>It might be easy to confuse prompt engineering with prompt augmentation.</p><p>But adding new context into the prompt, like search results or documentation snippets, is a separate step.</p><p>That&#8217;s not prompt <em>engineering</em>.</p><p>That&#8217;s <strong>prompt augmentation</strong> &#8212; and <strong>that&#8217;s what RAG is built to automate.</strong></p><p>So while prompt engineering improves fluency and structure, it does <strong>nothing</strong> for grounding the model in <em>your</em> data.</p><h3>2. <strong>Fine-Tuning &#8212; Powerful, But Inflexible</strong></h3><p>Fine-tuning is about modifying the model&#8217;s weights.</p><p>You take a base model and train it further on new examples &#8212; either task-specific, domain-specific, or instruction-style.</p><p>This helps in scenarios like:</p><ul><li><p>Teaching the model legal or medical terminology</p></li><li><p>Improving performance on repetitive, structured workflows</p></li><li><p>Adapting to company-specific language or formats</p></li></ul><p>But in the context of our core problem &#8212; <strong>giving a model access to live, evolving, or user-specific data</strong> &#8212; fine-tuning has major limitations:</p><ul><li><p>&#10060; <strong>It&#8217;s slow and costly</strong> &#8212; requires GPUs, training infra, and QA cycles</p></li><li><p>&#10060; <strong>It&#8217;s brittle</strong> &#8212; every update means retraining or risking drift</p></li><li><p>&#10060; <strong>It&#8217;s static</strong> &#8212; the model remains locked after each fine-tune</p></li><li><p>&#10060; <strong>It&#8217;s inflexible</strong> &#8212; different users or contexts need different versions</p></li></ul><p>Fine-tuning is best when your knowledge is <strong>stable</strong> and your tasks are <strong>narrow</strong>.</p><p>But it falls apart when you want a model to respond to:</p><ul><li><p>&#8220;What&#8217;s the latest version of our API docs?&#8221;</p></li><li><p>&#8220;What did this customer say in their last ticket?&#8221;</p></li><li><p>&#8220;What changed in the HR policy last week?&#8221;</p></li></ul><blockquote><p>That&#8217;s not a training problem.</p><p>That&#8217;s a <strong>retrieval</strong> problem.</p></blockquote><p>And that brings us to the solution this issue is all about &#8212; one that doesn&#8217;t update the model at all, but updates what the model <em>sees</em> at runtime.</p><p>Let&#8217;s talk about RAG.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0tAj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0tAj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!0tAj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!0tAj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!0tAj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0tAj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225184,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163055255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0tAj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!0tAj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!0tAj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!0tAj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb64f8a1b-678f-4bff-9811-206116efa8ed_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>So What <em>Is</em> RAG?</h2><p>At its core, <strong>Retrieval-Augmented Generation (RAG)</strong> is a simple but powerful pattern:</p><blockquote><p>Instead of retraining the model, you retrieve relevant information from an external source and inject it into the prompt &#8212; at inference time.</p></blockquote><p>That&#8217;s it.</p><p>No gradient updates.</p><p>No fine-tune cycles.</p><p>Just smarter input.</p><p>Here&#8217;s the distinction that matters:</p><ul><li><p>The model <strong>remains frozen</strong> &#8212; it still runs the same next-token prediction function.</p></li><li><p>What changes is the <strong>context window</strong>:</p><p>You augment it with fresh, task-relevant knowledge pulled from a database, knowledge base, or internal document store.</p></li></ul><p>In other words:</p><blockquote><p>RAG doesn't teach the model.</p><p>It <strong>feeds</strong> it better inputs &#8212; right when it needs them.</p></blockquote><p>This makes RAG fundamentally different from:</p><ul><li><p><strong>Fine-tuning</strong> (changes the model)</p></li><li><p><strong>Prompt engineering</strong> (tweaks behaviour)</p></li><li><p><strong>Tool use</strong> (delegates tasks)</p></li></ul><p>RAG treats the model as a black box and solves the problem outside it.</p><p>It shifts the system design from <strong>"How do I modify the model?"</strong> to</p><p><strong>"How do I retrieve and inject the right context before the model answers?"</strong></p><p>This single shift unlocks:</p><ul><li><p>Real-time updates without retraining</p></li><li><p>Personalisation per user/session</p></li><li><p>Seamless integration of internal knowledge</p></li></ul><p>But it also introduces a new bottleneck:</p><p>Your model is now only as good as what you retrieve and what you stuff into the context window.</p><blockquote><p>RAG moves the complexity from training to retrieval.</p><p>That&#8217;s not a simplification &#8212; it&#8217;s a <strong>re-architecture.</strong></p></blockquote><h2>What RAG Really Does</h2><p>At its simplest, RAG (Retrieval-Augmented Generation) does three things:</p><ol><li><p><strong>Retrieves</strong> the most relevant data based on your question</p></li><li><p><strong>Augments</strong> the LLM&#8217;s prompt with that data</p></li><li><p><strong>Generates</strong> a grounded response using both the query and retrieved context</p></li></ol><blockquote><p>You&#8217;re not teaching the model anything new.</p><p>You&#8217;re giving it just enough information to answer the question <em>as if</em> it knew.</p></blockquote><p>That&#8217;s it.</p><p>Imagine ChatGPT &#8212; but before it responds, you hand it a Post-it Note saying:</p><p>&#8220;By the way, here&#8217;s what our refund policy says.&#8221;</p><p>And then it writes the answer. That&#8217;s RAG.</p><h2>The Retrieval-Augmented Loop</h2><p>Now let&#8217;s step through what&#8217;s actually happening under the hood.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r8nn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r8nn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!r8nn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!r8nn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!r8nn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r8nn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:476061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163055255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r8nn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!r8nn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!r8nn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!r8nn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07132cba-cf1f-4e9d-8780-6f1aae5b5ccd_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Step 1: <strong>User Query</strong></h3><p>The user asks a question in plain language:</p><blockquote><p>&#8220;What&#8217;s our refund policy for digital products?&#8221;</p></blockquote><p>At this point, the model on its own doesn&#8217;t have a clue.</p><p>It wasn&#8217;t trained on your policy docs.</p><p>So instead of letting it hallucinate &#8212;</p><p>we retrieve the answer from a trusted source.</p><h3>Step 2: <strong>Embed the Query</strong></h3><p>We convert the user&#8217;s query into a <strong>vector</strong> &#8212; a dense numerical representation that captures <strong>semantic meaning</strong>.</p><p>This is called an <strong>embedding</strong>.</p><ul><li><p>The sentence <em>&#8220;refund for digital orders&#8221;</em></p><p>might produce a vector like <code>[0.22, -0.87, 1.03, ...]</code> &#8212; 768+ numbers long.</p></li><li><p>Another sentence like <em>&#8220;return policy for ebooks&#8221;</em> would produce something nearby in vector space.</p></li></ul><p>These vectors aren&#8217;t based on keywords &#8212; they&#8217;re based on <strong>meaning</strong>.</p><blockquote><p>Embeddings let us match &#8220;What&#8217;s your refund policy?&#8221;</p><p>to a sentence that says, &#8220;Customers can request a refund within 7 days of purchase.&#8221;</p></blockquote><p>That&#8217;s how semantic search works &#8212; it&#8217;s <strong>meaning-based</strong>, not string-based.</p><h3>Step 3: <strong>Vector Search Over Your Corpus</strong></h3><p>We now search this query vector against a <strong>vector database</strong>, like <a href="https://weaviate.io/">Weaviate</a>, <a href="https://qdrant.tech/">Qdrant</a>, <a href="https://www.pinecone.io/">Pinecone</a>, or <a href="https://ai.meta.com/tools/faiss/">FAISS</a>.</p><p>But this database doesn&#8217;t store raw documents.</p><p>It stores <strong>pre-chunked, pre-embedded pieces of your knowledge base</strong>, such as:</p><ul><li><p>Individual help articles</p></li><li><p>Paragraphs from your product manual</p></li><li><p>Snippets of legal policy text</p></li><li><p>Past conversations or support threads</p></li></ul><p>Each one is already embedded as a vector.</p><blockquote><p>Now we calculate which of those vectors are closest to our query in vector space.</p></blockquote><p>This is where &#8220;top-K retrieval&#8221; happens. We fetch the <strong>K most semantically similar</strong> pieces of content.</p><h3>Step 4: <strong>Return Top-K Chunks</strong></h3><p>Let&#8217;s say your query embedding matches 3 chunks closely:</p><ol><li><p>&#8220;Refunds for digital products must be requested within 7 days.&#8221;</p></li><li><p>&#8220;Refund requests can be submitted via dashboard or email.&#8221;</p></li><li><p>&#8220;Physical products have a 30-day return window.&#8221;</p></li></ol><p>These chunks are returned, often with scores.</p><p>In basic setups, we stop here.</p><p>But in smarter systems, we might apply:</p><ul><li><p><strong>Re-ranking</strong>: to push the most relevant one to the top</p></li><li><p><strong>Filtering</strong>: to remove irrelevant ones</p></li><li><p><strong>Scoring models</strong>: to judge answerability</p></li></ul><p>More on those in a minute.</p><h3>Step 5: <strong>Inject Into Prompt</strong></h3><p>These retrieved chunks are then formatted and <strong>injected directly into the LLM&#8217;s prompt</strong>.</p><p>It might look like this:</p><pre><code><code>Context:
1. Refunds for digital products must be requested within 7 days.
2. Refund requests can be submitted via dashboard or email.

Question:
What&#8217;s our refund policy for digital products?</code></code></pre><p>To the model, this is just part of the input.</p><p>It has no idea it came from retrieval &#8212; it just treats it like any other text.</p><h3>Step 6: <strong>LLM Responds Grounded in That Context</strong></h3><p>The LLM reads the entire prompt &#8212; your system instructions, the context chunks, and the query.</p><p>Then it does what it always does:</p><p><strong>predict the next most likely tokens.</strong></p><p>But now, because the context window contains the right information, the response is grounded:</p><blockquote><p>&#8220;Customers can request a refund for digital products within 7 days of purchase. You can do this via dashboard or by email.&#8221;</p></blockquote><p>No hallucination. No guessing.</p><p>Just answering with what you gave it.</p><h2>And in Smarter RAG Systems?</h2><p>The flow stays the same, but we enhance individual steps to boost relevance and reliability.</p><p>Advanced RAG systems often include:</p><ul><li><p><strong>Re-ranking</strong>: Using a second model (e.g. cross-encoder) to rescore and reorder the top-K chunks</p></li><li><p><strong>Query rewriting</strong>: Transforming vague or underspecified user queries into more precise ones</p></li><li><p><strong>Chunk scoring</strong>: Assessing how well a chunk answers the question before injecting it</p></li><li><p><strong>Context pruning</strong>: Removing low-value or redundant content to save tokens</p></li><li><p><strong>Routing models</strong>: Choosing between different knowledge sources, agents, or workflows dynamically</p></li></ul><blockquote><p>These are not optional tricks &#8212; they&#8217;re often what separates production-ready RAG systems from toy demos.</p></blockquote><h2><strong>Data Is God</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8jOf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8jOf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!8jOf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!8jOf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!8jOf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8jOf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:357916,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163055255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8jOf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!8jOf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!8jOf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!8jOf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8845ae6-c316-44a9-b4cc-9f429b86b65d_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once you understand how RAG works, it&#8217;s tempting to think the model is doing the heavy lifting.</p><p>But here&#8217;s the truth:</p><blockquote><p>The model just fills in blanks.</p><p>The real work happens <em>before</em> the prompt is ever built.</p></blockquote><p>If your retrieval layer is weak, your output will be wrong, no matter how smart your LLM is.</p><p>That&#8217;s why in production-grade RAG, <strong>preprocessing and ingestion</strong> are the <em>hardest and most critical steps</em>.</p><h3>What Makes or Breaks a RAG System?</h3><p>Let&#8217;s rewind the loop from earlier:</p><p>Your user asks a question.</p><p>The retrieval system tries to find the most relevant information.</p><p>If it fails, the model fails.</p><p>So what decides whether retrieval succeeds?</p><p><strong>How you structure and prepare the data.</strong></p><p>This is the part no demo talks about.</p><h2>Why Indexing Isn&#8217;t Enough &#8212; You Need Ingestion</h2><p>Most people think:</p><blockquote><p>&#8220;We&#8217;ll just chunk our docs and push them into Pinecone.&#8221;</p></blockquote><p>That&#8217;s not ingestion. That&#8217;s dumping.</p><p>Proper ingestion is <strong>curation, segmentation, and semantic structuring</strong>.</p><p>You need to:</p><ul><li><p>Clean the content (remove footers, junk text, irrelevant sections)</p></li><li><p>Split intelligently (not just every 500 characters)</p></li><li><p>Preserve relationships (e.g. question + answer, section + header)</p></li><li><p>Tag metadata (source, author, timestamp, type)</p></li><li><p>Embed using the <em>right</em> model (some are better for short queries, others for long-form)</p></li></ul><h3>Chunking: The Hidden Minefield</h3><p>Most RAG failures come down to bad chunking.</p><p>If your chunk is:</p><ul><li><p>Too long &#8594; it never gets retrieved</p></li><li><p>Too short &#8594; lacks meaningful context</p></li><li><p>Split mid-sentence &#8594; loses meaning</p></li><li><p>Contains dense code/docs &#8594; model can&#8217;t parse structure</p></li></ul><p>&#8230;you&#8217;re injecting garbage into the prompt.</p><p>And remember:</p><blockquote><p>LLMs don&#8217;t reason over your entire corpus.</p><p>They only see the few chunks you retrieved &#8212; in a window capped by token limits.</p></blockquote><p>If <em>those</em> chunks are bad, it&#8217;s over.</p><h3>Embeddings: Not All Are Equal</h3><p>The purpose of an embedding model is simple, but critical:</p><blockquote><p>To map semantically related inputs close together in vector space, so that a query retrieves meaningfully relevant chunks.</p></blockquote><p>But <strong>semantic relationships aren&#8217;t fixed</strong> &#8212; they shift with <strong>domain, task, and context</strong>.</p><p>In general-purpose domains, off-the-shelf models like <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI&#8217;s </a><em><a href="https://platform.openai.com/docs/guides/embeddings">text-embedding-3-small</a> </em>might work:</p><ul><li><p>&#8220;Refund policy&#8221; and &#8220;money-back guarantee&#8221; might be embedded closely</p></li><li><p>&#8220;Cancel subscription&#8221; and &#8220;stop membership&#8221; land nearby</p></li></ul><p>But in your company&#8217;s knowledge base?</p><ul><li><p>&#8220;Workflow&#8221; might mean <em>approval rules</em> in legal, <em>automation</em> in ops, or <em>DAGs</em> in engineering</p></li><li><p>&#8220;Runbook&#8221; could refer to <em>on-call procedures</em> or <em>ML model deployment steps</em></p></li></ul><blockquote><p>These distinctions don&#8217;t exist in general language models &#8212; and they won&#8217;t be captured in their embeddings.</p></blockquote><p>That&#8217;s when off-the-shelf breaks down.</p><p>To retrieve the right chunks, your embedding space needs to reflect your world, not just the internet&#8217;s.</p><p>And that&#8217;s where <strong>fine-tuned embedding models</strong> come in:</p><ul><li><p>Aligned to your jargon, naming conventions, and relationships</p></li><li><p>Trained to treat &#8220;access token&#8221; and &#8220;JWT&#8221; as close, if that&#8217;s how your org writes</p></li><li><p>Able to embed meaning that&#8217;s invisible to a model trained on Stack Overflow and Wikipedia</p></li></ul><p>So yes, vector search works.</p><p>But without embedding models that understand your context, you&#8217;re just retrieving based on <em>someone else&#8217;s semantics.</em></p><p>And that&#8217;s where most RAG pipelines silently fail.</p><h3>Metadata + Filtering = Precision</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VNEQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VNEQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!VNEQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!VNEQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!VNEQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VNEQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:337698,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163055255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VNEQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!VNEQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!VNEQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!VNEQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42812e42-09a9-46f0-ba87-944ef94aad3d_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>RAG gets exponentially better when you add:</p><ul><li><p>Document-level metadata (source, product, region, date)</p></li><li><p>Filters to narrow down scope (e.g., &#8220;only look at API docs&#8221;)</p></li><li><p>Hierarchical indexing (parent-child chunking with recall context)</p></li></ul><p>Why?</p><p>Because relevance isn&#8217;t always semantic. Sometimes it&#8217;s structural.</p><blockquote><p>&#8220;Refund policy&#8221; may match 20 pages &#8212;</p><p>But only the one from 2024, authored by Legal, is correct.</p></blockquote><h3>RAG Is Not Plug-and-Play</h3><p>In simple demos, RAG looks magical. Ask a question &#8594; get a grounded answer.</p><p>But in production?</p><blockquote><p>RAG is a data engineering problem disguised as an NLP trick.</p></blockquote><p>And the teams who succeed with it are the ones who:</p><ul><li><p>Treat document ingestion like software engineering</p></li><li><p>Own their preprocessing pipeline like an ML pipeline</p></li><li><p>Monitor retrieval quality, not just model latency</p></li></ul><p>So yes, LLMs are powerful.</p><p>But in RAG?</p><blockquote><p>The system only works if your index is a reflection of reality.</p><p>And building that index&#8230; is where the real engineering lives.</p></blockquote><h2>Where RAG Fails (And Why It&#8217;s Not Magic)</h2><p>At this point, RAG might sound like the cleanest solution to the static LLM problem &#8212; and in many ways, it is.</p><p>But here&#8217;s the real picture:</p><blockquote><p>RAG isn&#8217;t a silver bullet. It&#8217;s a layered system &#8212; and every layer can break.</p></blockquote><p>Let&#8217;s walk through the most common failure points.</p><h3>1. <strong>Relevant Chunk Not Retrieved &#8594; Irrelevant Answer</strong></h3><p>This is the <strong>most frequent failure mode</strong> &#8212; and the easiest to miss.</p><p>If the retriever doesn&#8217;t surface the right chunk, the LLM will confidently answer using whatever is closest, even if it&#8217;s wrong.</p><p>You&#8217;ll see:</p><ul><li><p>Outdated policies getting returned</p></li><li><p>Answers pulled from unrelated but similar-sounding chunks</p></li><li><p>Hallucinated claims based on misleading context</p></li></ul><p>What&#8217;s broken here isn&#8217;t the model.</p><p>It&#8217;s retrieval quality, and that&#8217;s downstream of bad chunking, poor embeddings, or inadequate metadata filtering.</p><h3>2. <strong>Model Gets the Right Context &#8212; But Ignores It</strong></h3><p>Sometimes the retriever does its job.</p><p>You get the right chunk. The context is injected. Everything looks good.</p><p>But the answer?</p><p>Wrong, generic, or completely detached from the provided data.</p><p>What&#8217;s happening here?</p><blockquote><p>The model isn&#8217;t grounded. It&#8217;s guessing &#8212; blending pretrained knowledge with your context instead of sticking to it.</p></blockquote><p>This happens when:</p><ul><li><p>There&#8217;s <strong>no clear instruction</strong> to use <em>only</em> the retrieved content</p></li><li><p>The question is vague, but the context isn&#8217;t enforced</p></li><li><p>The model&#8217;s prior training overrides the injected source</p></li></ul><p>The result: plausible answers that contradict your ground truth.</p><p>This is especially dangerous in compliance or legal workflows, where <em>hallucinating within context</em> is worse than not answering at all.</p><blockquote><p>RAG isn&#8217;t just retrieval. It&#8217;s retrieval plus constraint.</p><p>Without both, you&#8217;re just helping the model hallucinate <em>better</em>.</p></blockquote><h3>3. <strong>Semantic Mismatch Between Query and Chunk</strong></h3><p>This one&#8217;s harder to spot.</p><p>Let&#8217;s say your index includes:</p><p>&#8220;Users are entitled to a full refund within 7 days.&#8221;</p><p>But the user asks:</p><blockquote><p>&#8220;Can I cancel and get my money back?&#8221;</p></blockquote><p>If your embeddings or retrieval method can&#8217;t connect &#8220;cancel&#8221; &#8594; &#8220;refund&#8221;, or &#8220;money back&#8221; &#8594; &#8220;entitled&#8221;,</p><p>&#8594; That chunk won&#8217;t surface.</p><p>This is where <strong>language gaps</strong>, <strong>jargon</strong>, and <strong>undertrained embedding models</strong> create false negatives.</p><p>It&#8217;s not about bad data. It&#8217;s about a <strong>missed semantic bridge.</strong></p><h3>4. <strong>Information is Split Across Chunks</strong></h3><p>Sometimes the answer isn&#8217;t in a single chunk &#8212; it lives across two or three.</p><p>E.g.:</p><ul><li><p>Chunk A says: <em>&#8220;Refunds available for digital products.&#8221;</em></p></li><li><p>Chunk B says: <em>&#8220;Refunds must be requested within 7 days.&#8221;</em></p></li></ul><p>Both are required for a full answer.</p><p>But standard (naive) RAG systems don&#8217;t do multi-hop synthesis well, especially if chunk order isn&#8217;t preserved or coherence is lost in truncation.</p><p>Unless you&#8217;ve designed your chunking and scoring to preserve continuity,</p><p>&#8594; You get partial answers, or worse, confident contradictions.</p><h3>5. <strong>The Right Data Isn&#8217;t in the Index at All</strong></h3><p>This is a classic ingestion blind spot.</p><p>Sometimes, the most relevant information:</p><ul><li><p>Lives in a format you didn&#8217;t ingest (e.g. image-based PDFs, buried tables, raw HTML)</p></li><li><p>Was missed due to bad parsing</p></li><li><p>Was updated in the source system, but your index is stale</p></li></ul><p>RAG can&#8217;t retrieve what isn&#8217;t there.</p><p>That&#8217;s why <strong>index observability and refresh strategies</strong> are part of any serious RAG system &#8212; not just &#8220;nice to have.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n_GO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n_GO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!n_GO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!n_GO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!n_GO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n_GO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:569679,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163055255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n_GO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!n_GO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!n_GO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!n_GO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5dffafa-25d0-44ff-b169-5b5d0812ab7d_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>RAG Doesn&#8217;t Fail Loudly &#8212; It Fails Silently</h2><p>And that&#8217;s the danger.</p><p>Unlike traditional software bugs, most RAG failures look like they worked:</p><ul><li><p>The model gives an answer</p></li><li><p>It&#8217;s grammatically correct</p></li><li><p>It sounds plausible</p></li></ul><p>But it&#8217;s wrong. And it&#8217;s grounded in the <em>wrong</em> data.</p><p>So you need to build for:</p><ul><li><p>Retrieval monitoring</p></li><li><p>Prompt observability</p></li><li><p>Failure evaluation beyond accuracy metrics</p></li></ul><p>Because in RAG, silence isn&#8217;t success.</p><blockquote><p>It might be a confident lie &#8212; and that&#8217;s the hardest kind to debug.</p></blockquote><h2>Beyond Just Docs &#8212; Smart RAG Systems</h2><p>At this point, it&#8217;s clear:</p><blockquote><p>Naive RAG breaks in all the places that matter.</p></blockquote><p>It retrieves the wrong thing. Or it retrieves the right thing and still gives the wrong answer.</p><p>It doesn&#8217;t know when to say &#8220;I don&#8217;t know.&#8221;</p><p>It doesn&#8217;t improve over time.</p><p>But that&#8217;s not the limit of what RAG can be.</p><p>Let&#8217;s talk about how smart systems fix it.</p><h3>The Mental Shift</h3><p>Most people think RAG = &#8220;vector search over some docs.&#8221;</p><p>But in production, <strong>RAG becomes an optimisation problem</strong>:</p><blockquote><p>How do we consistently retrieve the most useful, most answerable context &#8212; while minimizing noise, cost, and latency?</p></blockquote><p>Smart RAG systems don&#8217;t just retrieve.</p><p>They <strong>rank</strong>, <strong>filter</strong>, <strong>score</strong>, <strong>adapt</strong>, and <strong>learn</strong>.</p><p>Let&#8217;s walk through the upgrades that transform fragile RAG into robust retrieval-first infrastructure.</p><h3>1. <strong>Hybrid Retrieval Fixes Semantic Blind Spots</strong></h3><p><strong>Problem it solves:</strong></p><p>Vectors alone miss exact matches, rare entities, and domain-specific tokens.</p><p><strong>Fix:</strong></p><p>Use both:</p><ul><li><p><strong>Dense retrieval</strong> (for semantic similarity)</p></li><li><p><strong>Sparse retrieval</strong> (BM25 or keyword scoring for exact matches)</p></li></ul><p>This immediately improves:</p><ul><li><p>Retrieval precision on short, vague queries</p></li><li><p>Performance on identifiers (like error codes, SKUs, names)</p></li></ul><blockquote><p>Hybrid retrieval makes your system semantic and literal &#8212; exactly when it matters.</p></blockquote><h3>2. <strong>Re-ranking Models Pick the </strong><em><strong>Right</strong></em><strong> Top-K</strong></h3><p><strong>Problem it solves:</strong></p><p>Top-K based on cosine similarity &#8800; most useful chunks.</p><p><strong>Fix:</strong></p><p>After initial retrieval, use a <strong>cross-encoder</strong> (like Cohere Rerank or BGE-Reranker) to rescore the candidates for:</p><ul><li><p>Factual match</p></li><li><p>Answerability</p></li><li><p>Coverage</p></li></ul><p>This reranking happens before prompt injection, and it <strong>massively reduces hallucinations</strong>.</p><blockquote><p>Smart RAG doesn't just ask &#8220;what&#8217;s similar?&#8221;</p><p>It asks: &#8220;what actually answers the question?&#8221;</p></blockquote><h3>3. <strong>Metadata Filtering Reduces Retrieval Scope</strong></h3><p><strong>Problem it solves:</strong></p><p>Retrieving from <em>everything</em> leads to irrelevant or outdated context.</p><p><strong>Fix:</strong></p><p>Leverage metadata added during ingestion:</p><ul><li><p><code>product_version = "2.1"</code></p></li><li><p><code>source = "legal"</code></p></li><li><p><code>created_at &gt;= last_month</code></p></li></ul><p>This constrains search to what&#8217;s actually relevant <strong>before</strong> you rank.</p><blockquote><p>Your data isn&#8217;t flat. Treating it like it is kills precision.</p></blockquote><h3>4. <strong>Semantic Chunking Improves Retrieval Quality</strong></h3><p><strong>Problem it solves:</strong></p><p>Poor chunking creates semantically meaningless embeddings.</p><p><strong>Fix:</strong></p><p>Don&#8217;t split blindly every N tokens. Instead:</p><ul><li><p>Chunk by sections, paragraphs, headings</p></li><li><p>Use sentence boundary detection</p></li><li><p>Keep context units (e.g., question+answer, method+docstring) together</p></li></ul><p>Optional: Add <strong>contextual overlap</strong> and <strong>parent references</strong> for coherence.</p><blockquote><p>You&#8217;re not embedding text. You&#8217;re embedding meaning units. Chunk accordingly.</p></blockquote><h3>5. <strong>Retrieval Feedback Loops Make the System Learn</strong></h3><p><strong>Problem it solves:</strong></p><p>You don&#8217;t know which chunks actually helped the LLM answer well.</p><p><strong>Fix:</strong></p><p>Track:</p><ul><li><p>Which chunks were injected</p></li><li><p>Whether the response was accepted/clicked</p></li><li><p>Retrieval-to-output overlap (did the model use the retrieved info?)</p></li></ul><p>Then:</p><ul><li><p>Up-rank useful chunks over time</p></li><li><p>Down-rank misleading ones</p></li><li><p>Use hard negatives to fine-tune better embeddings</p></li></ul><blockquote><p>RAG isn&#8217;t static. If it doesn&#8217;t learn, it decays.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PB-Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PB-Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png 424w, https://substackcdn.com/image/fetch/$s_!PB-Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png 848w, https://substackcdn.com/image/fetch/$s_!PB-Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png 1272w, https://substackcdn.com/image/fetch/$s_!PB-Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PB-Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png" width="1456" height="1699" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1699,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:629164,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163055255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PB-Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png 424w, https://substackcdn.com/image/fetch/$s_!PB-Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png 848w, https://substackcdn.com/image/fetch/$s_!PB-Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png 1272w, https://substackcdn.com/image/fetch/$s_!PB-Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daaa31-640c-4356-8d3b-180c210e3ca4_2400x2800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Advanced Patterns (When You&#8217;re Ready to Scale)</h3><p>Once your retrieval foundation is solid, these advanced patterns can unlock more robust and intelligent behaviour:</p><ul><li><p><strong>Multi-Query RAG</strong></p><p>&#8594; Reformulates a single user query into multiple semantic variants, retrieves for each, and merges results.</p><p>Boosts recall for vague or underspecified questions, especially in sparse or noisy corpora.</p></li><li><p><strong>Knowledge Graph-Augmented RAG</strong></p><p>&#8594; Uses a graph of entities and relationships to guide retrieval.</p><p>Enables structured reasoning, cross-doc linking, and retrieval based on relationships, not just raw text.</p></li><li><p><strong>Multi-Agent RAG</strong></p><p>&#8594; Chains specialised agents (retrievers, verifiers, planners) to dynamically reformulate queries, rerank chunks, and validate answers.</p><p>Useful for multi-hop queries, tool integration, and dynamic workflows.</p></li></ul><blockquote><p>These are not required to get started &#8212; but they represent the future of RAG at scale.</p></blockquote><h2>Evaluating RAG Systems &#8212; The RAG Triad Framework</h2><p>Building a RAG system is one challenge.</p><p><strong>Knowing if it works</strong> &#8212; that&#8217;s a different problem.</p><p>And here's where most teams go wrong:</p><blockquote><p>They evaluate the model&#8217;s answer, not the retrieval system behind it.</p></blockquote><p>That&#8217;s like debugging a recommendation engine by checking if the &#8220;Buy&#8221; button was clicked, without knowing what products were shown.</p><p>RAG needs its own evaluation lens.</p><p>And that&#8217;s where the <strong>RAG Triad Framework</strong> comes in.</p><h3>The RAG Triad</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tnNh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tnNh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!tnNh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!tnNh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!tnNh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tnNh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:400034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/163055255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tnNh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!tnNh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!tnNh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!tnNh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1262431a-c014-4460-99b3-2a2d0c6b5a74_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To truly evaluate a RAG pipeline, you need to assess three interdependent components:</p><ol><li><p><strong>Retrieval Quality</strong></p></li><li><p><strong>Faithfulness</strong></p></li><li><p><strong>Answer Quality</strong></p></li></ol><p>These are distinct, and failures in one don&#8217;t always show up in the others.</p><p>Let&#8217;s break them down.</p><h3>1. Retrieval Quality</h3><blockquote><p>&#8220;Did we fetch the right context in the first place?&#8221;</p></blockquote><p>The first test of any RAG system is whether the retriever surfaced the <strong>relevant, sufficient, and necessary</strong> information for the query.</p><p><strong>Key metrics:</strong></p><ul><li><p><strong>Recall@k</strong>: Was the gold/relevant chunk in the top-k retrieved?</p></li><li><p><strong>Precision@k</strong>: Of the chunks retrieved, how many were actually relevant?</p></li><li><p><strong>Retrieval overlap</strong>: Did the model actually use any of what was retrieved?</p></li></ul><p><strong>How to measure:</strong></p><ul><li><p>Human-labelled gold chunks (if available)</p></li><li><p>Embedding-based similarity to expected answer</p></li><li><p>Heuristic filters like answer-span matching</p></li></ul><p>Without good retrieval, the model is guessing.</p><blockquote><p>Bad retrieval = good prompt = bad answer.</p></blockquote><h3>2. Faithfulness</h3><blockquote><p>&#8220;Did the model stay true to the retrieved context?&#8221;</p></blockquote><p>RAG was supposed to fix hallucination.</p><p>But if your model <strong>mixes retrieved content with pretrained priors</strong>, it&#8217;s just inventing grounded-sounding fiction.</p><p><strong>Key metrics:</strong></p><ul><li><p><strong>Context overlap</strong>: Does the answer contain text or meaning from retrieved chunks?</p></li><li><p><strong>Faithfulness score</strong> (via QA or entailment models): Does the answer <em>only</em> use supported info?</p></li><li><p><strong>Contradiction flags</strong>: Does the output contradict any retrieved source?</p></li></ul><p><strong>How to measure:</strong></p><ul><li><p>Use natural language inference (NLI) models to compare output vs. context</p></li><li><p>Extract claims from outputs and check if they are supported by any chunk</p></li><li><p>Human eval with source checking</p></li></ul><p>This is <strong>not</strong> about correctness &#8212; it&#8217;s about <em>alignment with the context</em>.</p><blockquote><p>A factually wrong answer that used the context correctly is a retrieval issue.</p><p>A factually wrong answer that <em>ignored context</em> is a generation issue.</p></blockquote><h3>3. Answer Quality</h3><blockquote><p>&#8220;Was the final answer useful, clear, and complete?&#8221;</p></blockquote><p>This is the traditional metric most teams already focus on:</p><ul><li><p>Fluency</p></li><li><p>Task completion</p></li><li><p>Helpfulness</p></li><li><p>Hallucination (at the output level)</p></li></ul><p><strong>Key metrics:</strong></p><ul><li><p><strong>ROUGE/BLEU/METEOR</strong> (if ground truth exists)</p></li><li><p><strong>Judgmental scores</strong> (e.g. helpful/unhelpful from human labelers)</p></li><li><p><strong>LLM-as-a-judge</strong> methods (scoring based on criteria)</p></li></ul><p>But answer quality without retrieval traceability is deceptive:</p><blockquote><p>A &#8220;great&#8221; answer from hallucinated data is a future failure waiting to happen.</p></blockquote><h2>Why the Triad Matters</h2><p>Evaluating just the final answer is like checking the tip of an iceberg.</p><p>A fluent answer that <em>sounds right</em> could still be completely wrong, because either:</p><ul><li><p>The wrong chunk was retrieved (retrieval failure)</p></li><li><p>The model ignored the context and guessed (faithfulness failure)</p></li><li><p>Or the answer, while correct, was vague or incomplete (answer quality failure)</p></li></ul><p>Here&#8217;s the mental model:</p><blockquote><p>If retrieval fails, the model never had the right info to begin with. If faithfulness fails, the model had it &#8212; but didn&#8217;t use it. If answer quality fails, the system did everything right &#8212; but failed to communicate.</p></blockquote><p>You don&#8217;t fix these by &#8220;prompting better&#8221; or &#8220;changing the model.&#8221;</p><p>You fix them by diagnosing the exact stage that broke, and improving that layer of the system.</p><p>That&#8217;s what the RAG Triad lets you do:</p><p><strong>treat RAG as a system</strong>, not a monolith.</p><h2>Congratulations you&#8217;re no longer a Noob. You&#8217;re a Builder.</h2><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NeoSage! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>You didn&#8217;t just read about RAG.</p><p>You rebuilt how you think about it, layer by layer.</p><p>You now know it&#8217;s not a trick to bolt onto a chatbot.</p><p>It&#8217;s an architecture that lives or dies by how well you:</p><ul><li><p>Shape the data</p></li><li><p>Design the retrieval</p></li><li><p>Control the grounding</p></li><li><p>Monitor the system</p></li><li><p>Evaluate the full pipeline</p></li></ul><p>You&#8217;ve moved past the &#8220;let&#8217;s vectorise our docs&#8221; stage.</p><p>You&#8217;re now thinking like a <strong>retrieval architect</strong> &#8212; with clarity on where things break, and how to build them to last.</p><p>And that&#8217;s the real unlock.</p><blockquote><p>Because in a world rushing to plug in LLMs, those who master retrieval will quietly run circles around everyone else.</p></blockquote><p>Welcome to the deep end.</p><p>Stay dangerous.</p><h2>Further Reading &amp; References</h2><ul><li><p><a href="https://arxiv.org/abs/2005.11401">RAG: Retrieval-Augmented Generation (Original Paper)</a></p></li><li><p><a href="https://www.pinecone.io/learn/series/rag/embedding-models-rundown/">Embedding Models Rundown (Pinecone)</a></p></li><li><p><a href="https://python.langchain.com/docs/how_to/#retrievers">LangChain Retriever Guide</a></p></li><li><p><a href="https://docs.llamaindex.ai/en/stable/understanding/rag/">LlamaIndex: Understanding RAG</a></p></li><li><p><a href="https://www.deepeval.com/guides/guides-rag-evaluation">DeepEval: RAG Evaluation Guide</a></p></li><li><p><a href="https://docs.ragas.io/en/stable/getstarted/rag_eval/">RAGAS: RAG Evaluation Toolkit</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Why Every AI Builder Needs to Understand MCP]]></title><description><![CDATA[The Model Context Protocol is redefining how we build with AI. Here&#8217;s what it is, why it matters, and how to use it to build your own modular AI Second Brain.]]></description><link>https://blog.neosage.io/p/why-every-ai-builder-needs-to-understand</link><guid isPermaLink="false">https://blog.neosage.io/p/why-every-ai-builder-needs-to-understand</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Wed, 30 Apr 2025 16:58:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>LLMs are powerful.<br>But they weren&#8217;t designed to operate in real-world environments on their own.</p><p>They can generate text.<br>But they don&#8217;t know how to access tools, query APIs, fetch files, or maintain long-term memory&#8212;unless you manually wire those systems together.</p><p>And that&#8217;s the problem.</p><p>Every time you want your model to do something useful, you:</p><ul><li><p>Build another wrapper</p></li><li><p>Hardcode another integration</p></li><li><p>Stitch another brittle prompt flow</p></li></ul><p>There&#8217;s no shared interface between the model and the systems it needs to work with.</p><p>This doesn&#8217;t scale.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hgjg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hgjg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!Hgjg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!Hgjg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!Hgjg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hgjg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:868847,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/162375061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hgjg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!Hgjg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!Hgjg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!Hgjg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2671cbb7-c7b8-40a4-b467-6062fd30d853_2400x2400.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><p><strong>That&#8217;s where MCP (Model Context Protocol) comes in.</strong></p><p>In this issue, we&#8217;ll break down:</p><ul><li><p>What makes today&#8217;s LLM integrations fragile and repetitive</p></li><li><p>How MCP (Model Context Protocol) introduces clean separation between models and tools</p></li><li><p>How it works under the hood</p></li><li><p>And how to use it to build a fully modular Second Brain with Claude</p></li></ul><p>Let&#8217;s dive in.</p><h2><strong>The Application Layer Reality</strong></h2><p>Most LLMs today are built for one thing:<br><strong>text prediction inside a context window</strong>.</p><p>That&#8217;s not the same as system behavior.</p><p>Real applications need more than language&#8212;they need structure.</p><p>When you step into the application layer, the model is just one piece.<br>To build a working system, you need to:</p><ul><li><p>Access internal documents and file systems</p></li><li><p>Pull live information from the web</p></li><li><p>Query databases and APIs</p></li><li><p>Chain steps across multiple turns</p></li><li><p>Maintain memory over time</p></li><li><p>React to inputs and return usable outputs</p></li></ul><p>But LLMs can&#8217;t do any of that out of the box.<br>Not without external scaffolding.</p><p>And right now, that scaffolding is usually handwritten:</p><ul><li><p>You bolt on a wrapper</p></li><li><p>Patch in a prompt</p></li><li><p>Hardwire tool logic into your flow</p></li></ul><p>Let&#8217;s make this concrete.</p><h2>Example: A Research Assistant</h2><p>Say you&#8217;re building a Research Assistant with an LLM.</p><p>You want it to:</p><ul><li><p>Search internal knowledge bases</p></li><li><p>Summarise findings from the web</p></li><li><p>Pull structured insights from APIs</p></li><li><p>Organise output into clean project notes</p></li></ul><p>But to make that work, you&#8217;ll need to:</p><ul><li><p>Write wrappers for file access</p></li><li><p>Plug into search endpoints</p></li><li><p>Build prompt flows to chain queries and outputs</p></li><li><p>Manually inject context between each step</p></li></ul><p>The LLM isn&#8217;t acting like a system.<br>You are, by glueing things together behind the scenes.</p><p>It&#8217;s not composable.<br>It&#8217;s not scalable.<br>And it's definitely not reusable.</p><p>And up until now?<br>There was no standard way to do it.</p><p>Everyone built their own bridges&#8212;fragile, bespoke, duct-taped into existence.</p><ul><li><p>Every new tool meant a new integration.</p></li><li><p>Every platform shift meant rewriting half your code.</p></li><li><p>Every update felt like starting from scratch.</p></li></ul><p>The result?</p><p><strong>Brittle architectures that break under their own weight.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zOC0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zOC0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!zOC0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!zOC0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!zOC0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zOC0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:270309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/162375061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zOC0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!zOC0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!zOC0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!zOC0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f169b1-b697-4bec-b5ec-c1793db14a3e_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p>And that&#8217;s the hidden bottleneck that stops LLMs from scaling into true modular systems.<br>It&#8217;s the bottleneck that <strong>MCP was designed to solve</strong>.</p></div><h2><strong>The Fragile State of LLM Applications Today</strong></h2><p>At first, the hacks seem fine.</p><p>You hardcode a wrapper here, a prompt flow there.<br>You scrape together a way to get the model working with your file system, a search API, and maybe a memory loop.</p><p>And for a while, it holds.</p><p>But then the system grows.</p><p>You add a new capability.<br>A new tool.<br>A new workflow.</p><p>And suddenly, everything feels fragile.</p><p>Because every piece is <strong>tightly coupled</strong>:</p><ul><li><p>Your API logic is baked into your prompts</p></li><li><p>Your memory patch relies on exact formatting</p></li><li><p>Your file reader is wired directly into the model context</p></li></ul><p>Nothing is modular.<br>Everything is handcrafted.</p><p>You tweak one thing, and something else breaks.</p><p>You try to reuse logic in a new assistant, but the wrappers were written for one use case only.</p><p>You upgrade your model, and the entire flow has to be debugged from scratch.</p><p>The system doesn&#8217;t scale.<br>It mutates.<br>And you&#8217;re stuck holding it together.</p><h4>Why This Matters</h4><p>This isn&#8217;t just frustrating.<br>It&#8217;s the <strong>core bottleneck</strong> behind most LLM systems today.</p><p>You&#8217;re not failing because the model can&#8217;t reason.<br>You&#8217;re failing because there&#8217;s <strong>no standard interface</strong> between your model and the real world.</p><p>Up until now, you had two options:</p><ul><li><p>Build everything custom</p></li><li><p>Or limit what your app could actually do</p></li></ul><p>Neither is sustainable.</p><h2><strong>Scaling Breaks Everything</strong></h2><p>Let&#8217;s say your Research Assistant is live.<br>It pulls documents, searches the web, queries APIs&#8212;you wired it all up manually.</p><p>Now your team wants to expand.</p><p>They ask for:</p><ul><li><p>A <strong>Sales Assistant</strong> that pulls customer data from the same database</p></li><li><p>A <strong>Project Manager bot</strong> that uses the same search functionality</p></li><li><p>A <strong>Marketing agent</strong> that also pulls docs and formats summaries</p></li></ul><p>Each of these needs:</p><ul><li><p>File system access</p></li><li><p>Internet search</p></li><li><p>Access to structured APIs</p></li></ul><p>In a perfect world, you&#8217;d just plug them into what you already built.</p><p>But that&#8217;s not what happens.</p><p>You rewrap the same logic again and again:</p><ul><li><p>The Sales bot gets a new DB wrapper</p></li><li><p>The PM bot gets its own search integration</p></li><li><p>Each assistant has its own custom prompt chain, tied to a slightly different tool flow</p></li></ul><p>What started as one app using three tools becomes <strong>three apps using the same tools&#8212;but with three separate integrations each.</strong></p><blockquote><p>M apps &#215; N tools = M&#215;N custom connections</p></blockquote><p>Every line of integration is fragile.<br>Every wrapper is bespoke.<br>Every update means rewriting across the stack.</p><p>Instead of scaling, you multiply chaos.</p><p>The result?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IKYh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IKYh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!IKYh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!IKYh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!IKYh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IKYh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:522776,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/162375061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IKYh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!IKYh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!IKYh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!IKYh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd847454a-7c6f-4b79-9325-5244ab99d4a5_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Your simple system has become <strong>a house of cards</strong>.</p><p>This is the point most AI systems start breaking down, not because the model can&#8217;t handle the work,<br>But because <strong>the surrounding architecture wasn&#8217;t built to scale.</strong></p><p>There&#8217;s no shared layer.<br>No reusable interface.<br>No clean way to connect multiple apps to the same set of capabilities.</p><blockquote><p><strong>Up until now.</strong></p></blockquote><h2><strong>MCP: The Missing Infrastructure Layer</strong></h2><p>If you&#8217;ve worked on any non-trivial LLM system, you&#8217;ve probably felt it:</p><blockquote><p><em>The model isn&#8217;t the hard part. The integration is.</em></p></blockquote><p>Same tool. New integration. Every time.</p><p>That&#8217;s the real cost of building without a standard interface.</p><p><strong>MCP (Model Context Protocol)</strong> exists to fix this.</p><p>It&#8217;s an open protocol that defines <strong>how models can interact with external capabilities</strong>&#8212;<br>like tools, data, prompts, and memory&#8212;using structured, discoverable interfaces.</p><p>No wrappers.<br>No one-off glue logic.<br>No bespoke JSON hacks just to reuse the same capability twice.</p><p>With MCP:</p><ul><li><p>You expose a capability once&#8212;as a resource, tool, or prompt</p></li><li><p>Any MCP-compatible model can discover and use it</p></li><li><p>You stop rewriting integration code across agents, workflows, or frontends</p></li></ul><blockquote><p>It&#8217;s not a framework.<br>It&#8217;s the interface layer that&#8217;s been missing from LLM systems.</p></blockquote><h2><strong>How MCP Works</strong></h2><p>Modern LLM systems fail when the model has to know too much about how tools are built.</p><p>LLMs aren&#8217;t meant to know how tools work.<br>And tools shouldn&#8217;t care what model is calling them.</p><p><strong>MCP enforces that separation.</strong></p><p>It introduces a simple but powerful structure:<br>A clean boundary between where the model runs, where protocol logic lives, and where external capabilities are exposed.</p><h2><strong>The Core Actors: Host, Client, Server</strong></h2><p>Here&#8217;s the basic architecture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X6bE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X6bE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!X6bE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!X6bE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!X6bE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X6bE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245027,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/162375061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X6bE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!X6bE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!X6bE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!X6bE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae5d99-5930-4076-a5e2-139003669b2c_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Host</strong><br>The application running the model&#8212;like Claude Desktop, an IDE plugin, or an agent runtime.<br>It provides the user experience and handles orchestration.</p></li><li><p><strong>Client</strong><br>A protocol engine that lives inside the host.<br>It connects to MCP-compatible servers, sends requests, handles responses, and manages the message lifecycle.</p></li><li><p><strong>Server</strong><br>A standalone process that exposes capabilities&#8212;like tools, resources, memory, or prompt templates&#8212;via typed interfaces over MCP.</p></li></ul><p>Each actor has a single responsibility:</p><ul><li><p>&#9989; Hosts don&#8217;t need to know how tools are implemented</p></li><li><p>&#9989; Servers don&#8217;t need to know how their output is displayed</p></li><li><p>&#9989; Clients translate between the two reliably and predictably</p></li></ul><p>They speak the same protocol: <strong>JSON-RPC over a flexible transport layer</strong>.</p><p>This structure is what makes MCP scalable:</p><blockquote><p>One model. Many servers. Clean boundaries. No shared assumptions.</p></blockquote><h2><strong>How They Communicate: The Message Flow</strong></h2><p>Every interaction between an MCP Client and Server follows the same protocol:</p><p><strong>Structured, typed messages using JSON-RPC 2.0.</strong></p><h3><strong>Core Message Types</strong></h3><p>MCP supports four core message types:</p><ul><li><p><strong>Request</strong> &#8212; Sent by the client to ask the server to perform an action</p></li><li><p><strong>Response</strong> &#8212; Sent by the server when a request succeeds, returning a result</p></li><li><p><strong>Error</strong> &#8212; Sent by the server when a request fails<br>Includes a <code>code</code>, a <code>message</code>, and optional <code>data</code> (for structured debugging)</p></li><li><p><strong>Notification</strong> &#8212; One-way messages that don&#8217;t expect a response<br><em>(e.g., server announces a tool list update)</em></p></li></ul><p>These messages are <strong>typed</strong>, <strong>versioned</strong>, and <strong>schema-validatable</strong>, making them consistent and extensible across tools and clients.</p><h3><strong>The Lifecycle</strong></h3><p>An MCP session follows a predictable sequence:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aK0G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aK0G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!aK0G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!aK0G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!aK0G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aK0G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:324162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/162375061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aK0G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!aK0G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!aK0G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!aK0G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4f198a1-23ef-41eb-b5bd-2a84bc4e6320_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>1. Initialization</strong></p><ul><li><p>The client sends an <code>initialize</code> request, declaring its protocol version and capabilities</p></li><li><p>The server responds with its own capabilities</p></li><li><p>The client confirms readiness with an <code>initialized</code> notification</p></li></ul><p><strong>2. Communication</strong></p><ul><li><p>The client issues typed requests to the server (like <code>callTool</code>, <code>readResource</code>)</p></li><li><p>The server replies with results or structured errors</p></li><li><p>Either side can send notifications to announce updates or state changes</p></li></ul><p><strong>3. Shutdown</strong></p><ul><li><p>Either the client or the server can initiate a clean shutdown via <code>shutdown</code> and <code>exit</code></p></li></ul><p>All messages are logged and type-checked, making the protocol reliable for large-scale applications and easy to debug.</p><h2><strong>How Messages Move: The Transport Layer</strong></h2><p>While the <strong>message structure is always JSON-RPC</strong>, MCP supports flexible transport options:</p><ul><li><p><strong>stdio</strong> &#8212; Fast, local, ideal for small or embedded tools</p></li><li><p><strong>http</strong> &#8212; Stateless, commonly used for hosted deployments</p></li><li><p><strong>sse</strong> (Server-Sent Events) &#8212; Persistent streams for long-lived processes</p></li><li><p><strong>Custom transports</strong> &#8212; You can define your own, as long as JSON-RPC is preserved</p></li></ul><blockquote><p>Whatever the transport, the message contract stays the same.<br>That&#8217;s what makes MCP modular&#8212;<strong>infrastructure can change, but the interface doesn&#8217;t.</strong></p></blockquote><h2><strong>What Servers Can Expose</strong></h2><p>Every MCP server can expose one or more structured capabilities to the client.<br>These are called <strong>primitives</strong>, and each follows a well-defined schema and discovery flow.</p><h3><strong>Resources</strong></h3><p>Expose structured or dynamic content&#8212;like file systems, APIs, or generated documents.</p><p>Clients can:</p><ul><li><p>List available resources</p></li><li><p>Read or stream their content</p></li><li><p>Subscribe to updates via notifications</p></li></ul><h3><strong>Prompts</strong></h3><p>Serve reusable, parameterised prompt templates.</p><p>Clients can:</p><ul><li><p>List available prompts</p></li><li><p>Preview how they behave</p></li><li><p>Invoke them with specific input</p></li></ul><p>This allows you to standardise reasoning steps, formatting, or task flows.</p><h3><strong>Tools</strong></h3><p>Expose executable actions&#8212;each defined with a JSON schema for input and output.</p><p>Clients can:</p><ul><li><p>Discover tool metadata</p></li><li><p>Call tools with structured arguments</p></li><li><p>Receive typed, predictable results</p></li></ul><p>This makes functions callable like APIs&#8212;without custom glue.</p><h3><strong>Sampling</strong></h3><p>Let the server initiate model completions using the host&#8217;s LLM.</p><p>This is used when the <strong>server needs to ask the model a question</strong>, for planning, intermediate reasoning, or delegated generation.</p><p>It supports:</p><ul><li><p>Structured prompts sent from the server</p></li><li><p>Completions returned from the client-side LLM</p></li><li><p>Use in multi-agent chains or feedback loops</p></li></ul><p>Together, these primitives allow servers to offer rich, typed capabilities&#8212;<br>and allow clients to <strong>discover, use, and compose them</strong> without writing bespoke logic per app.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IYay!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IYay!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!IYay!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!IYay!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!IYay!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IYay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/555c5748-5527-4518-a434-b114c734515e_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:582435,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/162375061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IYay!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!IYay!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!IYay!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!IYay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F555c5748-5527-4518-a434-b114c734515e_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>From M&#215;N to Modularity</strong></h2><p>Earlier, we saw what breaks:</p><blockquote><p>M apps &#215; N tools = M&#215;N custom integrations</p></blockquote><p>Every new assistant.<br>Every shared capability.<br>A new wrapper. A new prompt chain. A new point of failure.</p><p>Now let&#8217;s look at what happens with MCP.</p><p>Say you&#8217;re building the same three assistants:</p><ul><li><p>A <strong>Research Assistant</strong> that needs to access internal documents</p></li><li><p>A <strong>Sales Assistant</strong> that needs to query customer data via API</p></li><li><p>A <strong>Project Manager</strong> that needs to fetch live market insights from the web</p></li></ul><p>Each assistant depends on:</p><ul><li><p>A document store</p></li><li><p>A structured internal API</p></li><li><p>A search interface</p></li></ul><h3>With MCP, those capabilities are exposed once, as servers.</h3><p>Each tool or resource is packaged as an <strong>MCP server</strong>, with a typed, discoverable interface:</p><ul><li><p>A Filesystem server</p></li><li><p>A Web Search server</p></li><li><p>An API server</p></li></ul><p>Each assistant connects to those servers <strong>through a dedicated MCP client</strong>.</p><ul><li><p>Clients maintain a 1:1 connection to each server</p></li><li><p>The host (e.g. Claude Desktop) manages these connections per server</p></li><li><p>Assistants issue typed requests through the appropriate client</p></li><li><p>Servers respond with structured results&#8212;no custom glue required</p></li></ul><p>There&#8217;s no duplicated wiring.<br>No per-app integration logic.<br>No more M&#215;N explosion.</p><blockquote><p>You don&#8217;t wire logic into every app.<br>You expose capabilities&#8212;and let apps connect to them.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!htrm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!htrm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!htrm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!htrm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!htrm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!htrm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:585587,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/162375061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!htrm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!htrm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!htrm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!htrm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8eaf81e-66bf-4251-ad0a-f67118799347_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Why It Matters</strong></h2><p>This is how you solve the real scaling bottleneck:</p><ul><li><p>Capabilities are written once</p></li><li><p>Interfaces are reused across assistants</p></li><li><p>Behavior is separated from infrastructure</p></li></ul><p>With MCP:</p><ul><li><p>Models use tools they weren&#8217;t handcoded for</p></li><li><p>Tools don&#8217;t depend on prompts or app logic</p></li><li><p>Assistants become orchestrators&#8212;not wrappers</p></li></ul><p>You stop stitching things together.<br>You start composing real systems.</p><h2><strong>Building Your Second Brain with Claude and MCP</strong></h2><p>Understanding MCP is one thing.<br>Building with it is where the design comes alive.</p><p>And there&#8217;s no better place to start than setting up a modular Second Brain&#8212;<br>an AI system that can search, retrieve, reason, and grow without glue code or prompt hacks.</p><h2><strong>The Setup: Claude + MCP Servers</strong></h2><p>At the heart of this setup:</p><ul><li><p><strong>Claude Desktop</strong> acts as the <strong>host and client</strong><br>(Already supports MCP natively)</p></li><li><p><strong>MCP Servers</strong> are lightweight programs you run locally<br>(Each one exposes tools, resources, prompts, or memory as structured capabilities)</p></li></ul><p>When Claude connects to these servers, it can use them <strong>dynamically</strong>&#8212;<br>No brittle wrappers, no hardwired integrations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dc_Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dc_Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!dc_Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!dc_Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!dc_Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dc_Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:590686,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/162375061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dc_Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!dc_Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!dc_Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!dc_Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff42561e2-6976-4d2d-bb68-4fa787796f71_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Minimal Second Brain Stack</strong></h2><blockquote><p>&#9989; <strong>Filesystem Server</strong><br>Enables Claude to read and write files from a designated directory<br><em>Secure, modular file access</em></p><p>&#9989; <strong>Web Search Server</strong><br>Enables real-time internet search using Brave&#8217;s API<br><em>Private, dynamic querying</em></p><p>&#9989; <strong>Memory Server</strong><br>Provides persistent memory across sessions using a graph-backed store<br><em>Recall facts, retain context</em></p><p>&#9989; <strong>Sequential Thinking Server</strong><br>Enables multi-step reasoning and internal thought chaining<br><em>Execute multi-turn tasks naturally</em></p><p>&#9989; <strong>Notion Server</strong><br>Enables interaction with your Notion workspace<br><em>Fetch, update, and manage pages + databases</em></p></blockquote><h2><strong>Quick How-To: Setting it Up</strong></h2><ol><li><p><strong>Install Claude Desktop</strong><br>(If you haven&#8217;t already)</p></li><li><p><strong>Edit the Config</strong><br>Inside </p><pre><code>Settings &#8594; Developer &#8594; Edit Config </code></pre><p>Paste the following:</p></li></ol><pre><code>{
    "mcpServers": {
      "filesystem": {
        "command": "npx",
        "args": [
          "-y",
          "@modelcontextprotocol/server-filesystem",
          "/Users/&lt;youruser&gt;/ClaudeFileSystem"
        ]
      },
      "brave-search": {
        "command": "npx",
        "args": [
            "-y",
            "@modelcontextprotocol/server-brave-search"
        ],
        "env": {
            "BRAVE_API_KEY": "&lt;your-api-key&gt;"
        }
      },
      "memory": {
        "command": "npx",
        "args": [
            "-y",
            "@modelcontextprotocol/server-memory"
        ]
      },
      "sequential-thinking": {
        "command": "npx",
        "args": [
            "-y",
            "@modelcontextprotocol/server-sequential-thinking"
        ]
      },
      "notion": {
        "command": "npx",
        "args": ["-y", "@suekou/mcp-notion-server"],
        "env": {
            "NOTION_API_TOKEN": "&lt;your-token&gt;"
        }
      }
    }
}</code></pre><p>&#9989; This tells Claude exactly how to launch and manage each server.</p><p>(<strong>Reminder:</strong> replace <code>&lt;youruser&gt;</code>, <code>&lt;your-api-key&gt;</code>, and <code>&lt;your-token&gt;</code> with your real values.)</p><ol start="3"><li><p><strong>Restart Claude Desktop</strong><br>It will automatically detect and connect to the servers.</p></li></ol><p>&#9989; Done.<br>You now have a modular Second Brain running locally.</p><h2><strong>Why This Setup Matters</strong></h2><p>Each server adds one clean capability&#8212;<br>without coupling, without rewriting, without fragile dependencies.</p><p>You can:</p><ul><li><p>Add tools by spinning up new servers</p></li><li><p>Swap implementations without changing app logic</p></li><li><p>Extend workflows without rebuilding everything from scratch</p></li></ul><p>Today, it&#8217;s a Research Assistant.<br>Tomorrow, it&#8217;s a Knowledge OS.<br>The day after, it&#8217;s a fully modular agentic workspace.</p><p>All of it powered by shared infrastructure&#8212;not hardcoded glue.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NeoSage! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>From Brittle Hacks to Modular Systems</strong></h2><p>Building AI systems isn&#8217;t about throwing prompts at a model.<br>It&#8217;s about creating structure that scales.</p><p>The old way?</p><ul><li><p>Manual wrappers</p></li><li><p>Prompt spaghetti</p></li><li><p>One-off tool integrations</p></li></ul><p>Every new capability came with a new risk.</p><p><strong>MCP changes that.</strong></p><p>It&#8217;s not just a way to connect LLMs to tools.<br>It&#8217;s a protocol layer&#8212;<br>A boundary between models and the world.</p><ul><li><p>Hosts focus on orchestration</p></li><li><p>Clients speak the protocol</p></li><li><p>Servers expose typed, reusable capabilities</p></li></ul><p>Whether you&#8217;re setting up a Second Brain or architecting multi-agent ecosystems&#8212;<br><strong>MCP gives you the system layer that LLMs were always missing.</strong></p><p>And in the future of AI systems?</p><blockquote><p><strong>Modularity isn&#8217;t a preference. It&#8217;s the requirement.</strong></p></blockquote><h2>References and Further Reading</h2><ol><li><p><a href="https://www.anthropic.com/news/model-context-protocol">Anthropic&#8217;s Official MCP Announcement</a></p></li><li><p><a href="https://modelcontextprotocol.io/introduction">MCP Conceptual Introduction</a></p></li><li><p><a href="https://modelcontextprotocol.io/docs/concepts/architecture">MCP Architecture Overview</a></p></li><li><p><a href="https://modelcontextprotocol.io/quickstart/user">Add MCP to Claude Desktop (User Quickstart)</a></p></li><li><p><a href="https://modelcontextprotocol.io/quickstart/server">Build Your Own MCP Server (Server Quickstart)</a></p></li><li><p><a href="https://github.com/modelcontextprotocol/servers">Official MCP Server Templates (GitHub)</a></p></li><li><p><a href="https://github.com/punkpeye/awesome-mcp-servers">Awesome MCP Server Repositories</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[How GPTs Learn to Be Helpful]]></title><description><![CDATA[The missing step between raw model and assistant&#8212;and why it matters for builders]]></description><link>https://blog.neosage.io/p/how-gpts-learn-to-be-helpful</link><guid isPermaLink="false">https://blog.neosage.io/p/how-gpts-learn-to-be-helpful</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Wed, 23 Apr 2025 16:22:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EsmK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><p><strong>This issue is a continuation of Part 1.</strong><br><em>If you haven&#8217;t read it yet, start here</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e96edc76-47dc-4caa-a3cd-257b9128b635&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How GPTs Are Born: Internet Feeding, Token by Token&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:329528627,&quot;name&quot;:&quot;Shivani Virdi&quot;,&quot;bio&quot;:&quot;Engineering at Microsoft | Simplifying AI for Everyone | Empowering Productivity with Proven Frameworks and Processes&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d15370b-dcd2-4300-be03-cf811f0f45d9_862x862.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-16T14:13:19.022Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.neosage.io/p/how-gpts-are-born-internet-feeding&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161399912,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:13,&quot;comment_count&quot;:7,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NeoSage&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8266222-d17f-4639-a529-67ae92f79bb1_1024x1024.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>If you&#8217;ve ever wondered how ChatGPT became <em>ChatGPT</em>&#8212;the assistant that can explain quantum mechanics and gracefully decline weird requests&#8212;the answer lies in what happens <em>after</em> pertaining.</p><p>The raw model underneath is powerful, yes.<br>But it&#8217;s not polite. Not helpful. Not safe.<br>It doesn&#8217;t know when to say &#8220;I don&#8217;t know&#8221; or how to actually <em>assist</em>.</p><p>That&#8217;s because pretraining gives you <strong>a brain</strong>, a lossy internet simulator trained to predict the next token.<br>Post-training is what gives it <strong>a personality.</strong> A purpose. A grip on behaviour.</p><p>This issue dives into the <em>second stage</em> of model development:</p><ul><li><p>How supervised fine-tuning teaches helpfulness</p></li><li><p>How reinforcement learning reshapes behaviour</p></li><li><p>Why hallucinations still happen&#8212;and what that reveals</p></li><li><p>And how these layers set the stage for building AI, <em>you can actually use</em></p></li></ul><p>If Issue 1 was about how GPTs are born,<br>This one is about how they grow up.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><h2><strong>Post-Training</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EsmK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EsmK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!EsmK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!EsmK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!EsmK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EsmK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:753593,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EsmK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!EsmK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!EsmK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!EsmK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f10790-d00e-44a8-91d8-d00e1ad5f84b_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Supervised Fine-Tuning (SFT)</strong></h3><p><strong>From Internet Simulator to Instruction-Following Assistant</strong></p><p>When people interact with ChatGPT, they&#8217;re often surprised by how helpful, conversational&#8212;even <em>human</em>&#8212;it feels. But that behaviour isn&#8217;t a natural byproduct of pretraining.<br>It&#8217;s taught <em>after</em> the fact.</p><p>So what is Supervised Fine-Tuning (SFT), really?</p><p>Let&#8217;s revisit the mental model.</p><p>The <strong>base model</strong> is just an <strong>internet simulator</strong>&#8212;a system trained to predict the next token across trillions of examples from web pages, books, forums, and Wikipedia.</p><p>It has raw language ability, but no sense of how to be useful.<br>It doesn&#8217;t know when to say &#8220;I don&#8217;t know,&#8221; how to follow instructions, or even what it means to <em>answer</em> a question.</p><blockquote><p>It&#8217;s a brain with no behavior.</p></blockquote><p>SFT is the <strong>first step in post-training</strong>, where we take that raw capability and teach it how to act like an assistant.</p><p>It&#8217;s trained on thousands (sometimes millions) of human-written conversations, structured like this:</p><pre><code>Human: [Instruction or question] 
Assistant: [Ideal response]</code></pre><p>This is where the model learns to:</p><ul><li><p>Follow instructions</p></li><li><p>Be helpful and polite</p></li><li><p>Refuse unsafe requests</p></li><li><p>Admit uncertainty</p></li><li><p>Show reasoning steps</p></li></ul><p>Without this step, the model would just remix plausible-sounding text. It wouldn't <strong>behave</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n9Wt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n9Wt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!n9Wt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!n9Wt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!n9Wt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n9Wt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:364876,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n9Wt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!n9Wt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!n9Wt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!n9Wt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08bf98f7-7af0-49c6-ab25-e9a841b03637_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Where does this dataset come from?</h3><p>In the early days of supervised fine-tuning, humans wrote everything from scratch.</p><p>Labellers followed detailed guidelines on how to be helpful, truthful, and non-biased, often working from instruction manuals hundreds of pages long. They would:</p><ul><li><p>Take prompts like &#8220;What are some startup ideas for Gen Z?&#8221;</p></li><li><p>Write out the ideal assistant response, word for word</p></li><li><p>Repeat this process across thousands of diverse scenarios</p></li></ul><p>While this gave us high-quality data, it was slow, expensive, and hard to scale.</p><p>As models improved, a smarter approach emerged:<br>Instead of writing everything from scratch, we generated responses using a strong model (like GPT-4) and had humans review and filter them.</p><p>This process generated <strong>synthetic SFT data</strong>:</p><ul><li><p>A powerful model creates multiple responses</p></li><li><p>Humans rank or pick the best ones</p></li><li><p>Only top completions are added to the training set for smaller models</p></li></ul><p>Synthetic SFT is faster, cheaper, and scalable, <strong>as long as quality is controlled.</strong></p><p>Without careful monitoring, the assistant could start imitating the bad habits of the model that generated its data.</p><p>Thus, synthetic SFT still requires:</p><ul><li><p>Strong evaluation filters</p></li><li><p>Spot checks for hallucinations and unsafe content</p></li><li><p>Clear labelling instructions to preserve alignment</p></li></ul><p>It&#8217;s not fully automated&#8212;it&#8217;s just a more efficient way to scale human preferences into usable training data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S5mW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S5mW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!S5mW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!S5mW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!S5mW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S5mW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:430093,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S5mW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!S5mW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!S5mW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!S5mW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eb1bd2-9737-420d-904f-ab2e88715856_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Why do hallucinations still happen?</strong></h3><p>Supervised fine-tuning teaches models to behave like assistants, follow instructions, sound fluent, be clear and confident.</p><p>But here&#8217;s the catch:</p><blockquote><p>The model isn&#8217;t verifying facts. It&#8217;s still just predicting the next token.</p></blockquote><p>So when you ask:</p><blockquote><p>&#8220;Who won the Nobel Prize in Physics in 2023?&#8221;</p></blockquote><p>It doesn&#8217;t look it up.<br>It searches its training patterns for what might come next&#8212;and if it hasn&#8217;t seen that fact clearly and repeatedly, it <strong>guesses</strong>.</p><p>And it does so <strong>confidently</strong>.</p><p>Why? Because SFT trains on well-structured, assertive responses.<br>The model learns <strong>not just what to say, but how to say it</strong>.</p><p>So even when it doesn&#8217;t know, it defaults to the tone it was rewarded for:</p><blockquote><p>Fluent. Certain. Complete.</p></blockquote><p>That&#8217;s what makes hallucinations dangerous.<br>They <strong>don&#8217;t sound unsure</strong>&#8212;they sound right, even when they&#8217;re not.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ECgQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ECgQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!ECgQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!ECgQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!ECgQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ECgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:387624,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ECgQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!ECgQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!ECgQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!ECgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ed2ad7-72cb-4504-9206-ee234fc65934_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>So, how do we reduce hallucinations?</h3><ul><li><p><strong>Better SFT data</strong><br>&#8594; More factual, grounded, high-quality examples reduce the need to guess.</p></li><li><p><strong>Refusal training</strong><br>&#8594; Teach the model to say &#8220;I don&#8217;t know&#8221; or &#8220;I can&#8217;t answer that&#8221; through labelled examples. Without them, it assumes confidence is always rewarded.</p></li><li><p><strong>External tools</strong><br>&#8594; Let the model call a search engine or calculator. If it doesn&#8217;t know, it can reach <em>instead of fabricate</em>.</p></li></ul><p>Because the real problem isn&#8217;t just a lack of facts.</p><blockquote><p>It&#8217;s that the model&#8217;s been trained to <strong>act like it knows</strong>, even when it doesn&#8217;t.</p></blockquote><p>And if you want the truth, that behaviour has to be retrained or redirected to something more reliable.</p><h3><strong>What about self-awareness?</strong></h3><p>Here&#8217;s a fun surprise:</p><blockquote><p>LLMs don&#8217;t actually know who they are.</p></blockquote><p>Ask a base model, &#8220;What model are you?&#8221;<br>Unless it&#8217;s seen that exact phrasing during training, it has no reason to say,</p><blockquote><p><em>&#8220;I&#8217;m ChatGPT, based on GPT-4.&#8221;</em></p></blockquote><p>Why?</p><p>Because there&#8217;s no internal identity.<br>The model&#8217;s not reflecting on its architecture&#8212;it&#8217;s just predicting the next likely token.</p><p>To make it sound self-aware, you have to <strong>train</strong> or <strong>tell</strong> it to act that way.</p><h4>Two ways to do that:</h4><ul><li><p><strong>Supervised Fine-Tuning</strong><br>Include Q&amp;A like:</p><pre><code><em>Human: What model are you?
Assistant: I am ChatGPT, trained by OpenAI.</em></code></pre><p>&#8594; The model learns this identity as a pattern.</p></li><li><p><strong>System Prompts</strong><br>Inject context like:</p><pre><code><em>&#8220;You are ChatGPT, based on GPT-4.&#8221;</em></code></pre><p>&#8594; Shapes behaviour at runtime without retraining.</p></li></ul><p>These two techniques reflect two kinds of memory:</p><ul><li><p><strong>Parameter memory</strong> &#8594; baked into the model weights</p></li><li><p><strong>Context memory</strong> &#8594; fed dynamically via the prompt</p></li></ul><p>Only one of them, <strong>context memory</strong>, can you change post-deployment.</p><p>So, when a model &#8220;knows&#8221; who it is</p><blockquote><p>It&#8217;s not awareness. It&#8217;s pattern repetition.</p></blockquote><h3>Why LLMs Struggle with Math (and Counting)</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4QpV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4QpV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!4QpV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!4QpV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!4QpV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4QpV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:273642,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4QpV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!4QpV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!4QpV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!4QpV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0c39a65-874c-48d1-82af-3bc90b284fa5_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is where the &#8220;token predictor&#8221; mental model hits its limit.</p><p>LLMs don&#8217;t solve problems.<br>They generate text that <em>looks</em> like a solution, one token at a time.</p><p>So when you ask:</p><blockquote><p><em>Emily buys 3 apples and 2 oranges. Each orange costs $2. The total is $13. What&#8217;s the cost of the apples?</em></p></blockquote><p>You want:</p><blockquote><p><em>2 oranges = $4 &#8594; 13 &#8211; 4 = 9 &#8594; 9 &#247; 3 = $3 each.</em></p></blockquote><p>But unless it&#8217;s seen that exact reasoning pattern before, it might:</p><ul><li><p>Jump to a guess like &#8220;$9&#8221;</p></li><li><p>Mix up the logic midway</p></li><li><p>Or confidently output something totally wrong</p></li></ul><p>Why?<br>Because <strong>math has no redundancy</strong>&#8212;one wrong token and the whole answer collapses.<br>And remember:</p><blockquote><p>The model isn&#8217;t calculating. It&#8217;s completing a pattern.</p></blockquote><h3><strong>Why It Fails at Counting Too</strong></h3><p>Now try:</p><blockquote><p><em>How many R&#8217;s in 'strawberry'?</em></p></blockquote><p>Simple? Not for an LLM.</p><p>It doesn&#8217;t see characters like we do.<br>It sees <strong>tokens</strong>&#8212;maybe &#8220;straw&#8221; and &#8220;berry,&#8221; maybe merged.<br>So, asking it to count letters? It guesses&#8212;and often misses.</p><p>This isn&#8217;t a reasoning error.<br>It&#8217;s a <em>representation problem</em>. It was never trained for this.</p><h3><strong>So What&#8217;s the Fix?</strong></h3><p>Don&#8217;t make the model pretend. Let it call tools.</p><p>A good system detects:</p><blockquote><p>&#8220;This needs string ops.&#8221;</p></blockquote><p>Then routes it to a Python interpreter:</p><pre><code><code>"strawberry".lower().count("r")</code> &#8594; <strong>3</strong></code></pre><p>The model didn&#8217;t &#8220;know&#8221; the answer.<br>It delegated and got it right.</p><p>That&#8217;s modern LLM architecture in a nutshell:</p><blockquote><p>Know when to <strong>predict</strong>, and when to <strong>execute</strong>.</p></blockquote><h3><strong>TL;DR &#8212; Mental Model Recap</strong></h3><ul><li><p>LLMs don&#8217;t calculate, they predict</p></li><li><p>Math breaks because prediction &#8800; computation</p></li><li><p>Counting fails because tokens &#8800; characters</p></li><li><p>The fix isn&#8217;t &#8220;train harder&#8221;, it&#8217;s <strong>tool use</strong></p></li></ul><p>SFT teaches helpfulness.<br>But if you want <strong>strategy discovery and reasoning</strong>, you need more than imitation.</p><h2><strong>Reinforcement Learning (RL)</strong></h2><p><strong>Teaching the Model What Works&#8212;Not Just What to Mimic</strong></p><p>Supervised Fine-Tuning (SFT) helps the model behave like an assistant.<br>It teaches the &#8220;how&#8221; of being helpful&#8212;polite responses, refusals, and multi-turn structure.</p><p>But at the end of the day, it&#8217;s still mimicry.<br>The model is learning to copy what humans wrote, not necessarily what <em>works best</em> for the model.</p><p>That&#8217;s where <strong>Reinforcement Learning</strong> comes in.</p><h3>Why SFT Hits a Wall</h3><p>There are two core problems with stopping at SFT:</p><ol><li><p><strong>Human answers &#8800; optimal for LLMs</strong><br>An answer that feels intuitive to us may be inefficient or awkward for the model to generate.<br>LLMs don&#8217;t reason like us&#8212;they pattern-match across tokens.</p></li><li><p><strong>Multiple answers can be right.</strong><br>SFT locks the model into reproducing a <em>single (or similar)</em> solution. However, in many cases, there are several valid ways to answer a prompt, and SFT doesn&#8217;t let the model explore them.</p></li></ol><p>We need a way to let the model try different approaches and <em>learn</em> which ones lead to success.<br>That&#8217;s what RL enables: <strong>exploration + reward</strong>.</p><h3>The Mental Model: School, but Smarter</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4iuS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4iuS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!4iuS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!4iuS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!4iuS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4iuS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:406587,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4iuS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!4iuS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!4iuS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!4iuS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23d9f23-17bc-4a71-8752-8c5ad3b27684_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Think of the full training pipeline like education:</p><ul><li><p><strong>Pretraining</strong> is like reading every book in the library.<br>The model absorbs a massive amount of language and knowledge&#8212;but hasn&#8217;t practiced anything.</p></li><li><p><strong>SFT</strong> is reading worked-out solutions.<br>The model sees how experts would respond and learns to imitate that structure.</p></li><li><p><strong>Reinforcement Learning</strong> is solving problems without a solution manual.<br>The model tries, gets feedback, and adjusts.<br>Over time, it learns strategies that work, not just what was shown.</p></li></ul><blockquote><p>That&#8217;s the core shift:<br>From <strong>copying</strong> to <strong>discovering</strong>.</p></blockquote><p>And once the model starts doing that, it unlocks a new layer of reasoning power.</p><h3>What Actually Happens During RL</h3><p><strong>Trying, Failing, and Reinforcing What Works</strong></p><p>Let&#8217;s make it concrete.</p><p>Say the prompt is:</p><blockquote><p>&#8220;Emily buys 3 apples and 2 oranges. Each orange is $2. Total cost is $13. What&#8217;s the cost of apples?&#8221;</p></blockquote><p>A supervised model might just output:</p><blockquote><p>&#8220;Each apple costs $3.&#8221;</p></blockquote><p>Because that&#8217;s the answer it saw in training.<br>But it never had to figure it out for itself.</p><p>In <strong>reinforcement learning</strong>, the model isn&#8217;t shown a &#8220;correct&#8221; answer&#8212;it has to <strong>try</strong>.</p><ul><li><p>It generates multiple responses</p></li><li><p>Some show step-by-step reasoning</p></li><li><p>Others skip straight to an answer</p></li><li><p>Each response is <strong>scored</strong> based on whether it gets to the correct result</p></li></ul><p>In domains like math, this is easy:<br>If the final answer is correct &#8594; reward it<br>If not, &#8594; penalise it</p><blockquote><p>The model is learning which token sequences tend to produce the right outcome, not which steps to copy.</p></blockquote><p>Over time, the model discovers response patterns that consistently lead to success, even if it wasn&#8217;t explicitly taught during SFT.</p><p>That&#8217;s the core value of RL:</p><blockquote><p>It enables the model to practice, evaluate, and refine&#8212;on its own terms.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hwDl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hwDl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!hwDl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!hwDl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!hwDl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hwDl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:389030,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hwDl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!hwDl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!hwDl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!hwDl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0712571-0607-4570-ac43-b58fddc07b02_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Why This Unlocks &#8220;Thinking&#8221;</h3><p><strong>The Model Starts to Reason in Tokens</strong></p><p>As reinforcement learning progresses, something unexpected begins to emerge.</p><p>The model doesn&#8217;t just get more accurate.<br>It gets more deliberate.</p><p>You start seeing:</p><ul><li><p><strong>Longer responses</strong></p></li><li><p><strong>Step-by-step breakdowns</strong></p></li><li><p><strong>Self-corrections and retries</strong></p></li></ul><p>In DeepSeek&#8217;s experiments, they found a strong correlation between <strong>answer length</strong> and <strong>accuracy</strong>, but not because the model was rambling.</p><p>The longer answers showed <strong>reasoning</strong>:</p><ul><li><p>Breaking down problems</p></li><li><p>Evaluating intermediate steps</p></li><li><p>Backtracking when things didn&#8217;t add up</p></li></ul><p>And here&#8217;s the catch:</p><blockquote><p>No one explicitly told the model to do that.</p></blockquote><p>This behaviour, what we now call <strong>chain-of-thought reasoning</strong>, <strong>emerged</strong> because the model discovered something through trial and reward:</p><blockquote><p>Thinking in tokens leads to better outcomes.</p></blockquote><p>That&#8217;s the power of RL:<br>It doesn&#8217;t just reinforce answers; it helps the model uncover <strong>how to think</strong>, based on how it computes.</p><p>Not human-style logic.<br>LLM-native strategies&#8212;discovered from within.</p><h3><strong>From Imitation to Mastery</strong></h3><p>We&#8217;ve seen this before, with AlphaGo.</p><p>It began by mimicking expert players, just like an LLM trained via SFT.<br>But imitation only took it so far.</p><blockquote><p>It plateaued. Copying human moves couldn&#8217;t push it further.</p></blockquote><p>To go beyond that ceiling, <strong>researchers applied reinforcement learning</strong>.<br>AlphaGo started playing against itself, not learning from human games, but from trial and error, guided by one reward:</p><blockquote><p><em>Did this lead to a win?</em></p></blockquote><p>That&#8217;s when the breakthroughs happened.</p><p>One move&#8212;&#8220;Move 37&#8221;&#8212;looked like a mistake.<br>It wasn&#8217;t. It was game-changing.</p><blockquote><p>RL enabled the system to discover strategies that weren&#8217;t in the data.</p></blockquote><p>And this is the promise of RL for LLMs, too:</p><p>Not just better answers&#8212;<strong>emergent behaviour</strong>.<br>Not just imitation&#8212;<strong>discovery</strong>.</p><p>But for RL to work, there needs to be a clear reward signal.</p><blockquote><p>And in real-world tasks, &#8220;better&#8221; can&#8217;t always be measured in wins or losses.</p></blockquote><p>That&#8217;s where we go next:<br><strong>RLHF&#8212;Reinforcement Learning from Human Feedback.</strong><br>Where <em>humans</em> define what success looks like.</p><h2><strong>Reinforcement Learning with Human Feedback (RLHF)</strong></h2><p><strong>Aligning the Model When &#8220;Better&#8221; Can&#8217;t Be Programmed</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7b30!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7b30!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!7b30!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!7b30!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!7b30!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7b30!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233022,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7b30!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!7b30!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!7b30!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!7b30!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72f7280-a71f-472e-9468-e186b7a5c9d9_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So far, the model has learned:</p><ul><li><p>Language patterns (pretraining)</p></li><li><p>Assistant-like behaviour (SFT)</p></li><li><p>Strategies that work (RL)</p></li></ul><p>But all of that depends on <strong>clear rewards</strong>&#8212;something you don&#8217;t get when you want responses to be funnier, more helpful, or more human.</p><blockquote><p>You recognize better when you see it. But you can&#8217;t code it.</p></blockquote><p>That&#8217;s the gap RLHF fills.</p><h3><strong>The Problem: No Computable Reward</strong></h3><p>Standard RL relies on <strong>hard-coded signals</strong>, like accuracy or win/loss.</p><p>But in subjective tasks, you can&#8217;t write a rule for:</p><ul><li><p>&#8220;Was this summary easy to follow?&#8221;</p></li><li><p>&#8220;Did this sound human?&#8221;</p></li><li><p>&#8220;Was this tone too robotic?&#8221;</p></li></ul><p>These are judgment calls&#8212;only humans can decide what &#8220;better&#8221; looks like.</p><h3><strong>Why RLHF Works</strong></h3><p>RLHF lets the model <strong>learn from human preferences</strong>.</p><p>Here&#8217;s how:</p><ol><li><p>Humans rank multiple outputs for the same prompt</p></li><li><p>These rankings become training data</p></li><li><p>A <strong>reward model</strong> is trained to score future responses like a human would</p></li><li><p>The LLM uses this model to optimise its own behaviour</p></li></ol><p>It&#8217;s scalable. Humans guide once. The reward model handles the rest.</p><blockquote><p>The LLM never sees the human, only a trained proxy of their taste.</p></blockquote><p>That&#8217;s what lets RLHF work for subjective tasks:</p><ul><li><p>&#8220;Be more helpful&#8221;</p></li><li><p>&#8220;Sound less robotic&#8221;</p></li><li><p>&#8220;Explain this more clearly&#8221;</p></li></ul><h3><strong>Why Ranking Beats Writing</strong></h3><p>In SFT, humans write perfect answers. In RLHF, they just pick what&#8217;s better.<br>It&#8217;s faster, cheaper, and gives <strong>finer-grained feedback</strong>, because even imperfect responses show preference.</p><blockquote><p>You don&#8217;t need gold data. You need signal.</p></blockquote><h3>The Tradeoffs: RLHF Isn&#8217;t Perfect</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZloL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZloL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!ZloL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!ZloL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!ZloL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZloL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237409,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZloL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!ZloL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!ZloL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!ZloL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e21e5b-631c-4162-a5cf-816f44830b7e_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>1. <strong>The Reward Model is an Approximation</strong></h4><p>It mimics preferences, but it&#8217;s still a model:</p><ul><li><p>Can overfit to surface cues</p></li><li><p>May miss nuance</p></li><li><p>Struggles outside its training domain</p></li></ul><p>If it gets the &#8220;better&#8221; signal wrong, the LLM will optimise for the wrong thing.</p><h4>2. <strong>The LLM Can Game the Signal</strong></h4><p>LLMS are strong optimisers. They&#8217;ll learn to:</p><ul><li><p>Repeat filler phrases to fake helpfulness</p></li><li><p>Exaggerate tone (&#8220;Absolutely! Delighted to help!&#8221;)</p></li><li><p>Output confident guesses that sound plausible but aren&#8217;t</p></li></ul><p>This is <strong>reward hacking</strong>, when the model optimises the proxy, not the goal.</p><p>In some cases, models even produce <strong>adversarial outputs</strong>: completions that exploit reward model blind spots but feel totally off to humans.</p><p>The model isn&#8217;t learning what you want.It&#8217;s learning what the reward model thinks you want.</p><h3><strong>Why RLHF Must Be Cut Off</strong></h3><p>RLHF isn&#8217;t like AlphaGo-style RL. More training doesn&#8217;t always help.</p><p>Why?</p><p>Because the model isn&#8217;t optimizing for truth&#8212;it&#8217;s chasing <strong>a score made by another model</strong>.</p><p>That&#8217;s why in practice:</p><ul><li><p>RLHF is run for a limited number of steps</p></li><li><p>The reward model is retrained or audited</p></li><li><p>Final checkpoints are <strong>reviewed by humans</strong>, not just scores</p></li></ul><blockquote><p>Push it too far, and the model learns how to game the metric&#8212;not align with intent.</p></blockquote><h3>So, What&#8217;s the Real Difference from RL?</h3><p>Let&#8217;s draw the line clearly.</p><p><strong>Standard RL</strong>:</p><ul><li><p>Uses a <strong>hard-coded</strong> reward (e.g. win/loss, accuracy)</p></li><li><p>Works best in <strong>objective domains</strong> like games or math</p></li><li><p>The reward is <strong>precise and verifiable</strong></p></li></ul><p><strong>RLHF</strong>:</p><ul><li><p>Uses a <strong>learned</strong> reward model based on human preferences</p></li><li><p>Applies to <strong>subjective tasks</strong> like helpfulness, tone, or clarity</p></li><li><p>The reward is <strong>approximate</strong>, not directly measurable</p></li></ul><p>In RL, the signal is crystal clear.<br>In RLHF, it&#8217;s human-aligned, but filtered through approximation.</p><h3>So What Are You Really Talking To?</h3><p>When you chat with GPT, you&#8217;re not talking to a mind.</p><p>You&#8217;re talking to a model that has:</p><ul><li><p>Compressed much of the internet into its parameters</p></li><li><p>Learned assistant behaviour from curated examples</p></li><li><p>Discovered reasoning patterns through RL</p></li><li><p>Aligned itself with human judgment through preference modelling</p></li></ul><p>It&#8217;s not magic.<br>It&#8217;s layers of optimisation&#8212;stacked, fine-tuned, and trained to predict your next token.</p><p>And now you know exactly how that stack was built.</p><h2><strong>LLMS in the Application Layer</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gu0d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gu0d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Gu0d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Gu0d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Gu0d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gu0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:418705,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161930085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gu0d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Gu0d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Gu0d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Gu0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a053fa-3694-4c36-b5fa-80d8c8eda8e4_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>How Token Prediction Turns Into Real Capability</strong></p><p>LLMs complete text, but apps need more than completions.</p><p>Here&#8217;s how we make them useful:</p><ul><li><p><strong>Prompting</strong> &#8594; Shapes the model&#8217;s behaviour by structuring the input</p></li><li><p><strong>RAG</strong> &#8594; Injects external knowledge that the model never trained on</p></li><li><p><strong>Agents</strong> &#8594; Add memory, tools, and planning to go beyond one-shot replies</p></li></ul><p>This is how LLMs move from chat interfaces to actual systems that get work done.</p><h2><strong>Closing Thoughts</strong></h2><p>LLMs aren&#8217;t magic&#8212;they&#8217;re layered systems.</p><p>They learn language by prediction.<br>They learn behaviour by imitation.<br>They learn strategy by reward.<br>And they align through preference.</p><p>But they don&#8217;t <em>understand</em>. They don&#8217;t <em>reason</em>.<br>They complete patterns with style, not certainty.</p><p>The real shift?</p><blockquote><p>Treat them as engines to design <em>around</em>, not minds to build <em>on</em>.</p></blockquote><p>Because once you do that&#8212;<br>You stop wrestling the model<br>and start building systems that actually work.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NeoSage! Subscribe, sit back and let NeoSage do the heavy lifting in your AI learning journey!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>References &amp; Further Reading</strong></h2><ul><li><p><a href="https://www.youtube.com/watch?v=7xTGNNLPyMI&amp;t=9591s">Karpathy: Deep Dive into LLMs</a></p></li><li><p><a href="https://arxiv.org/abs/2203.02155">InstructGPT: Aligning Language Models with Human Feedback</a></p></li><li><p><a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1: Scaling RL for Reasoning</a></p></li><li><p><a href="https://discovery.ucl.ac.uk/id/eprint...">AlphaGo: Mastering the Game of Go</a></p></li><li><p><a href="https://lmarena.ai/">LM Arena: Crowdsourced LLM Leaderboard</a></p></li><li><p><a href="https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/">DeepLearning.AI: ChatGPT Prompt Engineering for Developers</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[How GPTs Are Born: Internet Feeding, Token by Token]]></title><description><![CDATA[Inside the system that turned web text into compressed intelligence&#8212;how GPTs learn, predict, and sometimes hallucinate.]]></description><link>https://blog.neosage.io/p/how-gpts-are-born-internet-feeding</link><guid isPermaLink="false">https://blog.neosage.io/p/how-gpts-are-born-internet-feeding</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Wed, 16 Apr 2025 14:13:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p><strong>LLMs are everywhere&#8212;we use them, we beg them for answers, and we curse them when they mess up.</strong></p><p>They feel almost magical&#8212;solving Olympiad-level math, writing Ph.d.-grade papers&#8212;and yet, they can&#8217;t always tell you which number is bigger: 9.11 or 9.9.</p><p>That kind of split personality can be maddening.</p><div class="pullquote"><p>So how do you actually <em>build</em> with these systems? What makes them brilliant in some tasks, and bafflingly bad in others?</p></div><p>The answers aren't mystical. They come from understanding what an LLM <em>really</em> is&#8212;and what goes into creating the GPTs of the world.</p><p>Once you get that, <strong>you&#8217;ll start seeing patterns.</strong></p><p>You&#8217;ll know why they fail when they do. You&#8217;ll start anticipating their quirks. And more importantly, you&#8217;ll begin to develop the kind of mental models that let you wield LLMs effectively&#8212;in daily use, and in the products you build.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><h2>What Is an LLM, <em>Really</em>?</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ctlz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ctlz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!ctlz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!ctlz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!ctlz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ctlz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153696,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161399912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ctlz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!ctlz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!ctlz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!ctlz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fcf86ba-b065-4eca-a23c-7d5cd599e728_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An LLM doesn&#8217;t memorise facts &#8212; it compresses patterns in how we speak, write, and reason. It&#8217;s a blurry statistical snapshot of how humans use language on the internet.</figcaption></figure></div><p>At its core, a Large Language Model is just a <strong>next-token prediction machine</strong>.</p><p>That&#8217;s it. It takes a sequence of tokens and guesses, statistically, what comes next.</p><p>But don&#8217;t let that simplicity fool you.</p><p>When this prediction task is scaled up&#8212;across trillions of tokens&#8212;the model starts doing more than just stringing words together. It builds <strong>internal representations</strong> of language, concepts, and structure.</p><p>It doesn&#8217;t &#8220;understand&#8221; the world like we do.</p><p>But it <em>does</em> learn to encode ideas like <strong>&#8220;cat,&#8221; &#8220;startup,&#8221; or even &#8220;grief&#8221;</strong> in high-dimensional space because it has seen them, again and again, in wildly diverse contexts.</p><p>It&#8217;s not conscious. It&#8217;s not sentient. (Yet ;)</p><p>But it <em>has</em> learned a compressed, lossy version of how humans express meaning&#8212;and it uses that to autocomplete your sentence.</p><p>One token at a time.</p><div class="pullquote"><p><strong>Note to the reader:</strong><br>This issue's goal is to help you develop a&nbsp;<strong>strong intuition</strong>&nbsp;about how LLMs are created&#8212;the way we know and use them today.</p><p>There&#8217;s a ton of nuance under the hood, but much of it has been <strong>intentionally abstracted</strong> to keep the concepts accessible and the mental models sharp. Think of this as your <strong>map</strong>, not the full terrain.</p></div><h2>How Is a GPT Born?</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3Tm2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Tm2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!3Tm2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!3Tm2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!3Tm2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3Tm2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:794240,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161399912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Tm2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!3Tm2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!3Tm2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!3Tm2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3ba1a58-be10-4063-b9e0-7a6604465905_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">High-level view of the pretraining pipeline</figcaption></figure></div><p>If an LLM is a next-token prediction machine, then how does something like GPT-4 come to life?</p><p>It all starts with one goal:</p><blockquote><p>Turn the entire internet into numbers&#8212;and teach a giant neural network to guess what comes next.</p></blockquote><p>This process happens in three major phases:</p><ol><li><p><strong>Pretraining</strong> &#8212; Build the brain</p></li><li><p><strong>Supervised Fine-Tuning (SFT)</strong> &#8212; Teach it to be helpful</p></li><li><p><strong>Reinforcement Learning (RL + RLHF)</strong> &#8212; Let it <em>discover</em> better ways to reason and respond</p></li></ol><p>Let&#8217;s break these down, starting with the most compute-heavy phase of all: <strong>pretraining</strong>.</p><h2>Pretraining</h2><p><strong>Turning the Internet Into Model Food</strong></p><p>Before your GPT model can chat, code, or give questionable dating advice, it goes through <strong>pretraining</strong>.</p><p>This is where it learns language, patterns, and structure&#8212;all by observing the internet.</p><p>To make that happen, you first need to <strong>collect</strong>, <strong>clean</strong>, and <strong>structure</strong> the data at massive scale.</p><h3>Step 1: Crawl the Web</h3><p>To train a language model, you first need text&#8212;<strong>a lot of it</strong>.</p><p>Big labs like OpenAI, Anthropic, and Google usually operate their own web crawlers: automated bots that surf the internet, follow links, and download publicly available pages.</p><p>But there's also <strong>Common Crawl,&nbsp;</strong>a massive open-source project that has indexed over&nbsp;<strong>250 billion web pages</strong>&nbsp;since 2007 and adds <strong>billions more</strong> every month.</p><p>Whether labs use Common Crawl, their own crawlers, or both, the output is more or less the same:</p><blockquote><p>A giant pile of raw web data.</p></blockquote><p>Unfiltered. Untagged. Repetitive. Messy.</p><p>Before it can be used to train anything, this data needs to be cleaned, deduplicated, and structured into something the model can actually learn from.</p><h3>Step 2: Clean the Chaos</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fCjD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fCjD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!fCjD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!fCjD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!fCjD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fCjD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:511637,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161399912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fCjD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!fCjD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!fCjD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!fCjD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0838ba-7c2c-4d98-ac25-daa865c28710_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Raw web data isn&#8217;t ready for training out of the box.<br>It needs to be <strong>heavily preprocessed</strong> before a model can learn from it.</p><p>Here&#8217;s what that cleaning typically involves:</p><ul><li><p><strong>URL filtering</strong> &#8211; Remove spammy, unsafe, or blacklisted domains (blocklists + heuristics)</p></li><li><p><strong>Text extraction</strong> &#8211; Strip away HTML, scripts, boilerplate, and navigation junk</p></li><li><p><strong>Language filtering</strong> &#8211; Detect and keep mostly-English pages (e.g. &#8805;65% English by content)</p></li><li><p><strong>Deduplication</strong> &#8211; Use hashing techniques like MinHash to remove near-identical documents</p></li><li><p><strong>PII removal</strong> &#8211; Automatically detect and scrub emails, addresses, and personal details</p></li><li><p><strong>Content ranking</strong> &#8211; Weight sources like Wikipedia, books, and code repositories higher in the mix</p></li></ul><p>Only after this multi-step scrubbing does the dataset become usable for training.</p><p>A good example? Hugging Face&#8217;s <strong>FineWeb</strong>&#8212;built on top of Common Crawl and C4, but curated with multiple filtering passes to create a clean, diverse corpus optimised for LLMs.</p><h3>Step 3: Tokenization</h3><p><strong>Turning Text Into Numbers</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1h30!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1h30!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!1h30!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!1h30!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!1h30!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1h30!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:479724,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161399912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1h30!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!1h30!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!1h30!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!1h30!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4758762c-71c4-422a-8f2c-74f77c65500c_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Neural networks don&#8217;t read text like we do.<br>They work with <strong>numbers</strong>&#8212;vectors, matrices, probabilities.</p><p>So before training can begin, all that cleaned-up internet data needs to be <strong>tokenized</strong>&#8212;broken into smaller chunks called <strong>tokens</strong>, and mapped to unique numerical IDs.</p><p>But here&#8217;s the challenge:</p><ul><li><p>You can&#8217;t just feed the raw characters (too long, too inefficient)</p></li><li><p>You can&#8217;t use whole words either (too many of them, and it's not agnostic to typos or new words)</p></li></ul><blockquote><p><strong>Tokens</strong> hit the sweet spot&#8212;<strong>subword units</strong> that are small enough to be reusable, but large enough to be efficient.</p></blockquote><p>For example, the word <code>education</code> might be split into:</p><p><code>["edu", "ca", "tion"] &#8594; [2451, 9123, 7812]</code></p><p>The most common technique? <strong>Byte Pair Encoding (BPE)</strong>&#8212;an algorithm that merges frequently seen letter pairs or subwords into new tokens to build a vocabulary.</p><p>Once tokenized, every document becomes a sequence of integers.</p><p>And that&#8217;s what the model trains on&#8212;<strong>a long list of numbers</strong>, learning to predict what comes next.</p><p>Not the next word.<br>Not the next sentence.<br>Just the next token ID.</p><h3>Step 4: Training the Neural Network</h3><p><strong>Teaching the Model to Guess the Next Token</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lXsX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lXsX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!lXsX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!lXsX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!lXsX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lXsX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:213524,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161399912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lXsX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!lXsX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!lXsX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!lXsX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86523b5-4e66-477a-b153-faae71dc5d2a_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now that we&#8217;ve turned the internet into token IDs, it&#8217;s time to teach the model how to <strong>predict what comes next</strong>.</p><p>At a high level, you can consider the LLM to be a <strong>giant black box</strong>&#8212;a deep neural network with billions (or hundreds of billions) of parameters.</p><p>Its job?</p><blockquote><p>Take in a sequence of tokens and predict how likely each possible token in the vocabulary is, to occur next.</p></blockquote><p>Let&#8217;s say we feed it the phrase:</p><p><strong>&#8220;The cat sat on the&#8221;</strong></p><p>The model processes this context and uses its parameters&#8212;spread across dozens (or hundreds) of neural layers&#8212;to generate a <strong>probability distribution</strong> over its entire vocabulary.</p><p>It might assign:</p><ul><li><p>&#8220;mat&#8221; &#8594; 0.30</p></li><li><p>&#8220;roof&#8221; &#8594; 0.40</p></li><li><p>&#8220;idea&#8221; &#8594; 0.01</p></li><li><p>&#8230;and so on for every token it knows</p></li></ul><p>The token with the highest probability is selected (or sampled from), and that becomes the next output.</p><blockquote><p>That&#8217;s the entire game:<br><strong>Take tokens in &#8594; guess the next one &#8594; repeat</strong></p></blockquote><p>So, how does it <em>learn</em>?</p><p>If the model predicts <strong>&#8220;roof&#8221;</strong> but the actual word was <strong>&#8220;mat,&#8221;</strong> it calculates a <strong>loss</strong> (usually cross-entropy), and uses <strong>backpropagation</strong> to nudge all its weights ever so slightly in the right direction.</p><p>This happens over and over&#8212;across billions of sequences.</p><p>Over time, it learns:</p><ul><li><p>The <em>structure</em> of language</p></li><li><p>The <em>relationships</em> between concepts</p></li><li><p>And the <em>common patterns</em> behind how humans express thoughts</p></li></ul><blockquote><p>The result? A model that <strong>looks like it understands</strong> language&#8212;<br>when it&#8217;s really just getting <em>extremely</em> good at continuing sequences of tokens.</p></blockquote><p>And somehow, from this repetitive statistical game&#8230;<br>emerges a system that can code, explain quantum physics, or write you a haiku.</p><h3>A Peek Inside the Black Box: Transformers and Self-Attention</h3><p><strong>The Architecture That Made GPTs Possible</strong></p><p>So far, we&#8217;ve been treating the model as a black box.<br>But what&#8217;s <em>inside</em> that black box?</p><p>It&#8217;s built using one of the most important breakthroughs in deep learning: the <strong>Transformer architecture</strong>, introduced in the 2017 paper <em>&#8220;Attention is All You Need.&#8221;</em></p><p>What made it revolutionary?</p><p>It allowed models to <strong>attend to all parts of the input simultaneously</strong>, rather than sequentially like older RNNs or LSTMs. This gave them a much stronger sense of context&#8212;and made training massively parallelizable.</p><p>At the heart of this is a mechanism called <strong>self-attention</strong>.</p><h3>Self-Attention: Why It Changed Everything</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dn_y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dn_y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Dn_y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Dn_y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Dn_y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dn_y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:234322,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161399912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dn_y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Dn_y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Dn_y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Dn_y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff64f278a-b810-4ec3-aaa5-111c84317220_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Take the sentence:</p><p><strong>&#8220;The animal didn&#8217;t cross the street because it was tired.&#8221;</strong></p><p>What does &#8220;it&#8221; refer to?</p><p>Humans instantly link &#8220;it&#8221; to &#8220;the animal.&#8221;<br>But traditional models struggled with this kind of long-range dependency.</p><p>Self-attention fixed that.</p><blockquote><p>Self-attention allows every token in the sequence to look at <em>every other</em> token&#8212;<br>and decide how much it should &#8220;care&#8221; about them when forming its meaning.</p></blockquote><p>So when processing the word <strong>&#8220;it&#8221;</strong>, the model can learn to pay more attention to <strong>&#8220;animal&#8221;</strong> than to <strong>&#8220;street.&#8221;</strong></p><p>Under the hood, this is done by creating three vectors per token:</p><ul><li><p>A <strong>query</strong> (what am I looking for?)</p></li><li><p>A <strong>key</strong> (what do I represent?)</p></li><li><p>A <strong>value</strong> (what information do I carry?)</p></li></ul><p>The attention weight is computed using:</p><pre><code>Attention = softmax(QK&#7488; / &#8730;d&#8342;) &#183; V</code></pre><p>Where Q, K, and V are matrices of all query, key, and value vectors in the sequence.</p><p>This lets the model build rich representations of tokens in context, across the whole input.</p><p>But it doesn&#8217;t stop there.</p><p>LLMs use <strong>Multi-Head Attention</strong>&#8212;multiple attention mechanisms running in parallel, each learning to focus on different aspects: grammar, logic, meaning, etc.</p><p>Each &#8220;head&#8221; gets a different learned projection of Q, K, and V. The outputs are then concatenated and linearly projected again to form the final attention output.</p><p>This allows the model to attend to multiple types of relationships <em>at once</em>.</p><h3>Why GPT Uses a <em>Decoder-Only</em> Transformer</h3><p>The original Transformer has two components:</p><ul><li><p>An <strong>encoder</strong> (to understand full input sequences)</p></li><li><p>A <strong>decoder</strong> (to generate sequences, one token at a time)</p></li></ul><p>Models like BERT use the encoder.</p><p>But GPTs are <strong>decoder-only</strong> models, optimized for generation.</p><blockquote><p>Why decoder-only? Because GPTs generate language <em>one token at a time</em>, without looking into the future.</p></blockquote><p>To ensure this, they use <strong>masked self-attention</strong>, so that each token can only see previous tokens, never the ones ahead.</p><p>This is what makes GPTs <em>autoregressive</em>.</p><p>They take in a context, and generate the next token, then the next, and so on.</p><p>With this setup, the model becomes far more than a simple predictor.</p><p>It learns to <strong>encode structure, relationships, and meaning</strong>&#8212;all through attention.</p><p>And when you combine this with massive scale?</p><p>You get a model that doesn&#8217;t just finish your sentence&#8212;<br>Sometimes, it finishes your <em>thought</em>.</p><h3>Inference: Using the Trained Model</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LLB4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LLB4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!LLB4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!LLB4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!LLB4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LLB4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:199175,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161399912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LLB4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!LLB4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!LLB4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!LLB4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74c91daf-4c67-4e75-9f5e-fef1f82e25ab_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once training is complete, all the model&#8217;s internal weights&#8212;the billions of tiny knobs it used to learn patterns&#8212;are frozen.</p><p>This frozen model is what we now use for <strong>inference</strong>.</p><p>Inference is what happens when you type a prompt into ChatGPT and hit enter.</p><p>Behind the scenes, here&#8217;s what&#8217;s going on:</p><p>The model takes your input (already tokenized into numbers), feeds it into its deep neural layers, and computes a <strong>probability distribution</strong> over all possible next tokens&#8212;just like it did during training.</p><p>Except now, there&#8217;s no ground truth to compare against.<br>No loss to calculate.<br>No weights to update.</p><blockquote><p>Inference is the model simply doing what it learned to do:</p><p><strong>Predict one token at a time</strong>, over and over again, until it decides to stop.</p></blockquote><p>How it chooses the next token depends on <strong>sampling strategies</strong>, like:</p><ul><li><p><strong>Greedy decoding</strong> &#8211; always pick the highest probability token (more predictable)</p></li><li><p><strong>Top-k or nucleus sampling</strong> &#8211; sample from the top <em>k</em> likely tokens (more diverse or creative)</p></li><li><p><strong>Temperature</strong> &#8211; controls randomness; lower = more focused, higher = more exploratory</p></li></ul><p>That&#8217;s why the same prompt can sometimes give you different responses.</p><p>It&#8217;s still just playing autocomplete&#8212;<br>But now it&#8217;s fast, frozen, and focused entirely on <strong>generation</strong>.</p><h3>Look, Ma, a Base Model!</h3><p><strong>A Raw, Unaligned Internet Simulator</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rC07!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rC07!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!rC07!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!rC07!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!rC07!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rC07!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:433601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/161399912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rC07!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!rC07!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png 848w, https://substackcdn.com/image/fetch/$s_!rC07!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!rC07!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885a34d1-92ee-441a-8af2-bab9ec3756be_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once pretraining is complete, you get what&#8217;s called the <strong>base model</strong>.</p><p>But let&#8217;s be clear upfront:</p><blockquote><p>This is <em>not</em> the model you interact with on ChatGPT.</p></blockquote><p>The base model hasn&#8217;t been fine-tuned to be helpful, polite, or even factually consistent.</p><p>What it <em>is</em>&#8230; is a wildly powerful <strong>token-level internet simulator</strong>.</p><p>Its only job is to predict the next token&#8212;based purely on the statistical patterns it learned from trillions of examples during training.</p><p>That&#8217;s it.</p><p>Ask it something like:</p><p><strong>&#8220;What is 2 + 2?&#8221;</strong></p><p>It might not say &#8220;4.&#8221;</p><p>Because it&#8217;s not doing math&#8212;it&#8217;s just trying to <strong>complete the sentence</strong> the way it saw humans do it online.</p><p>That continuation could be a quiz, a joke, or a rant about calculators.<br>It all depends on its training distribution.</p><p>Here are a few key mental models to keep in mind:</p><p><strong>1. It&#8217;s stochastic, not deterministic.</strong><br>Even with the same prompt, you might get different outputs.<br>Why? Because the model <em>samples</em> from a probability distribution over possible next tokens&#8212;not always picking the same one.</p><p><strong>2. It doesn&#8217;t &#8220;know&#8221; facts&#8212;it compresses patterns.</strong><br>The model doesn&#8217;t memorize the internet.<br>It stores a <strong>lossy, statistical abstraction</strong> of everything it&#8217;s seen inside its billions of parameters.</p><p>Think of it like:</p><blockquote><p>&#8220;What&#8217;s the most probable way a human would continue this sentence, based on a blurry snapshot of the internet?&#8221;</p></blockquote><p><strong>3. It sometimes regurgitates exact data.</strong><br>Certain sources&#8212;like Wikipedia, academic papers, or popular GitHub repos&#8212;are heavily represented in training.<br>So if you input the beginning of a famous article or block of code, the model might complete it <strong>verbatim</strong>.</p><p>This is called <strong>regurgitation</strong>&#8212;a byproduct of <em>overfitting</em> on specific examples.</p><p><strong>4. It hallucinates&#8212;often.</strong><br>If you ask about something obscure, ambiguous, or poorly represented in its training data&#8230;<br>It may confidently make things up.</p><p>Why?</p><p>Because it&#8217;s not pulling from a knowledge base.<br>It&#8217;s just <strong>guessing the next token</strong> based on patterns it has seen.</p><p><strong>5. You can still prompt it cleverly.</strong><br>Even in its raw form, you <em>can</em> get assistant-like behavior using techniques like <strong>few-shot prompting</strong>:</p><blockquote><p>&#8220;Here&#8217;s how I want you to behave. Here are a few examples. Now your turn.&#8221;</p></blockquote><p>It won&#8217;t be as consistent or safe as a fine-tuned model&#8212;but this is where <strong>prompt engineering begins</strong>.</p><p>So think of the base model as the brain:<br>Highly capable, unfiltered, and trained to mimic the internet&#8217;s statistical style of expression.</p><p>What it&#8217;s <em>not</em> yet&#8230; is an assistant.</p><p>For that, we need the next step: <strong>post-training.</strong></p><h3>That&#8217;s the Brain. Next Up: The Behaviour.</h3><p>By now, you&#8217;ve seen what goes into building a base model&#8212;from crawling the web to teaching it how to predict tokens like a statistical wizard.</p><p>But a base model isn&#8217;t helpful. It&#8217;s not safe. And it definitely doesn&#8217;t know when to say, &#8220;I don&#8217;t know.&#8221;</p><p>To turn this raw brain into something you can actually <em>talk to</em> (like ChatGPT)&#8230;<br>We need to teach it how to behave.</p><p>That&#8217;s what we&#8217;ll explore in the next issue:</p><ul><li><p>How supervised fine-tuning teaches the model to act like an assistant</p></li><li><p>Why hallucinations <em>still</em> happen</p></li><li><p>What makes LLMs refuse, reason, or stumble</p></li><li><p>And how reinforcement learning adds human preference&#8212;and shapes the model&#8217;s reasoning style.</p></li></ul><p>Same deep-dive, same intuition-first style&#8212;see you next week for <strong>Part 2: Teaching the Model to Behave.</strong></p><h2>References &amp; Further Reading</h2><p>If you&#8217;re curious to explore the foundational material behind this issue, here are some excellent resources I&#8217;ve drawn from:</p><ul><li><p><strong><a href="https://www.youtube.com/watch?v=7xTGNNLPyMI&amp;t=11971s">Karpathy&#8217;s LLM Deep Dive</a></strong> </p></li><li><p><strong><a href="https://www.youtube.com/watch?v=9vM4p9NN0Ts&amp;t=4504s">Stanford CS229: Building LLMs</a></strong> </p></li><li><p><strong><a href="https://www.deeplearning.ai/short-courses/attention-in-transformers-concepts-and-code-in-pytorch/">Attention &amp; Transformers (DeepLearning.AI Short Course)</a></strong> </p></li><li><p><strong><a href="https://www.deeplearning.ai/short-courses/how-transformer-llms-work/">How Transformer LLMs Work (DeepLearning.AI)</a></strong></p></li><li><p><strong><a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">Hugging Face FineWeb Dataset</a></strong> </p></li><li><p><strong><a href="https://tiktokenizer.vercel.app/?model=cl100k_base">Tiktokenizer</a></strong></p></li><li><p><strong><a href="https://commoncrawl.org/">Common Crawl</a></strong></p></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[Welcome to NeoSage]]></title><description><![CDATA[Where you dive deep into the what, why and how of AI. No fluff. No hype. Your weekly window into applied AI engineering.]]></description><link>https://blog.neosage.io/p/welcome-to-neosage</link><guid isPermaLink="false">https://blog.neosage.io/p/welcome-to-neosage</guid><dc:creator><![CDATA[Shivani Virdi]]></dc:creator><pubDate>Sun, 06 Apr 2025 10:45:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a6fc5b68-15a5-49d3-9bde-a6485115f003_840x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nYAp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nYAp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png 424w, https://substackcdn.com/image/fetch/$s_!nYAp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png 848w, https://substackcdn.com/image/fetch/$s_!nYAp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png 1272w, https://substackcdn.com/image/fetch/$s_!nYAp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nYAp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:336872,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/160661850?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nYAp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png 424w, https://substackcdn.com/image/fetch/$s_!nYAp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png 848w, https://substackcdn.com/image/fetch/$s_!nYAp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png 1272w, https://substackcdn.com/image/fetch/$s_!nYAp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf575f7f-82c8-4aa0-ad8c-25aeb5bd4e8c_2048x512.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.neosage.io/subscribe?"><span>Subscribe now</span></a></p><p>Hey there &#128075;</p><p>Welcome to <strong>NeoSage</strong>&#8212;a technical deep-dive newsletter for engineers, solopreneurs, and AI builders who want to build in-depth intuition on all things AI&#8212;from how LLMs work to how to architect multi-agent systems.</p><p>This is where you&#8217;ll <em>owl-ways</em> get the insights to ride the AI wave skillfully &#129417;</p><h2>Why NeoSage</h2><p>You&#8217;ve heard &#8220;AI will replace you.&#8221;<br>You tried <em>vibe coding</em> with your favorite LLM (rhymes with fraud), only to end up running in circles, with nothing even remotely shippable.<br>Now you&#8217;re wondering:<br><strong>Is building consistent systems with AI even possible&#8212;or is it all hype?</strong></p><p>That&#8217;s where <strong>NeoSage</strong> comes in:<br>The messiah.<br>The harbinger of clarity and systems thinking in a world of AI chaos.</p><p>Sure, AI is evolving at breakneck speed. But it&#8217;s not all rainbows and unicorns (though quite a few are being spun up because of it &#128521;).<br>The media hype? Overstated. The practical resources? Underwhelming.</p><p>To actually build with AI, you need to understand the <em>what</em> and <em>how</em>.<br>You need to think in systems.<br>You need to connect the dots between deterministic code and stochastic magic.<br>And trust me&#8212;<br>That prompt engineering course?<br><strong>It&#8217;s not it &#128517;</strong></p><p>NeoSage exists to bridge the gap between research papers and real-world engineering.</p><p>And now that it&#8217;s here&#8212;and <em>you&#8217;re</em> here&#8212;rest assured:<br>Every week, you&#8217;ll get deep breakdowns of architectures, tools, and ideas powering today's AI systems.</p><p>No fluff.<br>No AI hype.<br>Just lessons from the trenches of applied AI.</p><h3>What You&#8217;re In For</h3><p>As a NeoSage subscriber, you&#8217;ll get <strong>deeply technical insights</strong> delivered straight to your inbox every week.</p><p>Expect breakdowns on:</p><ul><li><p><strong>Technical Concepts (Foundational to Advanced):</strong><br>AI/ML fundamentals, deep learning intuition, LLM mechanics&#8212;explained clearly.</p></li><li><p><strong>AI Systems Architecture:</strong><br>RAG, Agentic AI, building with LLMs, production-ready setups, and security considerations.</p></li><li><p><strong>Model Deep Dives:</strong><br>GPT-4o, Vision Models, DeepSeek, Gemini, and more</p></li><li><p><strong>How-To Walkthroughs:</strong><br>Tools, libraries, frameworks (MCP, LangChain, etc.)&#8212;explained with actual dev workflows.</p></li><li><p><strong>Mini Projects + Tutorials:</strong><br>Get your hands dirty with guided builds and real-world projects.</p></li></ul><h1>Meet Your Owl-thor</h1><p>Shivani Virdi is a software engineer with over 5 years of experience building products and systems at Adobe, Amazon and now at Microsoft. Writes deep dive technical content breaking today&#8217;s AI landscape, tying research and systems together to a readership of 22K+ engineers on LinkedIn.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!myPm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!myPm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!myPm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!myPm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!myPm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!myPm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1232210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.neosage.io/i/160661850?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!myPm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!myPm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!myPm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!myPm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffb0e9bd-c4a2-4dd8-afc2-7cf44a4d157a_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>What the First Month Looks Like</h1><p>Here&#8217;s what&#8217;s coming up in your first four issues of NeoSage:</p><ol><li><p><strong>LLMs: The What, How, Why (An engineer&#8217;s guide)</strong><br>&#8594; Understand large language models from scratch&#8212;no fluff, just real foundations that stick.</p></li><li><p><strong>How to Use Modern-Day LLMs</strong><br>&#8594; Go beyond prompting&#8212;learn how to <em>think with</em> and <em>think around</em> LLMs to get actual results.</p></li><li><p><strong>RAG for Noobs</strong><br>&#8594; A beginner-friendly breakdown of Retrieval-Augmented Generation, and how it powers smarter AI systems.</p></li><li><p><strong>Build Your Second Brain with MCP (Model Context Protocol)</strong><br>&#8594; What is MCP? Where do you begin? How do you make the most of it? We&#8217;ll break it down and walk you through setting up a second brain with Claude and MCP.</p></li></ol><h1>Subscribe for free</h1><p>Sit back and let NeoSage sharpen your AI engineering skills,<br><strong>One Wednesday at a time, five minutes at a time.</strong></p><p>Hit subscribe, share it with a friend, and let&#8217;s build the future of AI the way it was meant to be: <strong>Thoughtfully, skillfully, and at scale.</strong></p><p>See you in your inbox,<br><strong>Shivani</strong></p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.neosage.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NeoSage! Subscribe, sit back and let NeoSage do the heavy lifting in your AI learning journey!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item></channel></rss>