The Flywheel Architecture: Building AI Systems That Improve Themselves

Recently I heard about how both Anthropic (https://www.anthropic.com/institute/recursive-self-improvement) and Cursor use their own models to improve their software, and it got me wondering: is this the next phase of where AI is headed? I started brainstorming what a system like that might look like for a team that isn't building frontier models, and that's what led to this post.

Anthropic calls the broader concept Recursive Self-Improvement (RSI): models write code to build better models and eventually design their own successors. What I'm interested in operates at a different scope. The industry calls it an AI Flywheel - a self-reinforcing loop where AI outputs feed back into improving the system. The concept isn't new, but what I haven't found is a clear, practical architecture for actually building one: the eval framework, the improvement loops, the human oversight, and the state management. In this post I lay out a framework that can start with simple prompt refinement and eventually extend to fine-tuning.

I see the flywheel working in two phases:

Prompt improvement via eval frameworks: Every output gets scored against a rubric by an independent judge model. Failure patterns accumulate, and the system uses those patterns to refine its own prompts, with no human editing required. This loop is fast and comparatively cheap, since it involves no model training.
Model improvement via fine-tuning: The same evaluation data that drives prompt refinement also produces labeled training pairs. Outputs that pass become positive examples; outputs that fail become negative examples. Together they feed techniques like DPO (Direct Preference Optimization) to fine-tune the model itself, creating a deeper loop of improvement that goes beyond what better prompts alone can achieve.

Humans are still in the loop, but the role shifts. Instead of writing code or manually tweaking prompts for every iteration, they provide oversight, confirming or correcting the AI's own evaluations, ensuring quality, and making sure the model isn't overfitting to the evaluation results. The goal isn't to remove humans; it's to make their involvement count where it matters most.

AI Maturity Model

Another concept I've been thinking about is how mature teams are in their use of AI, and I think a clear spectrum is emerging:

Level 1: All-human development. Developers write everything; AI isn't in the workflow.
Level 2: AI-assisted. AI suggests and reviews, but humans author and ship. This is GitHub Copilot, AI code review, chat-based Q&A. The developer is still driving.
Level 3: AI co-development. AI drafts code, humans review and commit. Tools like Cursor and Claude Code live here. The AI is producing real output, but a human decides what ships. Some might call this the level where Vibe Coding begins.
Level 4: AI-led delivery. AI commits code, humans gate pull requests. The human role shifts from writing to reviewing.
Level 5: Autonomous agents. Agents deliver end-to-end; humans set goals and boundaries. AI agents usually fill many roles in this architecture, including BA, Frontend Engineer, Backend Engineer, Architect, QA, and Code Reviewer.

Most companies and teams today are somewhere between Levels 2 and 3. Some organizations are still stuck at Level 1, writing all their own code with no AI in the loop whatsoever. Usually these are heavily regulated industries or teams that simply haven't invested the time to explore the capabilities yet.

At each level, humans hand off more control to AI. But it's not really about losing control, it's about building trust with the models and their outputs. Each level requires a new kind of trust: trusting suggestions at Level 2, trusting drafts at Level 3, trusting commits at Level 4, trusting full autonomy at Level 5.

But here's what every level from 1 through 5 has in common: the only thing that improves over time is the humans. They get better at prompting, better at reviewing, better at setting boundaries. The AI itself stays the same. Same model, same prompts, same quality on day 100 as on day 1. Every bad output a developer rejects, every subtle fix they make, that signal vanishes. It never feeds back into the system.

So what would it look like if it did?

Introducing the Flywheel Architecture - Level 6

Level 6 is where AI starts improving itself. Not in the sci-fi sense, in a practical, measurable, buildable sense. The system produces outputs, evaluates its own quality, identifies where it's falling short, and uses that signal to get better. Every interaction feeds back into the system. Every failure teaches it something. The more you use it, the better it gets.

I see three loops within a Flywheel Architecture, each operating at a different depth. You don't need all three. Loop 1 alone can transform the quality of your AI outputs. But each loop builds on the one before it, and the eval framework is the constant that ties them all together.

Loop 1: Prompt Refinement - The Easiest Win

This is where most teams should start, and honestly, where most teams will get more value than they expect.

Think about it this way. If you've ever written a system prompt, a skill, or an instruction set for an AI model, you've gone through the same cycle manually: write the prompt, test it, see where it falls short, tweak it, test again. You're the eval framework. You're the feedback loop. And every improvement lives in your head until you remember to update the prompt.

Now imagine that loop running automatically. Every time the system produces an output, an independent judge model scores it against a rubric, a structured set of quality criteria you define. Did it follow the format? Did it handle edge cases? Did it produce valid output? The judge doesn't just pass or fail, it explains why something failed. "Schema used incorrect field type for rich text content." "Component missed the responsive breakpoint visible in the screenshot."
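
To make the rubric-and-judge idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Criterion` and `Verdict` types, the `evaluate` function, and especially `stub_judge`, which stands in for a real call to an independent judge model and only knows how to check one criterion.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Criterion:
    name: str
    description: str


@dataclass
class Verdict:
    passed: bool
    # Maps each failed criterion's name to the judge's explanation.
    failures: dict[str, str] = field(default_factory=dict)


# A judge takes (output, criterion) and returns (ok, reason).
Judge = Callable[[str, Criterion], tuple[bool, str]]


def evaluate(output: str, rubric: list[Criterion], judge: Judge) -> Verdict:
    """Score one output against every criterion and keep the failure reasons."""
    failures = {}
    for criterion in rubric:
        ok, reason = judge(output, criterion)
        if not ok:
            failures[criterion.name] = reason
    return Verdict(passed=not failures, failures=failures)


def stub_judge(output: str, criterion: Criterion) -> tuple[bool, str]:
    """Hypothetical stand-in for a call to an independent judge model."""
    if criterion.name == "valid_json":
        try:
            json.loads(output)
        except json.JSONDecodeError as exc:
            return False, f"Output is not valid JSON: {exc.msg}"
    # Criteria the stub does not understand are treated as passing.
    return True, ""


RUBRIC = [
    Criterion("valid_json", "Output parses as JSON."),
    Criterion("follows_format", "Output follows the requested format."),
]

# The closing brace is missing, so the valid_json criterion fails with a reason.
verdict = evaluate('{"title": "Hello"', RUBRIC, stub_judge)
print(verdict.passed, verdict.failures)
```

The important design choice is that a verdict carries reasons, not just a boolean. Those reasons are what the rest of the loop feeds on.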

Those failure reasons accumulate. After a few dozen outputs, patterns emerge: 35% of failures cite the same structural error. 20% miss the same category of content. The system takes those patterns and generates a revised prompt that addresses them, no human editing the prompt, no manual iteration. The new prompt version goes active, and the system tracks whether pass rates improve.
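
The pattern-finding step could be sketched roughly like this. It assumes the judge's free-text reasons have already been mapped to short category labels (that mapping is not shown), and the function names, the 15% threshold, and the wording of the revision request are all hypothetical.

```python
from collections import Counter


def failure_patterns(
    failure_log: list[str], min_share: float = 0.15
) -> list[tuple[str, float]]:
    """Return (category, share of all failures) for categories that recur often."""
    counts = Counter(failure_log)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (category, n / total)
        for category, n in counts.most_common()
        if n / total >= min_share
    ]


def build_revision_request(current_prompt: str, patterns: list[tuple[str, float]]) -> str:
    """Assemble the text sent to the model that rewrites the prompt."""
    lines = [f"- {category} ({share:.0%} of failures)" for category, share in patterns]
    return (
        "Revise the prompt below so it addresses these recurring failures.\n"
        + "\n".join(lines)
        + "\n\nCurrent prompt:\n"
        + current_prompt
    )


# 20 logged failures: 7 share one cause, 4 share another, the rest are one-offs.
log = (
    ["wrong_field_type"] * 7
    + ["missed_breakpoint"] * 4
    + [f"one_off_{i}" for i in range(9)]
)
patterns = failure_patterns(log)
print(build_revision_request("Generate a component schema.", patterns))
```

The one-off failures fall below the threshold and stay out of the revision request, which keeps the prompt from growing a special case for every stray mistake.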

This is fast and free (relatively). No model training, no GPU costs, no ML infrastructure. Just better instructions, driven by real failure data instead of gut feel. If you're building AI skills, agents, or any kind of structured AI workflow, this loop alone can dramatically improve your output quality, and it compounds over time.

I'll show this concept in action in a future post. There is also room to go beyond a single prompt: you could let the LLM generate dynamic prompts tailored to specific scenarios. Humans bring the unique inputs, while the AI models refine, diverge, and create prompts that drive actual outcomes.

Loop 2: Model Fine-Tuning - Teaching the Model New Skills

Prompt refinement hits a ceiling. At some point, the instructions are as clear as they can be, and the model still can't produce what you need. It doesn't understand your domain's conventions deeply enough. It makes the same subtle mistakes no matter how precisely you describe what you want. This is where fine-tuning comes in.

The same evaluation data that drives Loop 1 also produces labeled training pairs. Outputs that pass the quality gate become positive examples: this is what good looks like. Outputs that fail become negative examples: this is what to avoid. Together they form the training signal for techniques like DPO (Direct Preference Optimization), where the model learns the contrast between good and bad outputs for the same input.
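
As a rough sketch of how evaluation records could turn into preference pairs: group outputs by the input that produced them, then pair each passing output with each failing one. The `Evaluated` type and `build_preference_pairs` are hypothetical names. The `prompt` / `chosen` / `rejected` keys follow a layout commonly used for preference datasets, but check what your training library actually expects.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Evaluated:
    prompt: str   # the input the model was given
    output: str   # what the model produced
    passed: bool  # the quality gate's verdict


def build_preference_pairs(records: list[Evaluated]) -> list[dict[str, str]]:
    """Pair every passing output with every failing output for the same prompt."""
    by_prompt = defaultdict(lambda: {"chosen": [], "rejected": []})
    for record in records:
        bucket = "chosen" if record.passed else "rejected"
        by_prompt[record.prompt][bucket].append(record.output)

    pairs = []
    for prompt, group in by_prompt.items():
        # Prompts with only passes or only failures yield no pairs.
        for chosen, rejected in product(group["chosen"], group["rejected"]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


records = [
    Evaluated("Build a hero banner", "schema v2 (valid)", True),
    Evaluated("Build a hero banner", "schema v1 (wrong field type)", False),
    Evaluated("Build a footer", "footer schema (valid)", True),
]
print(build_preference_pairs(records))
```

Note the consequence of "for the same input": a prompt that has only ever passed, like the footer above, contributes nothing until a failing output for it shows up.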

Fine-tuning doesn't replace the base model. Techniques like LoRA produce a lightweight adapter, a small set of learned weights that sit on top of the frozen base model and encode your domain-specific knowledge. The base model provides general intelligence. The adapter provides the specialized skill. Training an adapter is cheap (a few dollars per run on cloud GPUs) and fast (hours, not days), which means the flywheel can iterate quickly.
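
To get a feel for why an adapter is lightweight, here is a back-of-the-envelope sketch. LoRA leaves a weight matrix frozen and trains two small low-rank matrices alongside it. The 4096 x 4096 layer size and the rank of 8 are illustrative assumptions, not figures from any particular model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one adapted weight matrix."""
    frozen = d_out * d_in              # the base weight matrix, left untouched
    trainable = rank * (d_out + d_in)  # two low-rank factors: d_out x r and r x d_in
    return frozen, trainable


frozen, trainable = lora_param_counts(4096, 4096, rank=8)
print(f"{trainable:,} trainable vs {frozen:,} frozen ({trainable / frozen:.2%})")
# prints: 65,536 trainable vs 16,777,216 frozen (0.39%)
```

Under these assumptions, the adapter for that layer is well under one percent of the size of the weights it modifies.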

For teams that need even deeper improvement, reinforcement learning (RL) takes this further. Instead of learning from fixed pairs of good and bad examples, the model actively generates new outputs during training, gets scored in real time by the eval framework, and adjusts its weights based on the reward signal. This is how Cursor built their Composer model: reinforcement learning inside real coding environments, where the environment itself provides the feedback.

The key insight across all of these techniques: the rubric-based eval framework you built for Loop 1 generates the training data for Loop 2 as a natural byproduct. You don't need a separate data labeling step. The flywheel produces its own training signal through normal usage.

Loop 3: Models Building Models - The Frontier

This is Anthropic's Recursive Self-Improvement territory. Instead of improving prompts or fine-tuning adapters, the eval data and learnings feed into designing the next generation of the model itself: architecture changes, training recipe improvements, novel optimization techniques.

This is where things get expensive. Training a frontier LLM from scratch costs tens of millions to hundreds of millions of dollars. That's not a side project. That's not even most companies. Loop 3 is for the Anthropics, the Googles, the OpenAIs of the world, organizations with the compute budgets and research teams to build foundation models.

That said, Loop 3 doesn't have to mean building a massive LLM. A frontier model could be used as the architect for improving smaller, specialized machine learning models: not language models, but vision models, classification models, recommendation systems. The eval-and-improve pattern is the same. The scale is just different. A frontier LLM acting as the researcher (analyzing eval results from a smaller specialized model, proposing architecture tweaks, and testing them) is Loop 3 at an accessible scale.

For most teams, Loop 3 is worth understanding but not worth building. Loops 1 and 2 are where the practical value lives today. But as the cost of compute continues to drop and open-weight models continue to improve, the line between "fine-tuning someone else's model" and "building your own" will keep blurring.

What Makes a Flywheel Architecture Work

Two things keep the flywheel from being just a fancy script: humans and state.

Humans keep it honest. A fully autonomous loop has a dangerous failure mode: if the rubric has a blind spot, the system optimizes toward that flaw and compounds it, overfitting to its own judgment. Humans break that cycle by reviewing the AI's evaluations, not every output. The AI scores and explains its reasoning. The human confirms or overrides. Both signals get stored. The practical pattern is an AND gate: approved only when the AI judge and the human both agree. Early on, humans review everything. As trust builds, the gate relaxes to spot-checking. But it never goes away entirely. Until AI systems can truly think for themselves, humans will remain where they are needed most, providing creativity and critical thinking.
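
One way the AND gate and the spot-checking relaxation might look in code is sketched below. The `Review` type, the function names, and the `review_rate` knob are hypothetical. The sketch also assumes that in spot-check mode an output nobody sampled is approved on the AI verdict alone, which is one reading of "the gate relaxes".

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    ai_passed: bool
    ai_reason: str
    human_passed: Optional[bool] = None  # None means no human looked at it
    human_reason: str = ""


def needs_human_review(review_rate: float, rng: random.Random) -> bool:
    """Sample outputs for review: 1.0 reviews everything, 0.1 spot-checks ~10%."""
    return rng.random() < review_rate


def approved(review: Review) -> bool:
    """AND gate: the AI judge must pass it, and so must any human who reviewed it."""
    if not review.ai_passed:
        return False
    if review.human_passed is None:
        # Assumption: unsampled outputs in spot-check mode rely on the AI verdict.
        return True
    return review.human_passed


def is_override(review: Review) -> bool:
    """True when a human disagreed with the judge; worth storing with the reason."""
    return review.human_passed is not None and review.human_passed != review.ai_passed


rng = random.Random(42)
review = Review(ai_passed=True, ai_reason="All criteria met")
if needs_human_review(review_rate=1.0, rng=rng):
    review.human_passed = False
    review.human_reason = "Judge missed a broken responsive breakpoint"
print(approved(review), is_override(review))
```

Overrides are the most valuable records this produces, because each one marks a place where the rubric or the judge has a blind spot.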

State makes it compound. A flywheel without memory is just a script you run manually. The system needs to track its own evolution: prompt versions with pass rates, evaluation scores linked to the prompt and model that produced them, training pairs accumulating over time, and every human override with its reasoning. That means a database, not files in a repo. The automation loops need to read and write state programmatically: the prompt refinement loop reads failure patterns from evaluation history, the training trigger fires when enough pairs accumulate, and a dashboard shows whether the system is actually improving or just churning. Without persistent state, you have a collection of tools. With it, you have a system that knows where it is and what to do next.
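
A minimal sketch of that state, using SQLite from Python's standard library, might look like the following. The table and column names, the `pass_rates` and `should_trigger_training` helpers, and the 500-pair threshold are all assumptions for illustration. A production system would likely want a server database and migrations.

```python
import sqlite3

SCHEMA = """
CREATE TABLE prompt_versions (
    id INTEGER PRIMARY KEY,
    body TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE evaluations (
    id INTEGER PRIMARY KEY,
    prompt_version_id INTEGER NOT NULL REFERENCES prompt_versions(id),
    model TEXT NOT NULL,
    passed INTEGER NOT NULL,   -- AI judge verdict: 1 or 0
    failure_reason TEXT,
    human_passed INTEGER,      -- NULL until a human reviews it
    human_reason TEXT
);
CREATE TABLE training_pairs (
    id INTEGER PRIMARY KEY,
    prompt TEXT NOT NULL,
    chosen TEXT NOT NULL,
    rejected TEXT NOT NULL,
    used_in_run INTEGER        -- NULL until a training run consumes the pair
);
"""


def pass_rates(conn: sqlite3.Connection) -> dict[int, float]:
    """Pass rate per prompt version: the number a dashboard would plot."""
    rows = conn.execute(
        "SELECT prompt_version_id, AVG(passed) FROM evaluations "
        "GROUP BY prompt_version_id ORDER BY prompt_version_id"
    ).fetchall()
    return dict(rows)


def should_trigger_training(conn: sqlite3.Connection, min_pairs: int = 500) -> bool:
    """Fire the training trigger once enough unused pairs have accumulated."""
    (unused,) = conn.execute(
        "SELECT COUNT(*) FROM training_pairs WHERE used_in_run IS NULL"
    ).fetchone()
    return unused >= min_pairs


conn = sqlite3.connect(":memory:")  # use a file path for persistent state
conn.executescript(SCHEMA)
print(pass_rates(conn), should_trigger_training(conn))
```

Linking every evaluation to a prompt version and a model name is what lets you say whether a change helped, instead of just noticing that things feel better.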

Evaluating Whether You Need a Flywheel

Before building a flywheel, pause and ask whether you actually need one. Not every AI integration benefits from self-improvement loops. If you’re calling an API and the output quality is already good enough, adding an eval framework and training pipeline can be extra overhead with little payoff. A flywheel is worth the complexity when a few things line up:

The task is repeated with variation. You're generating the same kind of output hundreds or thousands of times (components, documents, classifications, extractions), but each input is slightly different. One-off tasks don't produce enough volume for the system to learn from.

Quality is measurable. You can define what "good" and "bad" look like in a rubric. If quality is purely subjective ("does this feel right?"), the eval loop has nothing to score against. The more concrete the criteria, the better the flywheel works. "Does the schema compile?" is a stronger signal than "does the code look clean?"

Domain expertise matters. The base model's generic output isn't good enough. You need it to understand your conventions, your patterns, your edge cases. If a general-purpose model already produces exactly what you need, there's nothing for the flywheel to improve.

The cost of bad output is real. If a developer spends 30 minutes fixing every generated component, and you're generating 50 components a week, that's 25 hours of rework. A flywheel that cuts rework in half saves about 12.5 developer hours every week, so it can pay for itself quickly. If bad output just means regenerating with a better prompt, the stakes may not justify the infrastructure.

If none of those apply (you're building a general-purpose chatbot, your AI usage is occasional rather than systematic, or you don't own the model layer at all), a flywheel is probably overkill. Pure API consumers with no access to fine-tuning can still benefit from Loop 1, but the ceiling is lower without the ability to touch model weights.

The honest answer for most teams: start with Loop 1. Add an eval framework to whatever you're already building. Track pass rates. Let the data tell you whether Loop 2 is worth the investment. You'll know when you need it. Loop 1's improvements will plateau, and the failures that remain will be things no prompt can fix.
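
"Plateau" can be made measurable with a simple check over the pass rate of each prompt version. This is only a sketch under assumptions of my own: the `has_plateaued` name, the three-version window, and the two-point minimum gain are all arbitrary starting values you would tune.

```python
def has_plateaued(
    pass_rates: list[float], window: int = 3, min_gain: float = 0.02
) -> bool:
    """pass_rates holds one pass rate per prompt version, oldest first.

    Returns True when the best of the last `window` versions beats the best
    earlier version by less than `min_gain`.
    """
    if len(pass_rates) <= window:
        return False  # not enough history to judge
    recent_best = max(pass_rates[-window:])
    earlier_best = max(pass_rates[:-window])
    return recent_best - earlier_best < min_gain


print(has_plateaued([0.50, 0.60, 0.70, 0.80]))              # still climbing
print(has_plateaued([0.50, 0.70, 0.80, 0.81, 0.80, 0.81]))  # gains have flattened
```

When the check starts returning True, the failure reasons that remain are the ones worth reading closely, since they are the candidates for Loop 2.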

Teaser

And with that said, I've been working on one full-blown POC and several more ideas for how this could be used, and I'll be rolling them out shortly. Stay tuned!