Give your AI copilot a RAG wiki of its own work
Bigger context windows mean more hallucination, not less. The wiki is what scopes the agent back down.
I asked Sonnet, through Copilot, to modify one specific file that was responsible for SLO automation in Dynatrace. The change was small. One file. One tweak.
It modified the wrong file.
The session had auto-compacted partway through. The conversation history that explained which file was load-bearing had been summarized down to a few paragraphs, and the load-bearing nuance was no longer in there. By the time I asked for the change, the agent’s working memory was a patchwork. It picked a different configuration file I had touched earlier and applied my SLO change there. The SLO file itself was untouched.
That was the day I started giving every project its own wiki.
The Shape of the Problem
If you have worked with Claude Code, Sonnet, Codex, or any agentic coding setup for more than a couple of weeks, this pattern is familiar. Compaction is lossy by design. It summarizes conversation history to fit the context window, and the nuance and rationale that lived in chat are gone. CLAUDE.md, auto memory, and unscoped rules survive (re-injected from disk). Conversation history does not.
Anything that lives only in the conversation is at risk every time the context window fills. Anything on disk survives.
New sessions are worse. Every fresh session means feeding context again. Here is what we built. Here is what we decided. Here is what is pending. You are playing telephone between sessions. The receiver (your agent) gets a slightly degraded copy every time, and half the time you forget to mention a decision you made three sessions ago because the decision now feels obvious to you.
The intuitive fix is to fatten the project’s CLAUDE.md. Stuff every directive in there. Anthropic’s own guidance pushes back: target under 200 lines. Past that, the file consumes more context and reduces adherence. A 2,000-line CLAUDE.md does not solve the memory problem. It creates a new one.
So the wiki is not an aesthetic choice. It is the response to a structural failure: compaction loses the nuance, fresh sessions play telephone, and a bloated CLAUDE.md only adds a new problem.
Karpathy’s wiki idea, rotated
The seed for this came from Andrej Karpathy. He floated the idea of an LLM wiki, a structured agent-readable knowledge base, and most early uptake went in the personal-knowledge-management direction. People used it as a second brain.
I rotated it ninety degrees. Instead of a wiki per person, a wiki per project.
Every architecture decision, every build-log entry, every open question, every review finding, every workflow definition. All of it living in flat markdown that the agent reads on every session start and writes back to as work happens. Not embeddings. Not a vector store. Markdown. Files I can open in any editor, files Claude can grep, files git tracks line-by-line.
Here is the link to Karpathy’s original gist.
The pattern is borrowed; the rotation to per-project is what I have been refining across a few production projects since.
Where the wiki actually lives
A wiki is only useful if both you and the agent can read and write to the same source. So before talking about what goes in it, the question is where it lives on disk.
The structural decision that took me longest to settle on: the wiki does not live inside the project repo. It sits as a sibling.
```
parent/
├── kb/
│   ├── raw/        # PRDs, arch docs, performance analyses
│   ├── sessions/   # YYYYMMDD-HHMM-slug.md per Claude session
│   └── wiki/       # the structured knowledge base
└── project/        # the actual code
```

Two reasons. First, the repo stays clean. No accidental commits of session logs into a project repo. No .gitignore gymnastics. The agent reads from kb/, then drops into project/ as the working directory for actual code changes.
Second, the wiki can be edited by both me and Claude in parallel. When I am sketching out an architecture, those notes land in kb/ as markdown. Claude can read them, refine them, ask questions. We are both working on the same files. That symmetry is the thing flat markdown gets right that vector stores get wrong.
Here is the shape one of my recent projects converged on:
```
kb/wiki/
├── architecture/            # design decisions and rationale
├── artifacts/               # schemas, generated specs, fixed outputs
├── competitors/             # competitive landscape (per-tool + summary)
├── core/                    # load-bearing product mechanics
├── ideation/                # origin / conversation summaries
├── reviews/                 # multi-agent review outputs
├── validation/              # live-environment test results
├── workflow/                # user journeys + CLI command surfaces
├── index.md                 # backlinked table of contents
├── log.md                   # wiki-mutation chronology
├── open-questions.md        # outstanding threads, marked closed when resolved
├── roadmap.md               # release / build cycle plan
├── build-log.md             # what's been built, across the whole project
├── build-plan.md            # the overarching plan
└── build-plan-template.md   # reused per-feature / per-phase
```

Smaller projects run with fewer directories. Some never need a competitors/. Some grow a migrations/ or a security/. The structure is adaptive, not prescriptive. When I bootstrap a new wiki, I let Claude propose what directories to include based on the actual product and architecture documents. The question is not “what is the canonical structure.” The question is “what does this specific project’s runbook need to track?”
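One convention worth making concrete: entries in open-questions.md carry a status and get closed in place rather than deleted, so resolved threads stay findable. A sketch of the shape I use (the IDs, fields, and examples here are illustrative, not a canonical format):

```
## OQ-014: Should SLO thresholds live in config or in the wiki?
- Status: OPEN
- Raised in: sessions/20250412-2130-slo-refactor.md
- Context: thresholds currently duplicated in two places

## OQ-009: Backport normalization fix to release 1?
- Status: CLOSED (release 14)
- Resolution: backported; see build-log.md entry for release 14
```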
How the wiki drives runtime behavior
A static directory of markdown files would be a documentation site. What makes the wiki an actual harness component is that some of the files in it are not just records; they are inputs that drive runtime behavior. The build-plan-template.md file is the clearest example.
When I tell Claude “let’s work on the next feature,” it does not write code first. It checks the wiki. Is there a build plan for this feature? No? Then before any code gets touched, it constructs one.
It pulls relevant requirements from the architecture document, cross-references the product doc, follows the build-plan-template structure, and generates a per-feature plan.
Thirty seconds. Maybe a minute. Then code.
The token cost is negligible. What the loop buys you is something hard to measure but real: every feature has an artifact you can compare against. Was it implemented? Tested? Reviewed? All of that anchors to a plan that existed before the code did.
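For concreteness, here is a trimmed sketch of what such a template can look like. The headings are illustrative, not canonical; the point is that every feature plan answers the same questions:

```
# Build Plan: <feature>

## Source requirements
Links to the architecture and product doc sections this feature implements.

## Scope
What is in. What is explicitly out.

## Design decisions
Choices made for this feature, with rationale.

## Test plan
What proves it works, and in which environment.

## Review gate
Which multi-agent reviews must run before the release tag.

## Status
Planned | In progress | Implemented | Tested | Reviewed
```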
Skip the loop and that context lives somewhere else. In your head. In a Slack thread. In a half-remembered conversation. The not-quite-remembered parts come to bite you in the ass when the next feature lands and you have forgotten the constraint that mattered.
Where the operational rules live
The build-plan-template loop only fires because of an operational rule that tells Claude to consult the template before any new feature is touched. That rule, along with everything else that defines how the agent interacts with the wiki, needs to live somewhere. The natural reflex is to put it all in CLAUDE.md.
I do not. The rules live in project-scoped rule files under .claude/rules/, with a baseline set at ~/.claude/rules/ applying at the user level. CLAUDE.md itself stays lean and points at the wiki.
Two reasons.
Portability. A core set of rules ports across every project: git workflow, session-log triggers, build-log writeback, lint discipline. I keep these at the user level so they apply everywhere. Project-level rules sit on top and supersede when they conflict, giving me the override lever without copy-pasting a baseline into every new repo.
Context budget. Anthropic’s guidance is to keep CLAUDE.md under 200 lines. Longer files consume more context on session start and reduce adherence. Splitting rules into smaller scoped files keeps each focused, and path-scoped rules only load when matching files are touched.
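For reference, the lean CLAUDE.md that results is mostly pointers. A sketch of what mine converges toward, assuming the sibling kb/ layout from earlier (the exact wording is mine, not a prescribed format):

```
# CLAUDE.md

Project knowledge base: ../kb/ (sibling directory, not in this repo).

- Wiki entry point: ../kb/wiki/index.md
- Session logs: ../kb/sessions/
- Operational rules: .claude/rules/ (plus the ~/.claude/rules/ baseline)

Consult the wiki before writing code. Never commit ../kb/ content here.
```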
Concretely, the rules I keep at the project level look something like this:
- On session start: read the most recent session log and the build log.
- Main branch is locked. All merges go through PR.
- After a feature ships: update the session log, the build log, the build plan, and the roadmap. Then surface a brief summary.
- On request, run a multi-agent review (code, security, language-idiomatic, architect) and store findings in wiki/reviews/.
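Written out as a rule file, the writeback rule might look like this sketch (the filename and wording are mine, not a canonical format):

```
# .claude/rules/feature-writeback.md

After a feature ships:
1. Append an entry to ../kb/wiki/build-log.md.
2. Update the feature's build plan status.
3. Update ../kb/wiki/roadmap.md if the sequence changed.
4. Write a session log to ../kb/sessions/YYYYMMDD-HHMM-<slug>.md.
5. Surface a brief summary to me.
```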
Writeback triggers and the lint regimen
Writeback is rule-driven, not hook-driven. I considered hooks. I rejected them.
Hooks fire on every event, indiscriminately. That is the wrong granularity. When I am compacting a session intentionally, I do not want a hook firing and shoving the entire pre-compaction state into a session log. I want a deliberate, manual nudge: “hey, log everything we have done up to this point before we compact.” The agent does it. The compaction proceeds. Done.
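What the nudge produces is a session log shaped roughly like this (a sketch; the sections are whatever you and Claude settle on, and the contents here are invented):

```
# 20250503-2215-slo-automation.md

## What happened
Added burn-rate alert tiers to the SLO automation config.

## Decisions
Thresholds stay in the config file, not the wiki (see OQ-014).

## Pending / hygiene
Backport to release 1 still owed. Logged in open-questions.md.

## Next session
Start from the build plan for the notification feature.
```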
The other writeback discipline is lint. Every four or five feature releases, or once per major release, I trigger a wiki lint pass. The agent walks the wiki:
- Are there orphan documents, files that nothing else links to?
- Are the backlinks current, or have any been broken by file renames?
- Is index.md accurate? Does it reflect what is actually in the wiki?
I keep lint on-trigger by design. Five small feature adds in a week do not change the wiki shape. When I am pushing through a wave of major changes, I trigger a lint and let Claude tighten everything up. The lever stays in my hand.
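The lint pass itself is just another rule file the trigger points at. A sketch, again with hypothetical filename and wording:

```
# .claude/rules/wiki-lint.md

When asked to lint the wiki:
1. List every file under ../kb/wiki/ and diff the list against index.md.
2. Flag orphans: files no other wiki page links to.
3. Check every backlink resolves; repair links broken by renames.
4. Confirm resolved entries in open-questions.md are marked closed.
5. Report what changed before writing anything back.
```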
When the wiki outgrows index.md traversal, I bring in qmd, a markdown query tool with BM25 plus vector retrieval plus LLM re-ranking. It sits in the operational rules as the escape valve. Claude can decide how far down the rabbit hole it wants to travel.
The payoff: cross-release consistency, twelve releases apart
I shipped a project’s first feature fast. Over seven days, working evenings after my day job, Claude and I rolled out somewhere around eleven or twelve releases. One per session, sometimes two.
Then I took a two-day break. I deserved it.
When I came back, fresh session, no in-memory context, I asked: “OK, what is next in the queue?” Claude told me the next three releases. Standard.
I asked the follow-up: “Anything else pending?”
Claude came back with hygiene items. Things we had discovered during a previous release that we had flagged for follow-up. Items we needed to fix before starting the next one. I had completely forgotten them. Claude had not, because the wiki had logged them when they surfaced.
Fast-forward several weeks. Release fourteen is done. I run the same check.
Claude tells me there is a normalization fix from a middle release that got rolled into the latest branch but had not been backported to release one. The first one we ever shipped. Twelve releases and several weeks back.
That is not context. That is a runbook. The session that surfaced the original finding closed long ago. What survived was the wiki: decisions, hygiene items, and pending changes recorded across fourteen releases, regardless of how or in what sequence they came about.
This is the failure mode that compaction cannot solve, that bigger context windows cannot solve, that better prompting cannot solve. You need a persistent artifact outside the session, and the agent disciplined to read and write it.
The payoff: the harness catching itself
Same project. A release tag was about to ship. The multi-agent review was supposed to run before the tag, and we had accidentally skipped it.
Claude caught the gap. As it was logging the build, it noticed the agent review had not run. Tag already in flight. So it let the tag ship, ran the review post-tag, logged the findings into the wiki, fixed them, and queued the fixes to ship with the next release.
I did not catch the miss. The harness caught itself, because the wiki was structured to expect a review at that gate, and the build-log discipline made the gate visible.
That is the load-bearing claim. The wiki is not just memory. It is the checklist the agent uses to keep itself honest.
When NOT to do this
The honest counterweight: this is overkill for small projects. If I am writing a single-script Python utility, say a cron job that scans my email for “unsubscribe” and deletes those messages, I do not set up a wiki. I describe the script, Claude writes it, I drop it on my system, done. Spinning up a wiki is more time than the project deserves.
Same for one-page websites, prototype demos, anything where the entire scope can be articulated in one prompt. If the project is small enough that summary-after-compaction is sufficient context for the next session, the wiki is overhead.
What matters more: you can adopt the wiki retroactively. Start small. When the project grows, when sessions start playing telephone, when you are spending more time re-explaining than building, that is when you bootstrap. Same flow. Same prompt. Same five-minute interview. It is not too late.
The bootstrap recipe
Here is the prompt I use to seed a new project wiki. It is generic enough to copy-paste for any codebase:
```
I want to set up an LLM wiki for this project.

Attached are the product, technical, and architecture documents.
Read and internalize them so you understand what we are building.

Then read Andrej Karpathy’s LLM wiki guide: <Wiki Link>

Once you understand both the project and the wiki pattern, ask me the
questions you need answered to design a wiki structure for THIS specific
project: directory layout, root files, conventions, what is worth
tracking and what is not.

I will answer. Then build the wiki: directories, root files, backlinks, index.
```
Five minutes of Q&A. Maybe seven if the project is unusually complex. Claude proposes a directory layout adapted to the project’s specific shape, asks me to confirm or adjust, then builds the wiki: directories, root files, backlinks, an index.md that points everywhere.
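The resulting index.md is nothing fancy. A sketch of the shape (the entries here are illustrative):

```
# Wiki Index

## Start here
- [build-log.md](build-log.md): what has shipped, newest first
- [roadmap.md](roadmap.md): what ships next
- [open-questions.md](open-questions.md): unresolved threads

## By area
- [architecture/](architecture/): design decisions and rationale
- [core/](core/): load-bearing product mechanics
- [reviews/](reviews/): multi-agent review findings
```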
Then the loop begins. Every session writes back. Every feature spawns a build plan. Every review lands in reviews/. Every hygiene item gets logged. The wiki compounds.
The runbook and the output
Here is the durable mental model:
The wiki is the project’s runbook. The code is the project’s output. Without the wiki, every new session is just telephone.
Two artifacts, two lifecycles. The code answers what does the system do. The wiki answers what did we decide, what is pending, what hurt last time, and what should the next session pick up first. Conflate them and you lose the runbook, because git pretends to capture it and does not.
The wiki is the foundational harness component. Sub-agent reviews depend on it. Multi-session continuity depends on it. Cross-feature consistency depends on it. Every other piece of harness scaffolding leans on this one thing.
If you take one thing from this post: when you start your next agentic project, set up the sibling kb/ directory before you write a single line of code. Use the bootstrap prompt above. Watch what stops happening: the telephone game, the lost rationale, the hygiene items that fall off the back of the truck twelve releases later.
The harness turns agentic coding from a chat into engineering. The wiki turns the harness from a one-shot setup into a practice that compounds.