portly
I don't understand the point of automating note taking. It never worked for me to copy-paste text into my notes, and now you can 100x that?
The whole point of taking notes for me is to read a source critically, fit it in my mental model, and then document that. Then sometimes I look it up for the details. But for me the shaping of the mental model is what counts
frocodillo
First of all, this is more than just note taking. It appears to be (yet another) harness for coordinating work between agents with minimal human intervention. And as such, shouldn't part of the point be to not have to build that mental model yourself, but rather offload it to the shared LLM "brain"?
Highly debatable whether it’s possible to create anything truly valuable (valuable for the owner of the product that is) with this approach, though. I’m not convinced that it will ever be possible to create valuable products from just a prompt and an agent harness. At that point, the product itself can be (re)created by anyone, product development has been commodified, and the only thing of value is tokens.
My hypothesis is that “do things that don’t scale”[0] will still apply well into the future, but the “things that don’t scale” will change.
All that said, I’ve finally started using Obsidian after setting up some skills for note taking, researching, linking, splitting, and restructuring the knowledge base. I’ve never been able to spend time on keeping it structured, but I now have a digital secretary that can do all of the work I’m too lazy to do. I can just jot down random thoughts and ideas, and the agent helps me structure it, ask follow-up questions, relate it to other ongoing work, and so on. I’m still putting in the work of reading sources and building a mental model, but I’m also getting high-quality notes almost for free.
bushido
If you think of an agent harness as a tool which you use to build your product, then I think you might be absolutely right. I don't see it being easy for a harness to ever build a product.
I actually think that for the harnesses which do end up building products, the harness will be the product.
As an example, I have a harness which I have my entire team use consistently. The harness is designed for one thing: to get the results I get with less nuanced understanding of why I get it.
Mind you, most of my team members are non-technical, or at least would have been considered non-technical two years ago.
These days, I spend most of my time fine-tuning the harness. What that gives me is a team producing at 5x their capacity from three months ago, and I get easier-to-review, more robust pull requests that I have more confidence in merging.
It's still a far cry from automating the entire process. I still think humans need to specify the outcomes, even for harnesses, to produce the results.
frrandias
Hey, one of the contributors to Wuphf here
I think your take is right. This isn't going to help with the internalization of knowledge that note taking will get you. I do think there is some value in the way we've set up blueprints of agents: if you haven't set up a business before, they can either teach you about role functions in a business or give you a head start on a business that doesn't create something new. At the very least it's a quick setup for getting to experiment.
On the note-taking part (and as disclosure): we are working on a context graph product that lessens the work of reading sources, especially over time and breadth, to help with a lot of the structure you've mentioned.
jstummbillig
> My hypothesis is that “do things that don’t scale”[0] will still apply well into the future, but the “things that don’t scale” will change.
Say more?
mplappert
I think there's a serious issue with people using AI to do an immense amount of busywork and then never looking at it again. A colossal waste.
simsla
Everyone is writing. Nobody is reading.
mohamedkoubaa
In my over a decade of experience as a software engineer, writing code was always a smaller fraction of my time compared to reading code, debating code with colleagues, and wrangling ops. Optimizing for velocity of writing code will inevitably lead to spaghetti at best and vaporware at worst.
skybrian
The LLMs do quite a lot of reading. The question is what to feed them. (What counts as good context?)
nicbou
It's such a promising technology, but it seems like the primary use case is to drown everything in noise.
4b11b4
It's almost akin to eating without thinking... just a waste, no nutrition. Additionally, damaging.
big_man_ting
Totally agree re note taking. We treat our notes way too lightly; like an attic or a basement, they lead to hoarding more stuff than you'll ever need.
Most things do not need to end up in your notes, and LLMs add too much noise, one that you likely never personally verify/filter out at all.
JA Westenberg made a good video essay about it a few days ago:
stingraycharles
The few scientific studies out there actually show a degradation of output quality when these markdown collections are fully LLM-maintained (as opposed to an increase when they're human-maintained), which I found fascinating.
I think the sweet spot is human curation of these documents, but unsupervised management is never the answer, especially if you don’t consciously think about debt / drift in these.
saadn92
I've been running a variation of this for ~6 months. What seems to work: a background process that reads conversation transcripts after sessions end and then extracts decisions/rejected approaches into structured markdown. I review before I promote it into the context.
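A minimal sketch of such a post-session pass, assuming transcripts are plain text and decisions are flagged by a marker convention (the DECISION:/REJECTED: markers here are invented for illustration; the commenter's actual setup may differ):

```python
import re

# Hypothetical convention: agents flag lines as "DECISION: ..." or "REJECTED: ..."
MARKERS = {"DECISION": "Decisions", "REJECTED": "Rejected approaches"}

def extract_decisions(transcript: str) -> dict[str, list[str]]:
    """Scan a session transcript for flagged decision lines."""
    found = {section: [] for section in MARKERS.values()}
    for line in transcript.splitlines():
        m = re.match(r"^\s*(DECISION|REJECTED):\s*(.+)$", line)
        if m:
            found[MARKERS[m.group(1)]].append(m.group(2).strip())
    return found

def to_markdown_draft(found: dict[str, list[str]]) -> str:
    """Render extracted items as a markdown draft, to be reviewed by a
    human before promotion into the shared context."""
    parts = []
    for section, items in found.items():
        if items:
            parts.append(f"## {section}\n")
            parts.extend(f"- {item}" for item in items)
            parts.append("")
    return "\n".join(parts)
```

The review-before-promotion step stays manual; the script only produces the draft.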
criley2
Are you referring to the one (1) study that showed that when cheaper LLMs auto-generated an AGENTS.md, it performed more poorly than a human-edited AGENTS.md? https://arxiv.org/abs/2602.11988
I'd love to see other sources that seek to academically understand how LLM's use context, specifically ones using modern frontier models.
My takeaway from these CLAUDE.md/AGENTS.md efforts isn't that agents can't maintain any form of context at all, rather, that bloated CLAUDE.md files filled with data that agents can gather on the spot very quickly are counter-productive.
For information which cannot be gathered on the spot quickly, clearly (to me) context helps improve quality, and in my experience, having AI summarize some key information in a thread and write to a file, and organize that, has been helpful and useful.
tomjwxf
[dead]
arikrahman
I thought this was parody at first as well: a redundant, useless product named after the redundant, useless product of the same name from The Office (Wuphf.com).
adamsmark
I use my Openclaw setup to record notes I don't ever want to remember the details of. Here are some examples:
Storing my health insurance's Member ID, RxBIN, and other data.
Recording the serial number of a product I will be calling technical support for.
Organizing files to be more logical, deduplicating or consolidating as needed.
Whenever I want this info, I'll just ask my LLM to pull it up.
johntash
Do you use local models for these, or are you okay with giving private details to anthropic/openai?
(that's one of my biggest hurdles for really adopting any useful assistant type of agent)
Bridged7756
Ditto.
It circles back to the question: is this unimportant enough for me to delegate it to an LLM that might get it wrong? If the answer is yes, why even do it to begin with? If the answer is no, you have to do it manually.
Personally, though, I see value in this type of automation. Stuff like tag categorization and indexing that otherwise would've been lost seems like a good fit for LLMs. Whether they're an ideal solution, or something else like a search engine would've been a better fit, is a different question.
zby
Reviewed: https://zby.github.io/commonplace/agent-memory-systems/revie...
It's the third LLM wiki on the front page in 24 hours! Obviously it's a hot topic. I have my own horse in that race, so I might not be objective, but I've compiled a wishlist for these systems: https://zby.github.io/commonplace/notes/designing-agent-memo...
I wish there was a chance for collaboration - everybody coding their own system seems like a lot of effort duplication.
Myrmornis
Your notes look really interesting, thanks. I'm curious: from the prose style it's clear they were written by an LLM. For design notes like this, do you sort of have a mental TODO to go back and write them up in your own words to make sure they really capture your own opinions?
zby
For the design notes like: https://zby.github.io/commonplace/notes/designing-agent-memo... - I iterate over and over to clean them. This one is also a compilation with many intermediate documents.
But the reviews are written automatically - here are the instructions: https://github.com/zby/commonplace/blob/main/kb/agent-memory...
Overall the knowledgebase is a mixture of these. I have this disclaimer on the first page:
This KB is itself agent-operated: a human directs the inquiry, AI agents draft, connect, and maintain the notes. The framework for building knowledge bases is documented using that framework.
I hope that is enough; I've seen many people get angry about people publishing LLM-generated work.
najmuzzaman
love the "Borrowable Ideas" section. would suggest to definitely borrow them.
full disclosure: we started as a context infra company (nex.ai) from long before Karpathy even came up with the LLM wiki idea, and have barely exposed any of that stuff to WUPHF, but we're starting to open some of that now. glad to see the concerns in the comparison are things our context infra already built for.
still, happy to collab & share learnings, and of course avoid duplication.
4b11b4
yes, generative slot machines are isolating. You say you "wish there was a chance"? As if there isn't?
frrandias
taking a look :)
SOLAR_FIELDS
I mean, honestly, this stuff is in roll-your-own territory now. Run QMD on an Obsidian vault and that's like 80% of the way there, and you can probably do that in < 2 hours.
vdelpuerto
[dead]
johntash
How do you keep LLMs from writing _too much_? I've built a few similar tools and systems, and they're all way too easy for the LLM to just keep documenting things, to the point that the whole system is a mess and becomes less useful the bigger it gets.
One example experiment I had was seeing if I could get an LLM to build its own knowledge wiki, where I would paste a few links and have it go do some research on whatever the subject was, then distill what it found into specific wiki pages with links to other pages or the source refs. It looked good until you read the actual data.
This was a few years ago though, so maybe it's worth trying again with something like opus 4.7.
hmokiguess
Someone should build a StackOverflow revival as the solution to this: a distributed knowledge graph curated by humans but driven by collective LLMs trying to problem-solve their way out of things and stopping to ask questions in the old-fashioned way.
I would be fine with my agent saying “hey, we hit a wall here, here’s the question posted on SO, I flagged to come back to it later once we have an answer”
mellosouls
Karpathy's original post for context:
batoga
Put AI in your product name, make a billion dollars. Put Karpathy in your blog article, get hired by Anthropic as a Principal Engineer. Milk money as long as the fad lasts. No one is thinking about customer needs; everyone is trying to wash their hands in the wave while it lasts.
girvo
Just like NFTs, just like the blockchain before that, in some ways kind of like the Web 2.0 craze (though we at least built some things then and the tight financing at the time kept a lid on it).
This LLM stuff at least has some real possibilities and value, and is very fun tech to learn about and play with.
I long ago accepted that there's money to be made; as long as it's not unethical, get involved. You can build cool things that do have value while enjoying the VC/PE money sloshing around.
ting0
Hey man, if it works it works. There's a reason everyone is creating AI tools. We're all buying them. I'm still waiting for someone to make a world-class cli harness that can replace Claude Code but solves the memory and design problem. Web design is still a nightmare with LLMs.
najmuzzaman
thanks for saying this. one more reason people are hating on SaaS so much is that the UI is dry and, for many, unusable. on the other hand, the cool AI agents have UI which is fun but, again, unusable.
fun and usable can co-exist and we are trying our best to prove that. also, we have an amazing designer who never worked at big tech and has no accolades, but man, got taste.
hmokiguess
Could you elaborate on the web design point? I find them excellent at it personally and it’s where I most often get value out of them
rglover
Cline. Works as a CLI and VSCode plugin.
najmuzzaman
alright sir/ma'am/neither. we built an AI-native CRM backed by HubSpot founder Dharmesh Shah last year before this, had revenue, iterated to focus on context graph infra (which looked like the right moat to focus on), did enterprise PoCs, and all of that distilled into this personal project i built on the side to help my own work. it turned out to be the right interface for making context infra usable.
not interested in a job at Anthropic as Principal Engineer (i used to be a HubSpot Product Manager with a healthy income, much better than what i am making now, or for the next few years).
took multiple bets and did iterations because we talked to customers and kept evolving while our old competition is still building an AI CRM "in stealth".
been around enough to know waves don't matter but there is still value behind those waves worth extracting away.
mncharity
Just to stir thought, I note the TiddlyWiki[1] community (wiki as a self-modifying single html file; 20+ years old) has of course been exploring AI tooling... though not necessarily as an agentic environment. There's a markdown plugin, and others to make the file executable, or into a self-serving web app. Git is more problematic. So hypothetically, one could have a single-file agentic wiki wandering around and self-editing.
GistNoesis
The space of self-building artefacts is interesting and is booming now because recent LLM versions are quickly becoming good at it (in particular those of the "coding" kind).
I've also experimented recently with such a project [0] with minimal dependencies and with some emphasis on staying local and in control of the agent.
It's building and organising its own sqlite database to fulfil a long running task given in a prompt while having access to a local wikipedia copy for source data.
A very minimal set of harness and tools to experiment with agent drift.
Adding an image processing tool in this framework is also easy (by encoding images as base64 (details can be vibecoded by local LLMs) and passing them to llama.cpp).
It's a useful versatile tool to have.
For example, I used to have some scripts which processed invoices and receipts in some folders, extracting amount, date, and vendor from them using Amazon Textract; then I have a UI to manually check the numbers and put the result in a CSV for the accountant every year. Now I can replace the Amazon Textract requests with a llama.cpp model call with the appropriate prompt, while still using my existing invoice tools, and with a prompt I can do a lot more creative accounting.
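For reference, swapping Textract for a local model could look roughly like this, talking to llama.cpp's bundled HTTP server via its /completion endpoint; the prompt wording, field names, and JSON-parsing heuristic are my own, not from the commenter's actual scripts:

```python
import json
import re
import urllib.request

# Hypothetical prompt; the YYYY-MM-DD constraint makes replies easier to check.
PROMPT = ("Extract vendor, date (YYYY-MM-DD) and total amount from this "
          "receipt text. Reply with one JSON object only.\n\n{text}")

def parse_reply(reply: str) -> dict:
    """Pull the first JSON object out of a model reply (models often
    wrap JSON in prose or code fences)."""
    m = re.search(r"\{.*\}", reply, re.DOTALL)
    return json.loads(m.group(0)) if m else {}

def extract_receipt(text: str, url: str = "http://localhost:8080/completion") -> dict:
    """Ask a locally running llama.cpp server instead of Amazon Textract."""
    body = json.dumps({"prompt": PROMPT.format(text=text),
                       "n_predict": 256}).encode()
    req = urllib.request.Request(url, body,
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_reply(json.loads(resp.read())["content"])
```

The manual-check UI stays in the loop; the model call only replaces the OCR/extraction step.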
I have also experimented with a vibecoded variation of this code to drive a physical robot from a sequence of camera images, and while it does move and reach the target in simple cases (even though the LLM I use was never explicitly trained to drive a robot), it is too slow (10s to choose the next action) for practical use. (The current non-deep-learning controller I use for this robot does the vision processing loop at 20Hz.)
manfre
llm-wiki is a popular topic. I've found it to be very helpful for keeping details of the random web pages I visit and want to remember. I created a Claude plugin designed to work with Obsidian's Web Clipper and qmd for search.
dataviz1000
LLMs and the agents that use them are probabilistic, not deterministic. They accomplish something a percentage of the time, never every time.
That means the longer an agent runs on a task, the more likely it will fail the task. Running agents like this will always fail and burn a ton of token cash in the process.
One thing that LLM agents are good at is writing their own instructions. The trick is to limit the time and thinking steps in a thinking model, then evaluate, update, and run again. A good metaphor is that agents trip. Don't let them run long enough to trip. It is better to let them run twice for 5 minutes than once for 10 minutes.
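One way to sketch that "two short runs beat one long run" discipline; `step` and `evaluate` are stand-ins for whatever drives and scores the agent, and the budgets are arbitrary:

```python
import time

def run_bounded(step, evaluate, budget_s=300.0, max_steps=50):
    """Run an agent step loop under a wall-clock and step budget,
    then stop and hand the state back for evaluation.

    step(state) -> state advances the agent one action;
    evaluate(state) -> bool decides whether the result is good enough."""
    state = None
    deadline = time.monotonic() + budget_s
    for _ in range(max_steps):
        if time.monotonic() >= deadline:
            break  # don't let the agent run long enough to "trip"
        state = step(state)
    return state, evaluate(state)
```

The point is that evaluation happens at a fixed cadence: two 5-minute bounded runs with an update in between, rather than one unsupervised 10-minute run.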
Give it a few weeks and self-referencing agents are going to be at the top of everybody's twitter feed.
iterateoften
It’s also that agents and ML reach local maxima unless external feedback is given. So your wiki will reach a state and get stuck there.
dataviz1000
Here is an interesting thing.
> "The LLM model's attention doesn't distinguish between "instructions I'm writing" and "instructions I'm following" -- they're both just tokens in context."
That means all these SOTA models are very capable of updating their own prompts. Update prompt. Copy entire repository in 1ms into /tmp/*. Run again. Evaluate. Update prompt. Copy entire repository ....
That is recursion, like Karpathy's autoresearch, it requires a deterministic termination condition.
Or have the prompt / agent make 5 copies of itself and solve for 5 different situations to ensure the update didn't introduce any regressions.
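A sketch of that loop with a deterministic termination condition; the `run` and `evaluate` callables are hypothetical stand-ins for "agent executes with this prompt" and "score the result", not any actual harness API:

```python
import shutil
import tempfile
from pathlib import Path

def improve_prompt(run, evaluate, prompt, repo, max_rounds=5, target=0.9):
    """Recursive prompt refinement with a deterministic stop: terminate
    on a score threshold or a fixed round count, whichever comes first.

    run(prompt, workdir) -> new_prompt lets the agent rewrite its own
    instructions; evaluate(workdir) -> float scores the round's output."""
    best_prompt, best_score = prompt, 0.0
    for _ in range(max_rounds):
        # Work on a throwaway copy so a bad round can't damage the repo.
        workdir = Path(tempfile.mkdtemp()) / "repo"
        shutil.copytree(repo, workdir)
        candidate = run(best_prompt, workdir)
        score = evaluate(workdir)
        if score > best_score:
            best_prompt, best_score = candidate, score
        if best_score >= target:
            break  # deterministic termination condition
    return best_prompt, best_score
```

Running several copies against different situations, as suggested above, would just mean calling `evaluate` over a suite instead of a single workdir.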
> reach local maxima unless external feedback is given
The agents can update themselves with human permission. So the external feedback is another agent plus the selection bias of a human. It is close to the right idea. I, however, am having huge success with the external feedback being the agent itself. The big difference is that a recursive agent can evaluate performance within a confidence interval rather than chaos.
najmuzzaman
[dead]
renan_warmling
The idea seems good, but the system lacks snapshots and data enrichment for each file iteration. If the code breaks or has a bug, the agent could roll back the code and generate enrichment explaining the reason for the rollback, thus generating a new snapshot with updated states. Another issue is the weight of opinion: how will you guarantee integrity and consistency throughout the production of an operating system and avoid collisions and violations of business rules? And regarding persisted memory, currently your system doesn't distinguish between temporal and atemporal memory (business rules, software behaviors and functions, security policies, and governance between agents). The idea is good, but to function as a team, this must also be considered.
Abby_101
The "garbage facts in, garbage briefs out" caveat is the part I'd want stress-tested. In my own LLM features, the context that decays fastest is what agents wrote without a human glance. Six months in, you have entries that are confidently wrong and the lint pass can't tell which. Does the promotion flow require human review, or can agents self-promote?
najmuzzaman
[dead]
I shipped a wiki layer for AI agents that uses markdown + git as the source of truth, with a bleve (BM25) + SQLite index on top. No vector or graph db yet.
It runs locally in ~/.wuphf/wiki/ and you can git clone it out if you want to take your knowledge with you.
The shape is the one Karpathy has been circling for a while: an LLM-native knowledge substrate that agents both read from and write into, so context compounds across sessions rather than getting re-pasted every morning. Most implementations of that idea land on Postgres, pgvector, Neo4j, Kafka, and a dashboard.
I wanted to go back to the basics and see how far markdown + git could go before I added anything heavier.
What it does: -> Each agent gets a private notebook at agents/{slug}/notebook/*.md, plus access to a shared team wiki at team/.
-> Draft-to-wiki promotion flow. Notebook entries are reviewed (agent or human) and promoted to the canonical wiki with a back-link. A small state machine drives expiry and auto-archive.
-> Per-entity fact log: append-only JSONL at team/entities/{kind}-{slug}.facts.jsonl. A synthesis worker rebuilds the entity brief every N facts. Commits land under a distinct "Pam the Archivist" git identity so provenance is visible in git log.
-> [[Wikilinks]] with broken-link detection rendered in red.
-> Daily lint cron for contradictions, stale entries, and broken wikilinks.
-> /lookup slash command plus an MCP tool for cited retrieval. A heuristic classifier routes short lookups to BM25 and narrative queries to a cited-answer loop.
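The fact-log mechanics above can be sketched roughly like this; only the JSONL path layout comes from the description, while the function name, the value of N, and the return-flag design are illustrative assumptions:

```python
import json
from pathlib import Path

SYNTH_EVERY = 5  # "every N facts"; the real N is not specified here

def append_fact(root: Path, kind: str, slug: str, fact: dict) -> bool:
    """Append one fact to the entity's append-only JSONL log and report
    whether the synthesis worker should rebuild the entity brief."""
    log = root / "team" / "entities" / f"{kind}-{slug}.facts.jsonl"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a", encoding="utf-8") as f:
        f.write(json.dumps(fact) + "\n")
    count = sum(1 for _ in log.open(encoding="utf-8"))
    return count % SYNTH_EVERY == 0  # trigger synthesis every N facts
```

In the real system the rebuilt brief would then be committed under the separate "Pam the Archivist" git identity for provenance.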
Substrate choices: Markdown for durability. The wiki outlives the runtime, and a user can walk away with every byte. Bleve for BM25. SQLite for structured metadata (facts, entities, edges, redirects, and supersedes). No vectors yet. The current benchmark (500 artifacts, 50 queries) clears 85% recall@20 on BM25 alone, which is the internal ship gate. sqlite-vec is the pre-committed fallback if a query class drops below that.
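For clarity, the recall@20 ship gate is straightforward to compute; a sketch under the assumption that each benchmark query comes with a known set of relevant artifact IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 20) -> float:
    """Fraction of relevant artifacts that appear in the top-k results."""
    if not relevant:
        return 1.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def passes_ship_gate(per_query, k: int = 20, gate: float = 0.85) -> bool:
    """per_query: iterable of (retrieved_list, relevant_set) pairs.
    The gate is the mean recall@k across the benchmark queries."""
    scores = [recall_at_k(r, rel, k) for r, rel in per_query]
    return sum(scores) / len(scores) >= gate
```

Whether the project averages per query or pools hits is an assumption; either way, a query class dropping below the gate is what triggers the sqlite-vec fallback.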
Canonical IDs are first-class. Fact IDs are deterministic and include sentence offset. Canonical slugs are assigned once, merged via redirect stubs, and never renamed. A rebuild is logically identical, not byte-identical.
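A guess at what a deterministic fact ID of this shape could look like; the field choice and truncation length are assumptions, not the project's actual scheme:

```python
import hashlib

def fact_id(entity_slug: str, source_path: str, sentence_offset: int) -> str:
    """Deterministic fact ID: the same inputs always hash to the same ID,
    so a rebuild reproduces identical IDs (logically identical even when
    the repo is not byte-identical)."""
    key = f"{entity_slug}\x00{source_path}\x00{sentence_offset}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```

Including the sentence offset means two facts extracted from adjacent sentences of the same source get distinct, stable IDs.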
Known limits: -> Recall tuning is ongoing. 85% on the benchmark is not a universal guarantee.
-> Synthesis quality is bounded by agent observation quality. Garbage facts in, garbage briefs out. The lint pass helps. It is not a judgment engine.
-> Single-office scope today. No cross-office federation.
Demo. 5-minute terminal walkthrough that records five facts, fires synthesis, shells out to the user's LLM CLI, and commits the result under Pam's identity: https://asciinema.org/a/vUvjJsB5vtUQQ4Eb
Script lives at ./scripts/demo-entity-synthesis.sh.
Context. The wiki ships as part of WUPHF, an open source collaborative office for AI agents like Claude Code, Codex, OpenClaw, and local LLMs via OpenCode. MIT, self-hosted, bring-your-own keys. You do not have to use the full office to use the wiki layer. If you already have an agent setup, point WUPHF at it and the wiki attaches.
Source: https://github.com/nex-crm/wuphf
Install: npx wuphf@latest
Happy to go deep on the substrate tradeoffs, the promotion-flow state machine, the BM25-first retrieval bet, or the canonical-ID stability rules. Also happy to take "why not an Obsidian vault with a plugin" as a fair question.