LLM Knowledge Base

Summary

An LLM Knowledge Base is a personal research system in which an LLM incrementally compiles raw source documents into a structured, interlinked Markdown wiki — a "persistent, compounding artifact" — rather than re-deriving knowledge from scratch on every query, as RAG systems do. Described by entities/andrej-karpathy in a GitHub gist (Apr 4, 2026), the pattern has three layers: immutable raw sources, an LLM-owned wiki, and a schema document that governs the LLM's behavior. This wiki is itself a direct implementation of the pattern.

Details

The Core Distinction from RAG

In conventional RAG systems (NotebookLM, ChatGPT file uploads), the LLM retrieves relevant chunks at query time and re-synthesizes an answer. Nothing accumulates. Subtle questions requiring synthesis across multiple documents must be pieced together fresh every time.

The LLM wiki pattern inverts this: the LLM builds the synthesis once, during ingestion, and keeps it current. When a new source arrives:

  • Entity and concept pages are updated
  • Contradictions with existing claims are flagged
  • Cross-references are woven in

The wiki keeps getting richer: by the time you ask a question, the cross-references are already in place and the contradictions have already been surfaced. A single source might touch 10–15 pages.

"The wiki is a persistent, compounding artifact." — [source: llm-wiki-karpathy.md]

Three-Layer Architecture

| Layer | Description | Who owns it |
| --- | --- | --- |
| Raw sources | Immutable source documents (articles, papers, images, data). Source of truth. | Human |
| Wiki | LLM-generated Markdown: summaries, entity pages, concept pages, indexes. | LLM |
| Schema | Config document (e.g. CLAUDE.md, AGENTS.md) defining wiki structure, conventions, and LLM workflows. Co-evolved with the LLM over time. | Human + LLM |

The schema is what makes the LLM a disciplined wiki maintainer rather than a generic chatbot. It specifies directory layout, article templates, ingestion pipeline, index formats, and behavioral rules. You and the LLM refine it as you learn what works for your domain.

"Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase." — [source: llm-wiki-karpathy.md]

Core Operations

Ingest. Human adds a source to raw/. LLM reads it, discusses key takeaways, writes a summary page, updates relevant entity and concept pages, and appends to the log. At scale this could be batched and unsupervised, but single-source ingestion with human involvement produces better-emphasized, better-organized output.
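The bookkeeping half of an ingest is scriptable even when the synthesis is not. A minimal sketch, assuming a summaries/ directory, simple YAML frontmatter, and a stub_summary helper — all illustrative choices, not part of the canonical pattern:

```python
from datetime import date
from pathlib import Path

WIKI = Path(".")  # assumed repo root; the layout below is illustrative

def stub_summary(source: Path) -> Path:
    """Create a stub summary page for a newly added raw source and log the ingest."""
    page = WIKI / "summaries" / f"{source.stem}.md"
    page.parent.mkdir(parents=True, exist_ok=True)
    page.write_text(
        "---\n"
        f"source: raw/{source.name}\n"
        f"ingested: {date.today().isoformat()}\n"
        "tags: []\n"
        "---\n\n"
        f"# {source.stem}\n\n"
        "(summary to be written by the LLM during ingest)\n"
    )
    # Append to the chronological log in the grep-parseable log.md format.
    with (WIKI / "log.md").open("a") as log:
        log.write(f"## [{date.today().isoformat()}] ingest | {source.name}\n")
    return page
```

The LLM then fills in the stub and fans out to the affected entity and concept pages.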

Query. LLM reads the index to find relevant pages, reads them, and synthesizes an answer. Answers can be Markdown pages, comparison tables, Marp slide decks, or matplotlib charts. Crucially: good answers should be filed back into the wiki as new pages. Explorations compound just like ingested sources do.

Lint. Periodic health checks: contradictions between pages, stale claims, orphan pages, missing cross-references, important concepts lacking their own article. The LLM suggests questions to investigate and new sources to find.
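The structural checks (orphans, index coverage) are mechanical enough to script; contradiction and staleness detection still need the LLM itself. A minimal sketch, assuming Obsidian-style [[wikilinks]] and the _index.md convention described below:

```python
import re
from pathlib import Path

LINK = re.compile(r"\[\[([^\]|#]+)")  # Obsidian-style [[wikilink]] targets

def lint(wiki: Path) -> None:
    """Flag orphan pages (linked from nowhere) and pages missing from _index.md."""
    pages = {p.stem: p for p in wiki.rglob("*.md")
             if "raw" not in p.parts and p.name not in ("_index.md", "log.md")}
    linked: set[str] = set()
    for p in pages.values():
        linked.update(m.strip() for m in LINK.findall(p.read_text()))
    index_text = (wiki / "_index.md").read_text()
    for name in sorted(pages):
        if name not in linked:
            print(f"orphan: {name}")
        if name not in index_text:
            print(f"missing from index: {name}")
```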

Why No RAG at Moderate Scale

At ~100 articles / ~400K words, no vector database is needed:

  • The LLM maintains index files and summaries for all documents
  • Given a query, the LLM reads the index, identifies relevant articles, reads them, and answers
  • The structure itself is navigable without embeddings

RAG becomes relevant at larger scale; the transition point is not well-defined. A simple two-stage search tool (keyword index scan → full-text fallback) bridges the gap before embedding-based retrieval is needed. — [source: tools/search.py in this repo]
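tools/search.py itself is not reproduced here; the following is a sketch of the same two-stage idea, assuming index entries name their pages with [[wikilinks]] (an illustrative convention, not a spec):

```python
import re
from pathlib import Path

LINK = re.compile(r"\[\[([^\]|#]+)")  # first [[wikilink]] on an index line names the page

def search(wiki: Path, query: str) -> list[Path]:
    """Stage 1: keyword scan over _index.md lines. Stage 2: full-text fallback."""
    terms = [t.lower() for t in re.findall(r"\w+", query)]
    hits: list[Path] = []
    for line in (wiki / "_index.md").read_text().splitlines():
        if any(t in line.lower() for t in terms):
            if (m := LINK.search(line)):
                page = wiki / f"{m.group(1).strip()}.md"
                if page.exists():
                    hits.append(page)
    if not hits:  # index miss: fall back to scanning every page's full text
        hits = [p for p in wiki.rglob("*.md")
                if any(t in p.read_text().lower() for t in terms)]
    return hits
```

The index scan is cheap and usually sufficient; the full-text pass only runs when the index vocabulary misses the query's wording.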

Indexing and Logging

Two navigation files serve distinct purposes:

_index.md (or index.md) — content-oriented catalog. Every page listed with a one-line summary, organized by category. LLM reads this first on every query to triage. Works well at moderate scale without search infrastructure.

log.md — chronological append-only record of operations (ingests, queries, lints). Grep-parseable format: ## [YYYY-MM-DD] {operation} | {description}. Gives the LLM a timeline of what has changed recently and helps orient new sessions.
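Because the entry format is fixed, a new session can orient itself with a few lines of parsing. A sketch, assuming only the format quoted above:

```python
import re

# One entry per line: ## [YYYY-MM-DD] {operation} | {description}
ENTRY = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] (\S+) \| (.+)$")

def recent(log_text: str, n: int = 10) -> list[tuple[str, str, str]]:
    """Return the last n (date, operation, description) tuples from log.md."""
    return [m.groups() for line in log_text.splitlines()
            if (m := ENTRY.match(line))][-n:]
```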

Use Cases

The pattern is general — any domain where knowledge accumulates over time:

  • Personal research — going deep on a topic over weeks/months; wiki builds an evolving thesis
  • Reading a book — file each chapter, build pages for characters/themes/plot threads; end up with a fan-wiki-style companion (cf. Tolkien Gateway)
  • Business/team — internal wiki fed by Slack threads, meeting transcripts, customer calls; LLM does the maintenance no one wants to do
  • Personal — goals, health, psychology; journal entries and articles build a structured self-picture over time
  • Competitive analysis, due diligence, trip planning, course notes, hobby deep-dives

Tooling

Obsidian is the recommended frontend:

  • Graph view shows the wiki's shape (hubs, orphans)
  • The Marp plugin renders slide output
  • The Dataview plugin queries YAML frontmatter (tags, dates, source counts) as dynamic tables
  • The Web Clipper extension converts web articles to Markdown for raw/

Image handling: In Obsidian Settings → Files and links, set attachment folder to raw/assets/. Bind "Download attachments for current file" to a hotkey (e.g. Ctrl+Shift+D) to localize all images after clipping. Note: LLMs can't process markdown with inline images in one pass — read text first, view images separately.
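One way to implement that two-pass read is to strip image references from the page text and hand the collected paths to the LLM separately. A sketch, covering both standard Markdown images and Obsidian-style embeds:

```python
import re
from pathlib import Path

# Standard Markdown images ![alt](path) and Obsidian embeds ![[path]]
IMAGE = re.compile(r"!\[[^\]]*\]\(([^)]+)\)|!\[\[([^\]]+)\]\]")

def split_text_and_images(page: Path) -> tuple[str, list[str]]:
    """Return page text with images stripped, plus image paths for a second pass."""
    text = page.read_text()
    images = [a or b for a, b in IMAGE.findall(text)]
    return IMAGE.sub("", text), images
```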

Search tools: At small scale, the index file suffices. As the wiki grows, options include:

  • Custom two-stage script (keyword index scan + full-text fallback) — zero deps, fast, good to ~200 articles
  • qmd (github.com/tobi/qmd) — on-device hybrid BM25/vector search with LLM reranking; Node.js/Bun, GGUF models; CLI + MCP server; recommended at 100+ articles

Historical Antecedent: The Memex

Karpathy cites Vannevar Bush's Memex (1945) as the closest historical precursor. The Memex was a theoretical personal knowledge machine with "associative trails" linking documents — private, curated, with the connections as valuable as the documents themselves. Bush's vision was closer to the LLM wiki pattern than to what the web became. The part he couldn't solve was who does the maintenance. The LLM handles that.

See entities/vannevar-bush.

Key Claims & Data Points

  • At ~100 articles / ~400K words, index-based LLM navigation works without RAG — [source: karpathy_thread]
  • A single source ingestion may touch 10–15 wiki pages — [source: llm-wiki-karpathy.md]
  • "The wiki is a persistent, compounding artifact" — Karpathy — [source: llm-wiki-karpathy.md]
  • "Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase" — Karpathy — [source: llm-wiki-karpathy.md]
  • LLMs don't get bored, don't forget to update a cross-reference, can touch 15 files in one pass — [source: llm-wiki-karpathy.md]
  • Lex Fridman independently uses the same architecture for podcast research — [source: karpathy_thread]
  • "I think there is room here for an incredible new product instead of a hacky collection of scripts" — Karpathy — [source: karpathy_thread]

Open Questions

  • At what scale (articles, words) does the index-based approach break down and require RAG or vector search? (raised by: concepts/llm-knowledge-base, 2026-04-03)
  • What does synthetic data generation + finetuning look like in practice for personal knowledge bases? (raised by: concepts/llm-knowledge-base, 2026-04-03)
  • What product could formalize this "hacky collection of scripts" into a polished tool? (raised by: concepts/llm-knowledge-base, 2026-04-03)
  • How does the ephemeral wiki pattern (spawn, lint, report, discard) differ from persistent wiki maintenance — when is each appropriate? (raised by: concepts/llm-knowledge-base, 2026-04-03)
  • qmd: what are exact system requirements and is it actively maintained? At what article count does it become worth the setup cost over a custom script? (raised by: concepts/llm-knowledge-base, 2026-04-08)

Sources

  • Thread by @karpathy — Karpathy describes LLM knowledge base workflow, Apr 2026
  • LLM Wiki (GitHub Gist) — Karpathy's canonical pattern document; three-layer architecture, operations, tooling, Memex analogy; Apr 4, 2026