One of the most widely replicated findings in cognitive science is also one of the least intuitive: the best way to make information stick is not to study it more, but to retrieve it more. Testing yourself is more effective than re-reading. Answering a question is more effective than reviewing the answer. This is the testing effect, and it has been replicated across hundreds of studies over more than a century.
The standard AI coding tool workflow is almost perfectly designed to prevent it. You see a diff. You read it — or you don't. You press accept. Your brain never has to retrieve anything. You move to the next task. The concept the agent just used on your behalf goes into the bucket of "things I've seen" rather than "things I can reproduce."
At aiworklab, we use a spaced repetition system to put retrieval back into the loop. Here's how it works and why we made the choices we made.
The forgetting curve and what defeats it
Hermann Ebbinghaus established in the 1880s that memories decay exponentially over time — rapidly at first, then more slowly. Without reinforcement, most of what you learn is gone within a week. But the curve can be reset: reviewing material at the right moment — just before you'd forget it — produces a stronger memory trace and extends the interval before the next review is needed.
Spaced repetition formalises this. An illustrative schedule: review on day 1, then day 3, day 8, day 21, and day 60. Each successful retrieval extends the next interval; each failed retrieval shortens it. Over time the intervals grow and the review time any given concept needs approaches zero, while the concept stays accessible.
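To make the expanding-interval idea concrete, here is a minimal sketch in Python. It is not FSRS or SM-2: the growth and shrink factors and the `Card` type are made-up illustrations, chosen only so the intervals grow at roughly the pace of the example schedule above.

```python
from dataclasses import dataclass

# Illustrative constants, not taken from any real scheduler.
GROWTH_ON_SUCCESS = 2.7   # multiply the interval after a successful retrieval
SHRINK_ON_FAILURE = 0.5   # halve the interval after a failed retrieval
MIN_INTERVAL_DAYS = 1.0   # never schedule sooner than one day out


@dataclass
class Card:
    concept: str
    interval_days: float = 1.0


def next_interval(card: Card, recalled: bool) -> float:
    """Update the card's interval after one review and return it (in days)."""
    factor = GROWTH_ON_SUCCESS if recalled else SHRINK_ON_FAILURE
    card.interval_days = max(MIN_INTERVAL_DAYS, card.interval_days * factor)
    return card.interval_days


card = Card("async cancellation")
# Four successful reviews in a row: each interval is longer than the last.
print([round(next_interval(card, recalled=True), 1) for _ in range(4)])
# One failure pulls the next review closer again.
print(round(next_interval(card, recalled=False), 1))
```

A fixed multiplier is the crudest possible version of the idea. Real schedulers adapt the multiplier per item, which is where the choice of algorithm starts to matter.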
For flashcard systems like Anki, this works well for facts: vocabulary, dates, formulas. For programming concepts, it's less obvious how to apply it. You can't write "explain async cancellation" on a flashcard and expect a useful retrieval exercise. The abstraction is too thin; the answer is too context-free to build durable understanding.
Our solution: code-grounded retrieval
The key insight we built the system around is this: the most effective retrieval exercise for a programming concept is one grounded in code you wrote. Not a synthetic problem, not documentation, not a LeetCode equivalent. Code from your repository, in the context where you encountered the concept, with real constraints and real decisions.
When the scheduler determines it's time to review "async cancellation," the prompt it generates doesn't come from a concept database. It comes from a real diff — one you committed three weeks ago that touched an AbortController — and asks a specific question about it: "In the request handler you wrote on March 4th, the abort signal was passed to two nested fetches. If the first fetch completes before the signal fires, what happens to the second one, and why?"
That's a retrievable question. It's grounded in your code. It's testing genuine understanding of the concept in a context you actually know. And answering it reinforces the same memory trace that "touching" the concept in your real work started building.
Why FSRS instead of SM-2
Most spaced repetition software until recently used SM-2, an algorithm developed by Piotr Wozniak in the 1980s. SM-2 is good. Anki used it for years and millions of people learned languages and medicine with it.
FSRS (Free Spaced Repetition Scheduler) is the current state of the art. Its advantages over SM-2 are well documented, but the key one for our use case is better calibration on small sample sizes. SM-2 needs a lot of historical data per concept to schedule accurately; FSRS performs well from the first few reviews. Since any given user will have encountered most programming concepts only a small number of times, this matters a lot.
FSRS also models "difficulty" as a per-concept property that updates with every review, rather than a fixed parameter. A concept that the user keeps failing goes up in difficulty rating, so its intervals grow more slowly and reviews come more often. One they consistently ace gets pushed further out sooner. The model is more responsive to the actual evidence.
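A rough sketch of the two ideas at work here, in Python. The forgetting-curve constants are the ones published for FSRS-4.5 as far as we know (other FSRS versions use different values), so treat them as an assumption. The difficulty update is deliberately simplified: real FSRS uses fitted weights and mean reversion, and `DIFFICULTY_STEP` is a made-up number that only shows the direction of the change.

```python
# Assumed FSRS-4.5 forgetting-curve constants; other versions differ.
DECAY = -0.5
FACTOR = 19 / 81  # makes predicted recall 0.9 when elapsed time equals stability


def retrievability(elapsed_days: float, stability: float) -> float:
    """Predicted probability of recall after elapsed_days."""
    return (1 + FACTOR * elapsed_days / stability) ** DECAY


def interval_for(stability: float, desired_retention: float = 0.9) -> float:
    """Days until predicted recall falls to desired_retention."""
    return stability / FACTOR * (desired_retention ** (1 / DECAY) - 1)


DIFFICULTY_STEP = 0.8  # illustrative only; real FSRS uses a fitted weight


def update_difficulty(difficulty: float, rating: int) -> float:
    """Simplified update. rating: 1=Again, 2=Hard, 3=Good, 4=Easy.

    A failed review (rating 1) raises difficulty; an easy one lowers it.
    The result is clamped to the 1..10 range FSRS uses for difficulty.
    """
    updated = difficulty - DIFFICULTY_STEP * (rating - 3)
    return min(10.0, max(1.0, updated))


print(round(retrievability(10, 10), 3))   # recall when elapsed == stability
print(round(interval_for(10, 0.8), 1))    # lower target retention, longer wait
print(update_difficulty(5.0, rating=1))   # difficulty after a failed review
```

Under these constants, stability can be read as "the interval at which predicted recall is 90%", which is why asking for 90% retention returns the stability itself.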
Finally, FSRS is open-source, has an active research community, and is the algorithm Anki and Mochi have both moved to. We wanted to build on something with momentum and peer review, not something proprietary.
How concepts get extracted from diffs
The pipeline looks like this:
- Your agent produces a diff and the session reports it to the teaching kernel.
- The kernel runs the diff through a tree-sitter query layer. Each language has a set of tagged queries that fire when known structural patterns appear: an AbortController instantiation, a JWT decode call, a prepared statement with parameterised input. These fire with high precision and low latency — they're deterministic rules, not inference.
- For patterns not covered by the rule layer — novel library usage, domain-specific idioms, complex interactions — the kernel sends the diff to an LLM with a structured extraction prompt. The LLM outputs a list of candidate concepts with confidence scores. Low-confidence candidates are logged but not scheduled for review until they're encountered again.
- The extracted concepts are matched against the user's existing skill graph. Known concepts at "demonstrated" or "mastered" state are passed through — no check triggered. New or "encountered" concepts trigger a comprehension check before merge.
- For concepts that make it into the skill graph, the FSRS scheduler assigns an initial stability value and schedules the first retrieval. When that review is due, the kernel retrieves the original diff, generates a grounded question, and queues it in the daily session.
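The routing in the steps above can be sketched roughly as follows. This is a toy under loud assumptions: plain regexes stand in for the tree-sitter query layer, the LLM pass is represented by a hand-written candidate list, and `RULES`, `CONFIDENCE_THRESHOLD`, `Candidate`, and `route` are hypothetical names rather than the kernel's API. It also ignores the "encountered again" rule for low-confidence candidates and simply logs them.

```python
import re
from dataclasses import dataclass

# Stand-in for the rule layer: regexes over added lines. The pipeline
# described above uses structural tree-sitter queries, not regexes.
RULES = {
    "async cancellation": re.compile(r"\bnew\s+AbortController\b"),
    "jwt decoding": re.compile(r"\bjwt\.decode\s*\("),
}

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off for LLM candidates
PASS_THROUGH_STATES = {"demonstrated", "mastered"}


@dataclass
class Candidate:
    concept: str
    confidence: float  # rule hits are treated as fully confident (1.0)


def rule_layer(diff: str) -> list[Candidate]:
    """Run every rule over the lines the diff adds."""
    added = [
        line[1:]
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    text = "\n".join(added)
    return [Candidate(name, 1.0) for name, rule in RULES.items() if rule.search(text)]


def route(candidates: list[Candidate], skill_graph: dict[str, str]) -> dict[str, list[str]]:
    """Decide per concept: comprehension check, pass through, or log only."""
    decisions: dict[str, list[str]] = {"check": [], "pass": [], "log_only": []}
    for candidate in candidates:
        if candidate.confidence < CONFIDENCE_THRESHOLD:
            decisions["log_only"].append(candidate.concept)
        elif skill_graph.get(candidate.concept) in PASS_THROUGH_STATES:
            decisions["pass"].append(candidate.concept)
        else:
            # New concepts and "encountered" ones both trigger a check.
            decisions["check"].append(candidate.concept)
    return decisions


diff = """\
--- a/handler.ts
+++ b/handler.ts
+const controller = new AbortController();
+const claims = jwt.decode(token);
"""
llm_candidates = [Candidate("retry backoff", 0.4)]  # pretend LLM output
graph = {"jwt decoding": "mastered", "async cancellation": "encountered"}
print(route(rule_layer(diff) + llm_candidates, graph))
```

Keeping the rule layer and the LLM layer behind the same `Candidate` shape means the routing logic does not need to know which layer found a concept, only how confident the detection was.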
The quality problem: retrieval prompts that aren't retrieval
The hardest problem in this system is prompt quality. It's easy to generate a question that looks like retrieval but doesn't actually test understanding. "What does pg_advisory_lock do?" is a bad retrieval question — it tests vocabulary, not comprehension. "Your migration used a session-level advisory lock. What happens to the lock if the database connection drops mid-migration?" is a good one — it requires reasoning about the mechanism in a failure scenario.
We've spent considerable time on the prompt generation pipeline. The current version uses a structured template with several mandatory elements: the specific code being referenced, a forced failure scenario or edge case, and a question that can only be answered correctly by someone who understands the mechanism, not just its name. We evaluate the questions with a second LLM pass that checks for surface-level answerability — questions that can be answered by keyword matching fail this check.
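As a rough illustration of what "mandatory elements" and "surface-level answerability" could mean in code, here is a small Python sketch. It is not our pipeline: the evaluation described above is a second LLM pass, and the regex below is only a crude stand-in that catches purely definitional questions. `RetrievalPrompt`, `looks_definitional`, and the example file path are hypothetical.

```python
import re
from dataclasses import dataclass

# Crude stand-in for the LLM evaluation pass: flag questions that open by
# asking for a definition. A real check would need to be far more nuanced.
DEFINITIONAL = re.compile(
    r"^\s*(what\s+(is|are)\b|what\s+does\s+\S+\s+(do|mean)\b|define\b)",
    re.IGNORECASE,
)


def looks_definitional(question: str) -> bool:
    return bool(DEFINITIONAL.match(question))


@dataclass
class RetrievalPrompt:
    code_reference: str  # the specific code being referenced
    scenario: str        # the forced failure scenario or edge case
    question: str        # what the user is actually asked

    def problems(self) -> list[str]:
        """Return a list of reasons to reject this prompt (empty if none)."""
        issues = [
            f"missing {name}"
            for name in ("code_reference", "scenario", "question")
            if not getattr(self, name).strip()
        ]
        if self.question.strip() and looks_definitional(self.question):
            issues.append("question is answerable from the name alone")
        return issues


bad = RetrievalPrompt("", "", "What does pg_advisory_lock do?")
good = RetrievalPrompt(
    code_reference="migrations/add_index.sql",  # hypothetical path
    scenario="database connection drops mid-migration",
    question=(
        "Your migration used a session-level advisory lock. What happens "
        "to the lock if the database connection drops mid-migration?"
    ),
)
print(bad.problems())
print(good.problems())
```

Rejecting a prompt outright when any element is missing matches the bias toward fewer, higher-quality questions.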
The quality bar for retrieval prompts is also asymmetric: false positives (questions that seem hard but are easy) are more damaging than false negatives (concepts that don't get a question). A bad question trains the scheduler to think a concept is understood when it isn't. We err toward fewer, higher-quality prompts.
What we're measuring to know it's working
The primary quality signal is concept retention rate: of the concepts a user has passed a comprehension check on, what fraction can they still demonstrate understanding of 30 days later? We measure this by running a separate retrieval session on a sample of previously demonstrated concepts.
We're targeting 75% retention at 30 days, which is roughly where well-calibrated Anki users perform on well-curated decks. Current private alpha numbers are around 71%, with significant variance by concept type (syntax concepts retain better than architectural concepts, which is expected).
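The metric itself is simple arithmetic, sketched below in Python. The sample data is invented purely to exercise the functions and is not alpha data; `retention_by_type` and `overall_retention` are hypothetical helper names.

```python
from collections import defaultdict

TARGET_RETENTION = 0.75  # the 30-day target stated above


def retention_by_type(results):
    """results: iterable of (concept_type, still_demonstrated) pairs."""
    totals = defaultdict(lambda: [0, 0])  # concept_type -> [retained, sampled]
    for concept_type, retained in results:
        totals[concept_type][0] += int(retained)
        totals[concept_type][1] += 1
    return {ctype: kept / sampled for ctype, (kept, sampled) in totals.items()}


def overall_retention(results):
    """Fraction of sampled concepts still demonstrated; 0.0 for no samples."""
    results = list(results)
    if not results:
        return 0.0
    return sum(int(retained) for _, retained in results) / len(results)


# Invented sample: four syntax concepts and two architectural ones.
sample = [
    ("syntax", True), ("syntax", True), ("syntax", True), ("syntax", False),
    ("architecture", True), ("architecture", False),
]
print(retention_by_type(sample))
print(overall_retention(sample) >= TARGET_RETENTION)
```

Reporting the per-type breakdown alongside the overall number matters because a healthy aggregate can hide weak retention on architectural concepts.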
The secondary signal is solo throughput trend, described in our earlier research piece. If FSRS is working, solo throughput should be rising over time even as assisted throughput stays flat or improves. That's the signal that the retrieval system is doing what it's supposed to do: building durable capability, not just surface familiarity.
What's next
The current pipeline handles TypeScript, Python, Go, Rust, and SQL well. Java, Swift, and C++ are in the rule-layer backlog. The LLM fallback handles them adequately but with lower precision. We're also working on cross-concept retrieval — sessions that ask about interactions between concepts ("you understand advisory locks and transaction isolation levels separately — here's a scenario where both matter") rather than each concept in isolation.
If you're interested in the technical details of the FSRS implementation or the tree-sitter query design, write to team@aiworklab.com. The teaching kernel will be open-source and we plan to publish the query set before public beta.