Building a concept tagger that doesn't lie

Every interesting thing aiworklab does sits downstream of one component: the concept tagger. It reads a diff and outputs a list of programming concepts the diff touches, each with a confidence score. The skill graph is built from its output. Explain-to-merge checks fire on its output. The FSRS scheduler plans reviews from its output. If the tagger is wrong, everything above it is wrong in a way the user can feel.

And the way the user feels it is specific: a mislabelled diff triggers a comprehension check on a concept the engineer has used for ten years. That's not a minor quality issue. That's the product calling a senior engineer a novice to their face. Two or three of those and the product is uninstalled, justifiably.

So we've spent a disproportionate share of our engineering time on this one component. This post describes its architecture and the design principle that governs it: calibration before accuracy.

The two-layer architecture

The tagger is a hybrid: a deterministic rule layer in front, an LLM fallback behind it.

Layer one: tree-sitter queries

The first layer is a set of tree-sitter queries per language. Tree-sitter parses the changed files into concrete syntax trees, and our queries fire when known structural patterns appear: an AbortController being instantiated and its signal threaded into a fetch, a jwt.verify call inside a try/catch that branches on TokenExpiredError, a parameterised query string passed to a prepared-statement API, a loop whose wait term contains an exponent over the iteration variable.
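To make "matching structure" concrete, here is a minimal sketch of the last pattern in that list: a loop whose wait term contains an exponent over the iteration variable. It is not one of our tree-sitter queries. It swaps in Python's standard-library ast module as a stand-in so the example runs with no dependencies, and the function names are hypothetical.

```python
import ast


def _is_sleep(func: ast.AST) -> bool:
    # Matches both `time.sleep(...)` and a bare `sleep(...)`.
    if isinstance(func, ast.Attribute):
        return func.attr == "sleep"
    return isinstance(func, ast.Name) and func.id == "sleep"


def _mentions(node: ast.AST, name: str) -> bool:
    # True if the variable `name` appears anywhere under `node`.
    return any(isinstance(n, ast.Name) and n.id == name for n in ast.walk(node))


def finds_exponential_backoff(source: str) -> bool:
    """Return True if a for-loop sleeps for a duration that raises
    something to a power involving the loop variable."""
    tree = ast.parse(source)
    for loop in ast.walk(tree):
        if not isinstance(loop, ast.For) or not isinstance(loop.target, ast.Name):
            continue
        loop_var = loop.target.id
        for node in ast.walk(loop):
            if not (isinstance(node, ast.Call) and _is_sleep(node.func)):
                continue
            for arg in node.args:
                for sub in ast.walk(arg):
                    if (
                        isinstance(sub, ast.BinOp)
                        and isinstance(sub.op, ast.Pow)
                        and _mentions(sub.right, loop_var)
                    ):
                        return True
    return False
```

The rule says nothing about variable names or comments. It fires on `time.sleep(0.5 * 2 ** attempt)` inside `for attempt in range(5)` and stays silent on a fixed `time.sleep(1.0)`, which is the sense in which a structural rule is precise but narrow.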

These rules have three properties we care about. They're fast: the whole layer runs in single-digit milliseconds on a typical diff, on-device, with no inference cost. They're deterministic: the same diff always produces the same tags, which makes the system debuggable. And they're precise: when a rule fires, it is almost always right, because it matched structure rather than vibes.

What they are not is complete. Rules can only recognise patterns someone wrote a query for. The current query set covers TypeScript, Python, Go, Rust, and SQL well; Java, Swift, and C++ are in the backlog. And even within a covered language, novel libraries and domain-specific idioms sail past the rules untouched.

Layer two: the LLM fallback

Anything the rule layer doesn't claim goes to an LLM with a structured extraction prompt. The model sees the diff, the file paths, and the rule layer's existing tags, and returns candidate concepts as structured output: a concept identifier from our taxonomy, a span in the diff, and a confidence score.
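As an illustration of what "structured output" buys, here is a sketch of parsing and validating the model's candidates. The JSON field names (concept_id, span, confidence), the line-range form of the span, and the example taxonomy identifiers are all assumptions made for this example, not our actual schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTag:
    concept_id: str    # must be an identifier from the taxonomy
    start_line: int    # span in the diff, as an inclusive line range (assumed form)
    end_line: int
    confidence: float  # model-stated confidence in [0, 1]


def parse_candidates(raw: str, taxonomy: set[str]) -> list[CandidateTag]:
    """Parse the model's JSON array and drop anything malformed:
    unknown concepts, out-of-range confidences, inverted spans."""
    tags = []
    for item in json.loads(raw):
        tag = CandidateTag(
            concept_id=item["concept_id"],
            start_line=int(item["span"][0]),
            end_line=int(item["span"][1]),
            confidence=float(item["confidence"]),
        )
        if tag.concept_id not in taxonomy:
            continue  # the model invented a concept outside the taxonomy
        if not 0.0 <= tag.confidence <= 1.0 or tag.start_line > tag.end_line:
            continue
        tags.append(tag)
    return tags
```

An empty array parses to an empty list, so "no concepts here" is a valid answer rather than an error.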

The fallback is what makes the system general, and it's also where all the risk lives. Language models are enthusiastic taggers. Ask one what concepts a diff touches and it will find seven, including two that are technically present and irrelevant and one that isn't there at all. Left unfiltered, that enthusiasm becomes a stream of spurious skill-graph entries and unjustified merge gates.

Why calibration matters more than accuracy

Here's the core design argument. Suppose tagger A is right 92% of the time, and when it's wrong, it's wrong with full confidence. Tagger B is right 88% of the time, but its confidence scores mean something: among tags it scores at 0.9, it's right about 90% of the time; among tags it scores at 0.6, about 60%.

Tagger B is far more useful, because a calibrated score is something the rest of the system can make decisions with. Above a high threshold, act: gate the merge, update the graph. In a middle band, soften: log the concept as "encountered" but don't interrupt anyone. Below the band, abstain: discard, and wait to see whether the pattern recurs. An uncalibrated 92% gives you nothing to threshold against; every decision is a coin flip weighted by an unknown amount.
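The three bands reduce to a few lines of code once the score can be trusted. A minimal sketch follows; the two threshold values are placeholders chosen for illustration, not the ones we ship.

```python
from enum import Enum


class Action(Enum):
    ACT = "act"          # gate the merge, update the graph
    SOFTEN = "soften"    # log as "encountered", interrupt no one
    ABSTAIN = "abstain"  # discard and wait for the pattern to recur


ACTION_THRESHOLD = 0.90  # hypothetical value
LOG_THRESHOLD = 0.60     # hypothetical value


def route(confidence: float) -> Action:
    """Map a calibrated confidence score to one of the three bands."""
    if confidence >= ACTION_THRESHOLD:
        return Action.ACT
    if confidence >= LOG_THRESHOLD:
        return Action.SOFTEN
    return Action.ABSTAIN
```

The function is only as good as its input: with an uncalibrated score, the same two constants would be cutting at arbitrary points.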

So our headline internal metric for the fallback isn't accuracy, it's expected calibration error (ECE): bucket the predictions by stated confidence, compare each bucket's stated confidence to its actual hit rate, and average the gaps, weighting each bucket by how many predictions it holds. We track ECE per language and per concept family, and a regression in ECE blocks a release even if raw accuracy improved.
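For readers who haven't computed ECE before, here is a sketch of the standard equal-width-bin version, with each bucket's gap weighted by its share of the predictions. The bin count of ten is a common default and an assumption here, not a statement about our harness.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: stated confidence per prediction, each in [0, 1]
    correct:     whether each prediction was actually right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        hit_rate = sum(1 for _, h in bucket if h) / len(bucket)
        # Weight each bucket's gap by the fraction of predictions it holds.
        ece += (len(bucket) / total) * abs(mean_conf - hit_rate)
    return ece
```

Ten tags scored at 0.9 with nine correct give an ECE of roughly zero; ten tags scored at 1.0 with five correct give 0.5.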

Asymmetric costs, asymmetric thresholds

The second principle: the two error types do not cost the same, so they should not be treated the same.

A false positive (tagging a concept that isn't really there, or isn't really novel to this user) produces an unjustified interruption. The cost lands immediately, on the user, as annoyance and lost trust, and trust doesn't refund easily.

A false negative (missing a concept that was there) produces a lost teaching moment. The cost is real but deferred and usually recoverable: programming concepts recur, and a concept the tagger missed today will almost certainly reappear in another diff within weeks, when it gets another chance to be caught.

Given that asymmetry, the thresholds are deliberately conservative. The action threshold (the score above which a tag can trigger an explain-to-merge check) is set high enough that the expected false-interruption rate stays under one per engineer-week. Everything in the band below it is logged silently. The system would rather under-teach for a week than over-interrupt for a day, because only one of those failure modes makes people quit.
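One way to turn an interruption budget into a threshold, sketched under a strong assumption: if scores are calibrated, a tag scored at c is a false positive with probability about 1 - c, so the expected number of false interruptions is the sum of 1 - c over every tag at or above the threshold. The function name, the candidate grid, and the simplification that ignores per-user novelty are all choices made for this example.

```python
def lowest_safe_threshold(confidences, engineer_weeks, budget=1.0, candidates=None):
    """Return the lowest candidate threshold whose expected false
    interruptions per engineer-week stay within `budget`.

    confidences:    calibrated scores for every tag in a sample period
    engineer_weeks: engineer-weeks of activity that the sample covers
    Returns None if no candidate meets the budget.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(50, 100)]  # 0.50 to 0.99
    for threshold in sorted(candidates):
        # Under calibration, each acted-on tag is wrong with probability 1 - c.
        expected_false = sum(1.0 - c for c in confidences if c >= threshold)
        if expected_false / engineer_weeks <= budget:
            return threshold
    return None
```

Because every term in the sum is non-negative, raising the threshold can only lower the expected rate, so the first candidate that fits the budget is also the least conservative one that does.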

Abstention is part of the same design. The extraction prompt explicitly permits the model to return nothing, and the eval suite includes diffs whose correct output is the empty set: formatting changes, dependency bumps, comment edits. A tagger that can't say "no concepts here" with a straight face will hallucinate pedagogy onto a lockfile update.

The eval harness

None of the above is enforceable without measurement, so the eval harness is a first-class part of the codebase rather than a script someone runs before launch.

The core asset is a labelled corpus of a few thousand real diffs (drawn from open-source repositories and our own work, never from user data) annotated by two engineers against the concept taxonomy, with disagreements adjudicated and the labelling guideline updated whenever adjudication reveals an ambiguity. Inter-annotator agreement is tracked; where humans can't agree on whether a concept is present, we don't expect the tagger to, and the taxonomy entry gets revised instead.
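The post doesn't depend on any particular agreement statistic. As one common choice, here is a sketch of Cohen's kappa for two annotators answering the same yes/no question ("is this concept present in this diff?") across a set of diffs. Using kappa, and treating each concept as a binary label, are assumptions for the example.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary labels on the same items.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own rate of saying "present".
    p_a = sum(bool(a) for a in labels_a) / n
    p_b = sum(bool(b) for b in labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both annotators gave one identical label throughout
    return (observed - expected) / (1 - expected)
```

A concept whose kappa sits near zero is one where the annotators are effectively guessing, which is the signal to revise the taxonomy entry rather than blame the tagger.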

Against that corpus, every candidate change to the rule set, the prompt, or the model runs the full suite and reports per-language precision and recall, ECE, abstention behaviour on the empty-set diffs, and a handful of named regression cases: specific past failures, each pinned in the suite so it can never silently return. Rule-layer changes additionally diff their firing patterns across the corpus, because a query edit that looks local can change behaviour on constructs its author never considered.
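Two of those reported numbers are simple enough to sketch. The following assumes each eval case is a pair of sets, the concepts the tagger predicted and the concepts the annotators agreed on, and pools counts across diffs. The function names and the convention of returning 1.0 when there is nothing to measure are choices made for this example.

```python
def precision_recall(cases):
    """Pooled precision and recall over (predicted, expected) pairs of sets."""
    tp = fp = fn = 0
    for predicted, expected in cases:
        tp += len(predicted & expected)  # concepts correctly tagged
        fp += len(predicted - expected)  # concepts tagged but not there
        fn += len(expected - predicted)  # concepts there but missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def empty_set_pass_rate(cases):
    """Fraction of empty-set diffs on which the tagger also returned nothing.

    Returns None when the cases contain no empty-set diffs.
    """
    empties = [predicted for predicted, expected in cases if not expected]
    if not empties:
        return None
    return sum(1 for predicted in empties if not predicted) / len(empties)
```

Reporting the empty-set rate separately matters because a lockfile bump contributes nothing to recall, so a tagger that hallucinates on it would otherwise show up only as a small dent in pooled precision.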

What we deliberately didn't build

Two roads not taken, for the record. We didn't fine-tune a custom tagging model, yet. Prompted frontier models with structured output, behind a strong rule layer and conservative thresholds, are good enough that fine-tuning would currently buy latency and cost improvements rather than quality, and it would cost us the ability to swap models freely, which matters for a bring-your-own-model product. We revisit this quarterly.

And we didn't build a learned end-to-end system that goes straight from diff to "should we interrupt this user." Collapsing tagging, novelty assessment, and interruption policy into one opaque decision would make every false interruption undebuggable. The pipeline is staged precisely so that when it's wrong, we can see which stage lied.

The tagger will be part of the open-source teaching kernel, query set included, before public beta. If you've built classification systems where the confidence score is the product, we'd genuinely like to compare notes: team@aiworklab.com.
