A deterministic regex scanner over a curated word list — and why that beats LLM semantic categorization for controlled-vocabulary problems like medical coding.
A few hand-written if rules suppress obvious false positives. Every decision is deterministic and auditable.
756 entries in 23 categories. Each entry is just { w: "bestseller", s: "HIGH", a: "Popular, Acclaimed" } — the word, a severity, and a suggested replacement. Pure data, no code.
Every word becomes /\bword\b/i — word-boundary-aware, case-insensitive. The user's word is escaped first so punctuation like #1 or .com is matched literally. Entries are sorted longest-first so "Amazon bestseller" matches before "bestseller".
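The compile step can be sketched as follows. This is a minimal illustration, not the scanner's actual code: the `BannedWord` shape follows the `{ w, s, a }` rows described above, while `CompiledEntry`, `compileEntries`, and the body of `escapeRegex` are assumptions introduced here.

```typescript
// One word-list row as described above: word, severity, suggested alternative.
interface BannedWord {
  w: string;
  s: "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
  a: string;
}

// Hypothetical compiled form; these field names are assumptions.
interface CompiledEntry {
  word: string;
  severity: BannedWord["s"];
  alternative: string;
  pattern: RegExp;
}

// Escape regex metacharacters so "#1" or ".com" are treated as literal text.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function compileEntries(rows: BannedWord[]): CompiledEntry[] {
  return rows
    .map((row) => ({
      word: row.w,
      severity: row.s,
      alternative: row.a,
      // Word-boundary-aware, case-insensitive pattern for this entry.
      pattern: new RegExp("\\b" + escapeRegex(row.w) + "\\b", "i"),
    }))
    // Longest phrases first, so multi-word phrases are tried before their parts.
    .sort((x, y) => y.word.length - x.word.length);
}

const compiled = compileEntries([
  { w: "bestseller", s: "HIGH", a: "Popular, Acclaimed" },
  { w: "Amazon bestseller", s: "HIGH", a: "Popular" },
]);
// compiled[0].word is "Amazon bestseller"
```

One caveat worth checking in any implementation of this pattern: `\b` only matches between a word character and a non-word character, so `\b#1\b` does not match "#1" at the start of a field or after a space. Entries that begin or end with punctuation may need a different boundary test; how the real scanner handles this is not shown here.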
Counts how many times each word appears across all fields combined. Used only for keyword-stuffing detection (flag a generic word like "notebook" only if it shows up 3+ times).
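A minimal sketch of that counting pass, assuming fields are plain `{ name, text }` objects. The names `Field`, `countAcrossFields`, and `STUFFING_THRESHOLD` are hypothetical; only the 3+ threshold comes from the description above.

```typescript
interface Field {
  name: string;
  text: string;
}

// Escape regex metacharacters so the word is matched literally.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count occurrences of one word across all fields combined.
function countAcrossFields(word: string, fields: Field[]): number {
  // The "g" flag makes String.prototype.match return every occurrence.
  const globalPattern = new RegExp("\\b" + escapeRegex(word) + "\\b", "gi");
  let total = 0;
  for (const field of fields) {
    total += field.text.match(globalPattern)?.length ?? 0;
  }
  return total;
}

const STUFFING_THRESHOLD = 3; // the "3+ times" rule

const fields: Field[] = [
  { name: "title", text: "Notebook for Work: A Lined Notebook" },
  { name: "description", text: "This notebook has 120 pages." },
];

const total = countAcrossFields("notebook", fields); // 2 in the title + 1 in the description
const isStuffed = total >= STUFFING_THRESHOLD;
```

Counting across all fields, rather than per field, means a word repeated once each in the title, subtitle, and description still reaches the threshold.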
Walks each field, runs every regex, and records hits. Overlap detection skips a short match already inside a longer phrase. Then deterministic suppression lists drop known false positives (see below).
Separate regex/length rules: title+subtitle > 200 chars, description > 2000 chars, disallowed HTML tags, and "hand-drawn / hand-illustrated" origin claims that contradict AI-generated disclosure.
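These structural rules might look like the sketch below. The two length limits come from the description above; everything else is an assumption made for illustration, including the `Listing` shape, the `aiGenerated` flag, the issue labels, and in particular `ALLOWED_TAGS`, which is a placeholder and not KDP's actual tag list.

```typescript
interface Listing {
  title: string;
  subtitle: string;
  description: string;
  aiGenerated: boolean; // hypothetical flag for the AI-generated disclosure
}

const MAX_TITLE_SUBTITLE = 200;
const MAX_DESCRIPTION = 2000;
// Placeholder allow-list for illustration only; not KDP's real list.
const ALLOWED_TAGS = new Set(["b", "i", "em", "strong", "p", "br", "ul", "ol", "li"]);
const HAND_MADE_CLAIM = /\bhand[- ](?:drawn|illustrated)\b/i;

function structuralIssues(listing: Listing): string[] {
  const issues: string[] = [];

  // Simplifying assumption: title and subtitle lengths are summed with no separator.
  if (listing.title.length + listing.subtitle.length > MAX_TITLE_SUBTITLE) {
    issues.push("title-too-long");
  }
  if (listing.description.length > MAX_DESCRIPTION) {
    issues.push("description-too-long");
  }

  // Report each disallowed tag name once, whether it appears as <x> or </x>.
  const tagPattern = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi;
  const seen = new Set<string>();
  let m: RegExpExecArray | null;
  while ((m = tagPattern.exec(listing.description)) !== null) {
    const tag = (m[1] ?? "").toLowerCase();
    if (!ALLOWED_TAGS.has(tag) && !seen.has(tag)) {
      seen.add(tag);
      issues.push("disallowed-tag:" + tag);
    }
  }

  // A hand-made origin claim contradicts an AI-generated disclosure.
  const allText = [listing.title, listing.subtitle, listing.description].join(" ");
  if (listing.aiGenerated && HAND_MADE_CLAIM.test(allText)) {
    issues.push("origin-claim-contradiction");
  }
  return issues;
}

const issues = structuralIssues({
  title: "Forest Friends Coloring Book",
  subtitle: "Hand-drawn animals for all ages",
  description: "<p>Relaxing pages.</p><div>Bonus art</div>",
  aiGenerated: true,
});
// issues: ["disallowed-tag:div", "origin-claim-contradiction"]
```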
Violations grouped by severity (CRITICAL / HIGH / MEDIUM / LOW), counts tallied, isClean flag set. Rendered on demand — the scan is never automatic since not every book targets KDP.
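The grouping step can be sketched like this. The four severity levels and the `isClean` flag come from the description above; the `Violation` and `Report` shapes and the `buildReport` name are assumptions.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface Violation {
  word: string;
  severity: Severity;
  field: string;
}

interface Report {
  bySeverity: Record<Severity, Violation[]>;
  counts: Record<Severity, number>;
  isClean: boolean;
}

function buildReport(violations: Violation[]): Report {
  // Start with an empty bucket per severity so every level is always present.
  const bySeverity: Record<Severity, Violation[]> = {
    CRITICAL: [],
    HIGH: [],
    MEDIUM: [],
    LOW: [],
  };
  for (const v of violations) {
    bySeverity[v.severity].push(v);
  }
  const counts: Record<Severity, number> = {
    CRITICAL: bySeverity.CRITICAL.length,
    HIGH: bySeverity.HIGH.length,
    MEDIUM: bySeverity.MEDIUM.length,
    LOW: bySeverity.LOW.length,
  };
  return { bySeverity, counts, isClean: violations.length === 0 };
}

const report = buildReport([
  { word: "bestseller", severity: "HIGH", field: "title" },
  { word: "free", severity: "HIGH", field: "description" },
  { word: "notebook", severity: "LOW", field: "title" },
]);
// report.counts: { CRITICAL: 0, HIGH: 2, MEDIUM: 0, LOW: 1 }, report.isClean: false
```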
This is the core. Notice there is no model call anywhere — just RegExp, .exec(), and .match().
    // 1. Compile each banned word into a word-boundary regex
    pattern: new RegExp("\\b" + escapeRegex(entry.w) + "\\b", "i")

    // 2. Longest phrases first, so "Amazon bestseller" wins over "bestseller"
    entries.sort((a, b) => b.word.length - a.word.length)

    // 3. Pass 1: global counts for stuffing detection
    const matches = field.text.match(new RegExp(entry.pattern.source, "gi"))

    // 4. Pass 2: does this field contain the word?
    const match = entry.pattern.exec(field.text)
    if (!match) continue; // no hit, move on

    // skip if inside an already-matched longer phrase
    const overlaps = matchedRanges.some(([s, e]) => start >= s && end <= e)
    if (overlaps) continue;
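The excerpt above omits its surrounding loops and the definitions of `start`, `end`, and `matchedRanges`. One way to assemble the pieces into something runnable is sketched below; the `Entry`, `Field`, and `Hit` types, the `scanFields` wrapper, and the choice to track ranges per field are assumptions, not the scanner's confirmed structure.

```typescript
interface Entry {
  word: string;
  pattern: RegExp;
}
interface Field {
  name: string;
  text: string;
}
interface Hit {
  field: string;
  word: string;
  start: number;
  end: number;
}

function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function scanFields(words: string[], fields: Field[]): Hit[] {
  // Compile, then sort longest-first so phrases are tried before their parts.
  const entries: Entry[] = words
    .map((word) => ({
      word,
      pattern: new RegExp("\\b" + escapeRegex(word) + "\\b", "i"),
    }))
    .sort((a, b) => b.word.length - a.word.length);

  const hits: Hit[] = [];
  for (const field of fields) {
    // Ranges already claimed by a match in this field (assumed per-field).
    const matchedRanges: Array<[number, number]> = [];
    for (const entry of entries) {
      const match = entry.pattern.exec(field.text);
      if (!match) continue; // no hit, move on

      const start = match.index;
      const end = start + match[0].length;
      // Skip a short match that sits inside an already-matched longer phrase.
      const overlaps = matchedRanges.some(([s, e]) => start >= s && end <= e);
      if (overlaps) continue;

      matchedRanges.push([start, end]);
      hits.push({ field: field.name, word: entry.word, start, end });
    }
  }
  return hits;
}

const hits = scanFields(["bestseller", "Amazon bestseller"], [
  { name: "title", text: "An Amazon Bestseller Coloring Book" },
  { name: "subtitle", text: "The bestseller for kids" },
]);
// hits: "Amazon bestseller" in the title (3..20), "bestseller" in the subtitle (4..14)
```

Because `exec` on a non-global pattern returns only the first match, this sketch records at most one hit per word per field, which matches the "does this field contain the word?" framing of pass 2.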
Because regex has no idea what a word means, the false positives are handled with explicit allow-lists — still 100% deterministic:
    // "child", "kids", "young girl" are fine in a children's coloring book
    if (isChildrenBook && CHILDREN_SAFE_WORDS.has(wordLower)) continue;

    // "treat", "cure", "heal" are everyday words in a story description
    if (field.isDescription && NARRATIVE_SAFE_MEDICAL.has(wordLower)) continue;

    // "book", "gift", "new" only count as stuffing if repeated 3+ times
    if (entry.category === STUFFING_CATEGORY && totalCount < 3) continue;
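Pulled out of the scan loop, the same three rules can be written as a single predicate, which makes them easy to unit test. This is a sketch under assumptions: the set contents are only the examples named in the comments above, the `STUFFING_CATEGORY` value is invented, and `isSuppressed`, `Hit`, and `FieldContext` are hypothetical names.

```typescript
interface Hit {
  word: string;
  category: string;
}
interface FieldContext {
  isDescription: boolean;
}

// Illustrative subsets only; the full lists are not shown in this write-up.
const CHILDREN_SAFE_WORDS = new Set(["child", "kids", "young girl"]);
const NARRATIVE_SAFE_MEDICAL = new Set(["treat", "cure", "heal"]);
const STUFFING_CATEGORY = "keyword_stuffing"; // assumed category name

// Returns true when a regex hit is a known false positive and should be dropped.
function isSuppressed(
  hit: Hit,
  field: FieldContext,
  isChildrenBook: boolean,
  totalCount: number,
): boolean {
  const wordLower = hit.word.toLowerCase();
  if (isChildrenBook && CHILDREN_SAFE_WORDS.has(wordLower)) return true;
  if (field.isDescription && NARRATIVE_SAFE_MEDICAL.has(wordLower)) return true;
  if (hit.category === STUFFING_CATEGORY && totalCount < 3) return true;
  return false;
}

// "Cure" in a story description is suppressed; the same word in a title is not.
const inDescription = isSuppressed({ word: "Cure", category: "medical" }, { isDescription: true }, false, 1);
const inTitle = isSuppressed({ word: "Cure", category: "medical" }, { isDescription: false }, false, 1);
```

Each rule is a plain set lookup or comparison, so the same hit and context always produce the same decision.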
The KDP checker is a clean illustration of a controlled-vocabulary classification problem — and your medical coding case (mapping text to ICD-10 / CPT / SNOMED codes from a known list) is the same shape. When the set of valid outputs is finite and defined in advance, regex against that list usually beats letting an LLM categorize semantically.
| | Regex + word list (this checker) | LLM semantic categorization |
|---|---|---|
| Determinism | Same input → same output, every time. Auditable. | Probabilistic; can drift between runs or model versions. |
| Explainability | "Matched \bcure\b in the title": you can point at the exact rule. | "The model thought so." Hard to defend to an auditor or regulator. |
| Cost & latency | Microseconds, free, runs in the browser. | API call per record; cost and latency scale with volume. |
| Updating the list | Edit a JSON row. No retraining, no prompt tuning. | Re-prompt or fine-tune; behavior can shift elsewhere. |
| Compliance/legal | Versioned, diffable, reproducible — what regulated domains want. | Black box; reproducibility is a real problem. |
| Handles synonyms / paraphrase | No — you must enumerate variants ("heal", "heals", "healing"). | Yes — generalizes to unseen phrasings. |
| Context & ambiguity | Blind to meaning; needs hand-coded suppression rules. | Understands "afternoon treat" ≠ medical "treat". |
The pragmatic takeaway: For a known code set with controlled terminology — medical coding, KDP banned words, profanity filters, regulatory keywords — a regex/dictionary layer should be the first pass. It's deterministic, cheap, and explainable. Reserve the LLM for the genuinely ambiguous residue: free-text notes where the right code depends on meaning a lookup can't capture. The KDP checker shows the hybrid in miniature: regex does the matching, and a short list of hand-written rules — not a model — resolves the predictable false positives.
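Applied to the medical coding case, that first pass might be sketched as below. Treat it as an illustration of the routing idea only: the `TERMS` table is a tiny hypothetical dictionary (E11.9 and I10 are real ICD-10-CM codes, but mapping bare phrases to them this way is a simplification), and `firstPass` and `needsReview` are names invented here.

```typescript
interface CodeTerm {
  term: string;
  code: string;
}

// Hypothetical mini-dictionary for illustration; a real code set is far larger.
const TERMS: CodeTerm[] = [
  { term: "type 2 diabetes", code: "E11.9" },
  { term: "essential hypertension", code: "I10" },
  { term: "hypertension", code: "I10" },
];

function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Deterministic first pass: dictionary hits become codes, and anything
// with no hit is routed onward (to an LLM or a human coder) for review.
function firstPass(note: string): { codes: string[]; needsReview: boolean } {
  const codes: string[] = [];
  for (const { term, code } of TERMS) {
    const pattern = new RegExp("\\b" + escapeRegex(term) + "\\b", "i");
    if (pattern.test(note) && codes.indexOf(code) === -1) {
      codes.push(code);
    }
  }
  return { codes, needsReview: codes.length === 0 };
}

const matched = firstPass("Hx of essential hypertension and Type 2 diabetes.");
// matched.codes: ["E11.9", "I10"], matched.needsReview: false

const residue = firstPass("Patient reports trouble with blood sugar.");
// residue.codes: [], residue.needsReview: true
```

A lookup like this is still blind to context such as negation ("no history of hypertension"), which is exactly the kind of case that calls for suppression rules or the semantic second stage.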
Source: src/lib/kdpCompliance.ts (scanner) and src/data/kdp_banned_words.json (word list, v1.0.0, 756 entries / 23 categories). The scanner is invoked on demand from KdpComplianceDialog.tsx. Note that Amazon publishes no official exhaustive list; the JSON is compiled from KDP guidelines plus community documentation.