How the KDP Compliance Checker Works

A deterministic regex scanner over a curated word list — and why that beats LLM semantic categorization for controlled-vocabulary problems like medical coding.

Short answer: yes. The checker is primarily regex matching against a curated word list, and no LLM is involved in the classification. A JSON file holds 756 banned words and phrases, each compiled into a word-boundary regex. The text is scanned, matches are collected, and a handful of hand-coded if rules suppress obvious false positives. Every decision is deterministic and auditable.

The architecture in one picture

Word list as data (kdp_banned_words.json)

756 entries in 23 categories. Each entry is just { w: "bestseller", s: "HIGH", a: "Popular, Acclaimed" } — the word, a severity, and a suggested replacement. Pure data, no code.
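A minimal sketch of how such rows might be typed and flattened, assuming only the `w` / `s` / `a` keys shown above. The names `BannedWordEntry`, `WordList`, and `flattenWordList`, the grouping of rows under category keys, and the `"sales-claims"` category name are illustrative assumptions, not the real file's layout.

```typescript
// Severity levels used by the checker.
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

// One row of the word list: w = word or phrase, s = severity, a = suggested alternative.
interface BannedWordEntry {
  w: string;
  s: Severity;
  a: string;
}

// Assumption: rows are grouped under category names; the real file layout may differ.
type WordList = Record<string, BannedWordEntry[]>;

// Only the "bestseller" row comes from the article; the category name is made up.
const sampleList: WordList = {
  "sales-claims": [{ w: "bestseller", s: "HIGH", a: "Popular, Acclaimed" }],
};

// Flatten the categories into one array, tagging each row with its category.
function flattenWordList(list: WordList): Array<BannedWordEntry & { category: string }> {
  const rows: Array<BannedWordEntry & { category: string }> = [];
  for (const category of Object.keys(list)) {
    for (const entry of list[category] ?? []) {
      rows.push({ ...entry, category });
    }
  }
  return rows;
}

const flatRows = flattenWordList(sampleList);
```

Keeping the list as plain typed data means a new banned phrase is a one-row change that shows up cleanly in a diff.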

Compile each word to a regex (getEntries())

Every word becomes /\bword\b/i: word-boundary-aware and case-insensitive. Each entry is escaped first so punctuation like #1 or .com is matched literally. Entries are sorted longest-first so "Amazon bestseller" matches before "bestseller".
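The compile step could look roughly like the sketch below. `escapeRegex` follows the common metacharacter-escaping idiom; `RawEntry`, `CompiledEntry`, and `compileEntries` are hypothetical names, and the severities on the sample rows other than "bestseller" are placeholders.

```typescript
interface RawEntry {
  w: string;
  s: string;
}

interface CompiledEntry {
  word: string;
  severity: string;
  pattern: RegExp;
}

// Escape regex metacharacters so an entry such as ".com" is matched literally.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Compile each entry to /\bword\b/i and sort longest-first.
function compileEntries(raw: RawEntry[]): CompiledEntry[] {
  return raw
    .map((entry) => ({
      word: entry.w,
      severity: entry.s,
      pattern: new RegExp("\\b" + escapeRegex(entry.w) + "\\b", "i"),
    }))
    .sort((a, b) => b.word.length - a.word.length);
}

// Placeholder severities, for illustration only.
const compiled = compileEntries([
  { w: "bestseller", s: "HIGH" },
  { w: "Amazon bestseller", s: "HIGH" },
  { w: ".com", s: "MEDIUM" },
  { w: "#1", s: "HIGH" },
]);

const order = compiled.map((e) => e.word);
```

One caveat worth knowing: `\b` only asserts a boundary between a word character and a non-word character. With this pattern shape, ".com" matches inside "amazon.com", but "#1" does not match at the start of a string or after a space, because neither side of that position is a word character. Entries that begin or end with punctuation may therefore need a different anchor than `\b`.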

Pass 1 — count occurrences everywhere

Counts how many times each word appears across all fields combined. Used only for keyword-stuffing detection (flag a generic word like "notebook" only if it shows up 3+ times).
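A sketch of the counting pass, assuming a simple `Field` shape with `name` and `text`. `countOccurrences`, `STUFFING_THRESHOLD`, and the sample metadata are hypothetical; only the "3+ times" rule comes from the description above.

```typescript
interface Field {
  name: string;
  text: string;
}

function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count whole-word, case-insensitive occurrences of each word across all fields combined.
function countOccurrences(words: string[], fields: Field[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const word of words) {
    const globalPattern = new RegExp("\\b" + escapeRegex(word) + "\\b", "gi");
    let total = 0;
    for (const field of fields) {
      const hits = field.text.match(globalPattern); // null when there is no match
      total += hits ? hits.length : 0;
    }
    counts[word.toLowerCase()] = total;
  }
  return counts;
}

const STUFFING_THRESHOLD = 3; // "3+ times", per the article

const sampleFields: Field[] = [
  { name: "title", text: "Notebook for Kids" },
  { name: "subtitle", text: "A lined notebook" },
  { name: "description", text: "This notebook makes a great gift." },
];

const counts = countOccurrences(["notebook", "gift"], sampleFields);
const stuffed = Object.keys(counts).filter((w) => (counts[w] ?? 0) >= STUFFING_THRESHOLD);
```

Here "notebook" appears once in each of the three fields, so it reaches the threshold, while "gift" appears once and is left alone.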

Pass 2 — collect violations with context suppression

Walks each field, runs every regex, and records hits. Overlap detection skips a short match already inside a longer phrase. Then deterministic suppression lists drop known false positives (see below).
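The overlap check can be sketched as follows. `scanField` and `Hit` are hypothetical names, and the sketch assumes each hit's range is taken from `match.index` and the match length.

```typescript
interface Hit {
  word: string;
  field: string;
  start: number;
  end: number;
}

function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Scan one field, longest phrases first, skipping hits inside an earlier, longer hit.
function scanField(fieldName: string, text: string, words: string[]): Hit[] {
  const sorted = words.slice().sort((a, b) => b.length - a.length);
  const matchedRanges: Array<[number, number]> = [];
  const hits: Hit[] = [];

  for (const word of sorted) {
    const pattern = new RegExp("\\b" + escapeRegex(word) + "\\b", "i");
    const match = pattern.exec(text);
    if (!match) continue;

    const start = match.index;
    const end = start + match[0].length;

    // A shorter word fully contained in a recorded range is not reported again.
    const overlaps = matchedRanges.some(([s, e]) => start >= s && end <= e);
    if (overlaps) continue;

    matchedRanges.push([start, end]);
    hits.push({ word, field: fieldName, start, end });
  }
  return hits;
}

const titleHits = scanField("title", "An Amazon bestseller for kids", [
  "bestseller",
  "Amazon bestseller",
]);
```

For this title only "Amazon bestseller" is reported; the shorter "bestseller" falls inside its range. Because the pattern is non-global, this sketch inspects only the first occurrence of each word per field, so a second standalone occurrence later in the text would not be examined.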

Structural checks (also regex)

Separate regex/length rules: title+subtitle > 200 chars, description > 2000 chars, disallowed HTML tags, and "hand-drawn / hand-illustrated" origin claims that contradict AI-generated disclosure.
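These structural rules might be sketched like this. The 200 and 2000 character limits and the origin-claim rule come from the description above; `BookMetadata`, `structuralChecks`, the problem codes, and especially the `ALLOWED_TAGS` list are assumptions made for illustration.

```typescript
interface BookMetadata {
  title: string;
  subtitle: string;
  description: string;
  aiGenerated: boolean;
}

// Hypothetical allow-list; the checker's actual tag list is not shown in the article.
const ALLOWED_TAGS = new Set(["b", "i", "em", "strong", "u", "br", "p", "ul", "ol", "li"]);

function structuralChecks(meta: BookMetadata): string[] {
  const problems: string[] = [];

  if (meta.title.length + meta.subtitle.length > 200) problems.push("TITLE_SUBTITLE_TOO_LONG");
  if (meta.description.length > 2000) problems.push("DESCRIPTION_TOO_LONG");

  // Collect every tag name in the description and flag those outside the allow-list.
  const tagPattern = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi;
  let tag: RegExpExecArray | null;
  while ((tag = tagPattern.exec(meta.description)) !== null) {
    const name = (tag[1] ?? "").toLowerCase();
    if (!ALLOWED_TAGS.has(name)) problems.push("DISALLOWED_HTML:" + name);
  }

  // "hand-drawn" or "hand illustrated" contradicts an AI-generated disclosure.
  const originClaim = /\bhand[- ](drawn|illustrated)\b/i;
  const allText = [meta.title, meta.subtitle, meta.description].join(" ");
  if (meta.aiGenerated && originClaim.test(allText)) problems.push("ORIGIN_CLAIM_CONFLICT");

  return problems;
}

const cleanResult = structuralChecks({
  title: "Ocean Animals Coloring Book",
  subtitle: "40 relaxing pages",
  description: "<p>Simple designs for <b>all ages</b>.</p>",
  aiGenerated: true,
});
```

The sample metadata passes every rule, so `cleanResult` is an empty array.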

Aggregate & return

Violations grouped by severity (CRITICAL / HIGH / MEDIUM / LOW), counts tallied, isClean flag set. Rendered on demand — the scan is never automatic since not every book targets KDP.
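The aggregation step is simple enough to sketch in a few lines. `Violation`, `ScanResult`, and `aggregate` are hypothetical names, and treating `isClean` as "zero violations" is an assumption about the real flag.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface Violation {
  word: string;
  severity: Severity;
  field: string;
}

interface ScanResult {
  bySeverity: Record<Severity, Violation[]>;
  counts: Record<Severity, number>;
  isClean: boolean;
}

// Group violations by severity and tally each group.
function aggregate(violations: Violation[]): ScanResult {
  const bySeverity: Record<Severity, Violation[]> = {
    CRITICAL: [],
    HIGH: [],
    MEDIUM: [],
    LOW: [],
  };
  for (const v of violations) {
    bySeverity[v.severity].push(v);
  }
  const counts: Record<Severity, number> = {
    CRITICAL: bySeverity.CRITICAL.length,
    HIGH: bySeverity.HIGH.length,
    MEDIUM: bySeverity.MEDIUM.length,
    LOW: bySeverity.LOW.length,
  };
  // Assumption: "clean" means no violations at any severity.
  return { bySeverity, counts, isClean: violations.length === 0 };
}

const report = aggregate([
  { word: "bestseller", severity: "HIGH", field: "title" },
  { word: "notebook", severity: "LOW", field: "description" }, // placeholder severity
]);
```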

The actual matching code

This is the core. Notice there is no model call anywhere — just RegExp, .exec(), and .match().

// 1. Compile each banned word into a word-boundary regex
pattern: new RegExp("\\b" + escapeRegex(entry.w) + "\\b", "i")

// 2. Longest phrases first, so "Amazon bestseller" wins over "bestseller"
entries.sort((a, b) => b.word.length - a.word.length)

// 3. Pass 1 — global counts for stuffing detection
const matches = field.text.match(new RegExp(entry.pattern.source, "gi"))

// 4. Pass 2 — does this field contain the word?
const match = entry.pattern.exec(field.text)
if (!match) continue;          // no hit, move on

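// character range of this hit (assumed derivation)
const start = match.index
const end = start + match[0].length

// skip if inside an already-matched longer phrase
const overlaps = matchedRanges.some(([s, e]) => start >= s && end <= e)
if (overlaps) continue;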

The only "intelligence" is hand-coded context rules

Because regex has no idea what a word means, the false positives are handled with explicit allow-lists — still 100% deterministic:

// "child", "kids", "young girl" are fine in a children's coloring book
if (isChildrenBook && CHILDREN_SAFE_WORDS.has(wordLower)) continue;

// "treat", "cure", "heal" are everyday words in a story description
if (field.isDescription && NARRATIVE_SAFE_MEDICAL.has(wordLower)) continue;

// "book", "gift", "new" only count as stuffing if repeated 3+ times
if (entry.category === STUFFING_CATEGORY && totalCount < 3) continue;
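Pulled into a single function, the three rules might read as below. The set contents are taken from the comments above; `shouldSuppress`, the `Candidate` and `ScanContext` shapes, and the value of `STUFFING_CATEGORY` are assumptions.

```typescript
const CHILDREN_SAFE_WORDS = new Set(["child", "kids", "young girl"]);
const NARRATIVE_SAFE_MEDICAL = new Set(["treat", "cure", "heal"]);
const STUFFING_CATEGORY = "keyword-stuffing"; // hypothetical category name

interface Candidate {
  word: string;
  category: string;
  totalCount: number; // occurrences across all fields, from the counting pass
}

interface ScanContext {
  isChildrenBook: boolean;
  isDescription: boolean;
}

// Return true when a regex hit is a known false positive and should be dropped.
function shouldSuppress(candidate: Candidate, ctx: ScanContext): boolean {
  const wordLower = candidate.word.toLowerCase();
  if (ctx.isChildrenBook && CHILDREN_SAFE_WORDS.has(wordLower)) return true;
  if (ctx.isDescription && NARRATIVE_SAFE_MEDICAL.has(wordLower)) return true;
  if (candidate.category === STUFFING_CATEGORY && candidate.totalCount < 3) return true;
  return false;
}

const suppressedInStory = shouldSuppress(
  { word: "Cure", category: "medical-claims", totalCount: 1 }, // hypothetical category
  { isChildrenBook: false, isDescription: true },
);
```

The same word is still reported when it appears in a title, because the narrative allowance applies only to description fields. Each rule is a plain boolean test, so every suppression can be traced to one line.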

Why this matters for your medical coding project

The KDP checker is a clean illustration of a controlled-vocabulary classification problem — and your medical coding case (mapping text to ICD-10 / CPT / SNOMED codes from a known list) is the same shape. When the set of valid outputs is finite and defined in advance, regex against that list usually beats letting an LLM categorize semantically.

Criterion	Regex + word list (this checker)	LLM semantic categorization
Determinism	Same input → same output, every time. Auditable.	Probabilistic; can drift between runs or model versions.
Explainability	"Matched `\bcure\b` in the title" — you can point at the exact rule.	"The model thought so." Hard to defend to an auditor or regulator.
Cost & latency	No per-record cost; a scan runs locally in the browser with no network round trip.	API call per record; cost and latency scale with volume.
Updating the list	Edit a JSON row. No retraining, no prompt tuning.	Re-prompt or fine-tune; behavior can shift elsewhere.
Compliance/legal	Versioned, diffable, reproducible — what regulated domains want.	Black box; reproducibility is a real problem.
Handles synonyms / paraphrase	No — you must enumerate variants ("heal", "heals", "healing").	Yes — generalizes to unseen phrasings.
Context & ambiguity	Blind to meaning; needs hand-coded suppression rules.	Understands "afternoon treat" ≠ medical "treat".

The pragmatic takeaway: For a known code set with controlled terminology — medical coding, KDP banned words, profanity filters, regulatory keywords — a regex/dictionary layer should be the first pass. It's deterministic, cheap, and explainable. Reserve the LLM for the genuinely ambiguous residue: free-text notes where the right code depends on meaning a lookup can't capture. The KDP checker shows the hybrid in miniature: regex does the matching, and a short list of hand-written rules — not a model — resolves the predictable false positives.
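A sketch of that dictionary-first routing, under stated assumptions: `codeNote`, `CodeTerm`, and the `fallback` callback are hypothetical, the fallback stands in for whatever model or human review handles the residue, and the two term rows are only a tiny illustrative sample of ICD-10 descriptions.

```typescript
interface CodeTerm {
  term: string;
  code: string;
}

interface CodedNote {
  code: string | null;
  source: "dictionary" | "fallback" | "unresolved";
  matchedTerm?: string;
}

function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// First pass: deterministic lookup. Second pass: optional fallback for the residue.
function codeNote(
  note: string,
  terms: CodeTerm[],
  fallback?: (note: string) => string | null,
): CodedNote {
  const sorted = terms.slice().sort((a, b) => b.term.length - a.term.length);
  for (const t of sorted) {
    const pattern = new RegExp("\\b" + escapeRegex(t.term) + "\\b", "i");
    if (pattern.test(note)) {
      return { code: t.code, source: "dictionary", matchedTerm: t.term };
    }
  }
  const guess = fallback ? fallback(note) : null;
  return guess !== null
    ? { code: guess, source: "fallback" }
    : { code: null, source: "unresolved" };
}

// Illustrative sample rows only.
const sampleTerms: CodeTerm[] = [
  { term: "essential hypertension", code: "I10" },
  { term: "type 2 diabetes mellitus without complications", code: "E11.9" },
];

const direct = codeNote("Hx of essential hypertension, well controlled.", sampleTerms);
const residue = codeNote("BP has been running high lately.", sampleTerms);
```

Recording `source` on every result keeps the audit trail intact: a reviewer can see which codes came from an exact dictionary match and which came from the less predictable path.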

Source: src/lib/kdpCompliance.ts (scanner) and src/data/kdp_banned_words.json (word list, v1.0.0, 756 entries / 23 categories). The scanner is invoked on demand from KdpComplianceDialog.tsx. Note: Amazon publishes no official exhaustive list; the JSON is compiled from KDP guidelines plus community documentation.