MAR 21, 2026Methods

Fuzzy matching records: catching the duplicate the invoice number hides

An AP team paid the same vendor invoice twice, and the reason it slipped through is the useful part: "Because the invoice numbers were slightly different, QuickBooks didn't catch it." That isn't a software failure — it's what happens when you match on one raw key and trust it. Exact matching only works when both systems write the key the same way, and they almost never do. Fuzzy matching is the fix, but its job is widely misread: it surfaces look-alike pairs for you to confirm, not to auto-merge anything. Here's how to normalize keys so more matches land cleanly, how to match on what doesn't drift when the key is hopeless, and the duplicate-payment test that would have caught that doubled invoice.

Why didn't the software catch the duplicate?

The honest answer isn't "invoice numbers are hard." It's that almost every out-of-the-box match does exact, one-to-one matching on a single key. An operator named the mechanism precisely in a thread titled "Reconciliation tools are just fancy VLOOKUPs and nobody wants to say it": "almost everything still does 1:1 exact matching. Meaning — if your ERP spits out ORD-2024-10042 and your bank statement has 10042, the tool says unmatched." Same defect in the duplicate case: two entries that are really one bill look like two records because one key reads INV-4021 and the other 4021-A. The match did exactly what you asked. You asked the wrong question.

The seed thread's best answer was honest but not a method. A commenter replied that "Its hard when the invoice numbers are different, that makes it harder to catch" and fell back on a person noticing — "there are only 1-2 AP people and usually they would catch the dupe invoice." That holds until volume, turnover, or a second cost category outruns one person's memory, which is exactly how this one got through. The fix isn't a sharper pair of eyes. It's matching on the fields that don't drift.

What fuzzy matching is — and the one thing it must not do

Fuzzy matching compares records that aren't identical and scores how alike they are, so near-misses surface instead of silently failing an exact-match test. The formal name for the whole problem is record linkage, and the probabilistic flavor is literally called fuzzy matching — pairs above a similarity threshold are matches, below it are non-matches, and the band in between are "possible matches" a person decides.

Here is the discipline that everything else hangs on: a similarity score is a candidate flag, not a verdict. It tells you "these two might be the same — look." It must never auto-merge or auto-clear on its own, because the same machinery that pairs INV-4021 with 4021-A will just as happily pair two genuinely different invoices that happen to share an amount. Surface, then confirm. Skip the confirm and you've only traded missed duplicates for invented ones.

Normalize before you match (this does most of the work)

Before any scoring, most "non-matching" keys are just formatted differently. Normalization is the cheap, boring step that turns the majority of your near-misses into clean exact matches. Do it to both sides, into a new helper column, and match on that — leave the raw key untouched so you can always trace back.

What driftsNormalize it toWhy it matters
Leading / trailing spacesTrimmed textA trailing space makes 4021 ≠ "4021 "
Upper vs lower caseOne caseinv-4021 and INV-4021 are the same bill
Prefixes / suffixes (INV-, -A, /2024)The stable core, often just the digitsThis is the exact gap that hid the duplicate
Leading zeros (0004021)Stripped or padded consistentlyText-vs-number formatting flips these silently
Separators (- / .)RemovedORD-2024-10042 collapses toward 10042
=TRIM(A2)                                       ' strip stray spaces
=UPPER(TRIM(A2))                                ' spaces + case in one pass
=SUBSTITUTE(SUBSTITUTE(UPPER(TRIM(A2)),"INV-",""),"-A","")   ' drop known prefix/suffix
=VALUE(A2)                                      ' force a number, killing leading-zero drift

This is the same clean-key discipline that makes an XLOOKUP reconciliation work at all — XLOOKUP and VLOOKUP do exact matching, so they only reward you when the keys are already identical, and normalizing into a helper column is how you earn that. TRIM removes only the space character, so pair it with UPPER and SUBSTITUTE for the rest. The thing you're rebuilding here is a stable primary ID — when the systems never shared one, this is the work that fakes one well enough to match on.

When the key is hopeless, match on what doesn't drift

Sometimes there's no recoverable key — the two sides never shared one, or the invoice number is genuinely different on each. Stop trying to fix the key and match on the fields that don't move: the amount, the counterparty, and the date within a window. A duplicate payment almost always keeps the amount and the vendor identical even when the invoice number drifts. That's the signature you hunt for.

  1. Build a composite key from stable fields: normalized vendor + amount + period (month). Concatenate them into one column on both sides.
  2. Sort by that composite key so identical bills sit next to each other.
  3. Flag exact composite collisions. Two rows with the same vendor, same amount, and same period are duplicate candidates regardless of invoice number.
  4. Widen the date to a window, not an exact day. Real duplicates often post a few days apart — compare within plus-or-minus five days rather than on an exact date.
  5. Review each flagged pair by hand. Same bill paid twice, or two real invoices that happen to share an amount? Only a person closes that.
' Composite key on each side:
=UPPER(TRIM(Vendor)) & "|" & TEXT(Amount,"0.00") & "|" & TEXT(Date,"YYYY-MM")

' Flag any composite key that appears more than once:
=COUNTIF(KeyCol,K2)>1

Flagging same-key collisions is the same set-difference logic you'd use to find missing rows — you're just hunting the opposite signal: rows that appear too many times instead of too few.

Scoring near-duplicates when you can't enumerate the rules

Composite keys catch duplicates where you can name the stable fields. When you can't — free-text descriptions, customer names, addresses — you need a similarity score: a number from 0 to 1 for how alike two strings are. The standard measures, and where each one fits:

MeasureBest forHow it works
Levenshtein / edit distanceTypos and transpositions in codesCounts the single-character edits to turn one string into the other
Jaro–WinklerShort strings like namesRewards a shared prefix; built for record linkage
Jaccard (token overlap)Reordered words ("Acme Inc" vs "Inc, Acme")Overlap over union of the words; what Power Query uses

In a spreadsheet you don't implement these by hand. Power Query's fuzzy merge exposes a similarity threshold from 0.00 to 1.00, default 0.80, where 1.00 allows only exact matches and lower values match more loosely. Turn on "show similarity scores" so you can see why each pair matched, and tune the threshold against your own data rather than trusting the default.

The duplicate-payment test, end to end

Put it together against the original problem: a bill paid twice under slightly different invoice numbers. Here's the pass that catches it.

  1. Pull both payment runs into one sheet with vendor, amount, invoice number, date, and cost category in columns.
  2. Normalize the vendor and invoice number into helper columns (trim, upper, strip prefixes).
  3. Build the composite key: vendor + amount + month. Ignore the invoice number for this pass — it's the field that lied.
  4. COUNTIF the composite key. Anything appearing more than once is a candidate.
  5. For each candidate, widen to a five-day window and read the two rows. Same vendor, same amount, days apart, two different cost categories is the exact fingerprint of this error.
  6. Confirm and resolve by hand: void the duplicate, document which one was real.

This is the failure mode behind a complaint that a reconciliation can say everything matched and still be wrong. One commenter described how auditors found "duplicate vendor payments that looked clean because amounts offset perfectly." A top-line "MATCHED" only proves the totals tie; it says nothing about whether two line items are secretly the same bill. The duplicate test above is what looks underneath the checkmark.

When this stops scaling

A helper column and a COUNTIFS pass are the right tools for a few hundred rows you check monthly — it's the manual method with a normalization step bolted on, and it pairs naturally with a bank reconciliation or a month-end tick-and-tie where a doubled entry is exactly what you're watching for. It strains when the same dedup runs every week, when there are thousands of rows, or when "is this really the same vendor" needs judgment a formula can't encode. At that point you want the normalization rules and the match logic written down once and run the same way every time, not rebuilt from memory. If you reach for AI there, use it where it's actually strong — describing the normalization rules and explaining candidate pairs — and keep the arithmetic and the matching in code you can audit, so "these are duplicates" stays a claim you can defend.

Frequently asked questions

Why didn't QuickBooks catch the duplicate invoice?

Because it matched on the invoice number, and the two entries had slightly different numbers. Out-of-the-box matching is exact and one-to-one on a single key, so any drift in that key reads as two separate records. Matching on stable fields instead — vendor, amount, and a date window — catches the duplicate the invoice number hides.

What is fuzzy matching in reconciliation?

Comparing records that are not identical and scoring how similar they are, so near-misses surface instead of silently failing an exact match. It is a form of record linkage. Its output is candidate pairs for a person to confirm, not an automatic merge.

How do I find duplicate payments in Excel?

Normalize the vendor and amount into helper columns, build a composite key of vendor plus amount plus month, and use COUNTIFS to flag any key that appears more than once. Review each flagged pair within a few-day window by hand; a duplicate keeps the amount and vendor even when the invoice number differs.

Should fuzzy matching merge records automatically?

No. A similarity score flags candidates; it should never auto-merge or auto-clear. The same scoring that pairs a mistyped key with its original will also pair two genuinely different records that share an amount. Always confirm by hand before resolving.

What similarity threshold should I use for fuzzy matching?

There is no universal number. In Power Query the default is 0.80, where 1.00 is an exact match and lower values match more loosely. Start generous to surface candidates, turn on similarity scores to see why pairs matched, and tune against your own data — but always confirm matches manually rather than clearing on the score alone.