Dedupe provider list by NPI (recruiter-proof SOP + CSV rules)

Ben Argeband, Founder & CEO of Heartbeat.ai — Very operational; reduce rework and candidate annoyance.

What’s on this page:

Who this is for

Ops and recruiters cleaning lists before outreach/enrichment. If you’re about to upload a file, assign recruiter ownership, or start outreach, this SOP prevents double-taps, duplicate submissions, and reporting noise.

Scope: you have a spreadsheet/CSV with some mix of NPI, name, specialty, practice location, and sometimes a state license. We’ll treat this as identity resolution: deciding when two rows represent the same provider, then keeping one survivor record with an audit trail.

Quick Answer

Core Answer: To dedupe provider list by NPI, normalize NPI to 10 digits, group by NPI, select one survivor per group, and log duplicates with source and recency.
Key Insight: NPI is the safest primary key; when it’s missing or invalid, fall back to state license, then name+city with strict normalization and a duplicate log.
Best For: Ops and recruiters cleaning lists before outreach/enrichment.

Compliance & Safety

This method is for legitimate recruiting outreach only. Always respect candidate privacy, opt-out requests, and local data laws. Heartbeat does not provide medical advice or legal counsel.

Framework: The “One Person, One Record” Rule: stop double-tapping candidates

Duplicates don’t just waste time. They create candidate annoyance (multiple calls/emails), internal confusion (two recruiters “own” the same physician), and bad economics (more dials per submittal, lower connectability, and messy attribution).

Deliverables when you follow this SOP:

SURVIVORS: one row per provider identity, ready for outreach or enrichment.
DUPLICATES_LOG: a mapped audit trail (duplicate_of, tier, reason, source, timestamp).
Before/after snapshot: duplicate rate and collision rate so ops can prove the cleanup worked.

One identity (the person) can have many attributes (emails, phones, locations, licenses).
Your working file should have one survivor row per identity, plus a duplicate log that preserves where the other rows came from.
Every dedupe decision should be explainable later (ops, compliance, recruiter handoffs).

In this SOP, recency is the tie-breaker: when two rows disagree, keep the most recently updated or most recently verified attribute, and record what changed.

Step-by-step method

Step 0: Make a safe working copy and add audit columns

Duplicate the file and add these columns (even if blank):

row_id (stable unique ID; if you don’t have one, create it)
dedupe_key (the key you used to group)
dedupe_tier (NPI / state license / name+city)
survivor_flag (TRUE/FALSE)
duplicate_of (row_id of survivor)
duplicate_reason (why you grouped them)
source (where the row came from)
last_updated (date field if you have it)

This is how you log duplicates without losing traceability.

Step 1: Normalize NPI (primary key)

NPI definition: The National Provider Identifier (NPI) is a unique 10-digit identifier for health care providers in the U.S.

Normalization rules (apply before grouping):

Strip spaces, hyphens, and non-numeric characters.
After cleaning, accept only values that are exactly 10 digits.
Anything else is treated as missing NPI and routed to fallback tiers.

Copy/paste rules (Google Sheets examples):

NPI_CLEAN (digits only): =REGEXREPLACE(A2,”[^0-9]”,”” )
NPI_VALID (10 digits): =IF(LEN(B2)=10,TRUE,FALSE)
NPI_FOR_KEY (blank if invalid): =IF(C2,B2,””)

Set dedupe_tier to “NPI” when NPI_VALID is TRUE.

Step 1.5: Separate individual vs organization records before dedupe

One common ops failure is mixing individual providers with organization records in the same file. Don’t try to “fix” this during dedupe—split it first.

If your source labels record type, split into two tabs: INDIVIDUAL and ORGANIZATION, then dedupe INDIVIDUAL only with this SOP.
If you don’t have a label, use a practical rule: if the row’s “name” field is clearly a facility/clinic name (not a person name), route it to ORGANIZATION for separate handling and exclude it from this person-level dedupe run.

This keeps your “one person, one record” logic clean and prevents nonsense groups.

Step 2: Group by NPI and choose a survivor record

Dedupe definition: Dedupe is the process of identifying multiple rows that represent the same real-world provider and retaining one survivor record while preserving an audit trail of the duplicates.

Group rows where NPI_FOR_KEY matches. For each NPI group:

Pick one survivor row (best contactability and freshest data).
Mark all other rows as duplicates and set duplicate_of to the survivor’s row_id.
Set duplicate_reason to “Same NPI after normalization”.

Survivor selection rules (recruiting order):

Row with the most recently verified direct contact fields (if you track verification date or last_updated).
Row with the most complete contact set (email + phone + location).
Row with the newest last_updated date.
If still tied, keep the row from your most trusted source and log why.

Multiple locations under one NPI (common):

Keep one identity (one survivor row_id) and treat locations as attributes.
If your system needs multiple location rows, keep them linked to the same survivor row_id (do not create multiple identities).
If two rows disagree on specialty, keep one identity and log the conflict; prefer the most recent source for the primary specialty field.

Step 3: Fallback keys when NPI is missing or invalid

Match key definition: A match key is the standardized identifier (or composite of fields) used to decide whether two records refer to the same provider.

Not every row will have a valid NPI. Your fallback tiers should be strict and documented:

Tier 2: state license. Normalize by removing spaces/punctuation and uppercasing. Group by (license state + license number). Set dedupe_tier to “state license”.
Tier 3: name + city fallback. Use only when NPI and state license are missing. Normalize name and city/state. Group by (last name + first initial + city + state). Set dedupe_tier to “name+city”.

The trade-off is… the deeper you go into fallback tiers, the higher the risk of false positives (two different people who look similar). That’s why you keep the duplicate log and keep the rules strict.

Step 4: Build and keep the duplicate log (non-negotiable)

Do not delete duplicates. Move them to a separate tab called DUPLICATES_LOG and keep these fields:

survivor row_id
duplicate row_id
dedupe_tier used (NPI / state license / name+city)
duplicate_reason
source for both rows
timestamp of the dedupe run

This prevents ops failures later when you need to reconcile submissions, outreach history, or suppression lists.

Diagnostic Table:

Use this to diagnose what kind of duplicates you have and what to do next. (Visual note: Add “dedupe rules” table + sample formulas (Sheets). Add “rules table” visual note.)

Symptom in your file	Likely cause	Best match key	What to do (CSV rules)	What to log
Same NPI appears on multiple rows	Multiple sources, repeated exports, or prior enrichment runs	NPI	Normalize to 10 digits; group; pick survivor by recency + completeness	duplicate_of survivor row_id; reason “Same NPI”
NPI missing on many rows	Older lists, partial exports, inconsistent upstream fields	state license	Normalize license; group by state+license; keep best contact row	Tier used + source priority
Two rows share name but different cities	Different people or provider moved	NPI (if present), otherwise keep separate	Do not dedupe unless NPI/license matches; treat as separate identities	Flag for review
Same name+city but different NPIs	Data entry error or two providers with similar names	NPI	Keep separate; validate NPI source; do not force consolidation	Conflict note
Facility/clinic names mixed into person rows	Organization records mixed into individual workflow	Pre-filter	Split into INDIVIDUAL vs ORGANIZATION tabs before dedupe	Filter rule used

Weighted Checklist:

Score each dedupe run before you upload/enrich/outreach. Total 100 points; if you’re under 80, fix the file before outreach or enrichment.

30 pts — NPI normalized to 10 digits and validated; invalid NPIs routed to fallback tiers
15 pts — Individual vs organization records separated before dedupe
15 pts — Fallback keys implemented: state license normalization + name+city normalization
15 pts — Survivor rules documented (recency + completeness + trusted source)
15 pts — DUPLICATES_LOG created with duplicate_of mapping, reasons, and timestamp
10 pts — Suppression-ready: opt-outs and “do not contact” flags carried to survivor record

Outreach Templates:

Use these when you discover you already contacted the provider under a different row. The goal is to stop double-taps and keep suppression clean.

Template 1: Apology + reset (same person, duplicate record)

Subject: Quick correction — one thread going forward

Hi Dr. {{LastName}} — I realized we had you in our system twice and you may have gotten more than one message from us. Sorry about that.

To keep it clean, I’m consolidating to one record and one point of contact on our side. If you’d prefer no outreach from us, reply “opt out” and I’ll suppress it.

If you’re open to a quick call, what’s the best number and time window?

— {{YourName}}, {{Role}}

Template 2: Internal note to recruiter (handoff after dedupe)

Subject: Dedupe complete for {{ListName}} — use survivor IDs only

Team — dedupe provider list is complete. Use only rows where survivor_flag=TRUE. Duplicates are in DUPLICATES_LOG with duplicate_of mapping and reasons.

Reminder: do not outreach from duplicate rows; it will double-tap the same physician and break attribution.

Template 3: Source escalation (bad upstream feed)

We’re seeing repeated duplicates caused by inconsistent NPI formatting and missing license fields. Please standardize NPI to 10 digits and include state license where available. We can share our CSV rules if helpful.

Common pitfalls

Using name-only dedupe. Name alone is not a key. If you must use name+city, keep it as a last resort and keep groups small and reviewable.
Deleting duplicates instead of logging them. When someone asks why a candidate got two emails, you need the audit trail.
Letting spreadsheets auto-format NPIs. Store cleaned NPI as text to avoid formatting issues.
Mixing organization records into person workflows. Split first; dedupe second.
Collapsing multiple locations into multiple identities. One provider can practice at multiple sites. Keep one identity record and store locations as attributes (or keep multiple location rows linked to the same survivor row_id).
Specialty conflicts across duplicates. Keep one identity; log conflicting specialty values and prefer the most recent source for the primary specialty field.
Not carrying suppression forward. If any duplicate row has an opt-out/do-not-contact flag, the survivor must inherit it.

How to improve results

1) Implement the CSV_RULES worksheet (repeatable and auditable)

This is the distinct element you can standardize across every inbound list. The goal: every file gets the same normalization, the same keys, and the same log format.

CSV column spec (recommended headers):

npi_raw (text)
npi_clean (text; digits only)
npi_valid (boolean)
license_state (text)
license_number_raw (text)
license_clean (text)
first_name, last_name (text)
city, state (text)
source (text)
last_updated (date)
dedupe_tier (text)
dedupe_key (text)
row_id (text)
survivor_flag (boolean)
duplicate_of (text)

Copy/paste rules (Google Sheets examples):

LICENSE_CLEAN: =UPPER(REGEXREPLACE(E2,”[^A-Za-z0-9]”,””))
NAME_KEY: =UPPER(H2)&”|”&LEFT(G2,1)
CITY_KEY: =UPPER(I2)&”|”&J2
DEDUP_KEY: =IF(C2,”NPI|”&B2,IF(F2<>””,”LIC|”&D2&”|”&F2,”NAMECITY|”&K2&”|”&L2))

Example dedupe_key outputs (what you should see in the file):

NPI tier: NPI|1234567890
state license tier: LIC|CA|A12345
name+city tier: NAMECITY|SMITH|J|AUSTIN|TX

Then sort by dedupe_key and your “best record” signals (last_updated, completeness) so the survivor is always the first row in each group.

2) Key choice matrix (precision vs coverage)

Tier	When to use	Main risk	What to log
NPI	Default when valid 10-digit NPI is present	Bad upstream NPI entry or org/person mixing	Normalization applied + survivor rule used
state license	When NPI is missing/invalid but license is present	License formatting differences across sources	State + cleaned license + source priority
name+city	Last resort only	False positives (similar names)	Exact normalization rules + review flag

3) Add measurement so you can prove dedupe is working

Measure this by… running a before/after report on duplicates removed and outreach collisions prevented, then tracking downstream contactability metrics.

Duplicate Rate = duplicate rows / total rows (report per 1,000 rows). Track by source.
Collision Rate = outreach attempts to duplicate identities / total outreach attempts (report per 100 attempts).
Connect Rate = connected calls / total dials (per 100 dials).
Answer Rate = human answers / connected calls (per 100 connected calls).
Deliverability Rate = delivered emails / sent emails (per 100 sent emails).
Bounce Rate = bounced emails / sent emails (per 100 sent emails).
Reply Rate = replies / delivered emails (per 100 delivered emails).

4) Export-ready output (what you hand to recruiters or upload)

Outreach/enrichment file: include only rows where survivor_flag=TRUE.
Required columns to keep: row_id, npi_clean (or NPI_FOR_KEY), dedupe_tier, source, last_updated, and your contact fields.
Audit file: keep DUPLICATES_LOG as a separate tab or separate CSV with survivor mapping and timestamp.

5) Where Heartbeat.ai fits after dedupe

Once you have one record per provider, enrichment and outreach workflows behave: fewer wasted lookups, fewer duplicate touches, cleaner attribution. Heartbeat.ai can support this by keeping identity resolution consistent and ranked mobile numbers by answer probability so recruiters spend dials where they’re most likely to connect.

Legal and ethical use

This is operational guidance for recruiting workflows, not legal advice. Use dedupe and identity resolution to reduce duplicate outreach and respect candidate preferences.

Honor opt-outs immediately and propagate suppression flags to the survivor record.
Limit access to raw lists; keep the duplicate log for audit, but don’t over-share it.
Only contact providers for legitimate recruiting opportunities and keep messages relevant.

Evidence and trust notes

NPI is a federal identifier with official documentation. These sources define NPI and its role as a standard identifier:

For how Heartbeat.ai evaluates data quality and operational trust, review: Heartbeat.ai trust methodology and definitions.

If your workflow includes matching NPI to licensing, use this sibling playbook next: NPI to state license matching (ops workflow).

FAQs

Is NPI always the best way to dedupe provider records?

For individual providers, NPI is usually the cleanest primary key because it’s designed as an identifier. When NPI is missing or invalid, use state license next, then name+city as a last resort with strict normalization and logging.

What should I do when two rows have the same NPI but different emails or phones?

Keep one survivor record per NPI and treat emails/phones as attributes. Prefer the most recent, most verified contact points, and keep the other values in notes or an attribute table if your system supports it. Always keep the duplicate log.

What do I do with rows that fail NPI validation?

Route them to the fallback tiers: try state license first, then name+city only if you have no better key. Flag name+city groups for review and keep the dedupe_tier in your log so you can audit false positives later.

How do I prevent dedupe from breaking recruiter ownership and attribution?

Run dedupe before assignment. Use survivor row_ids as the only assignable records, and map duplicates to survivors in DUPLICATES_LOG so historical activity can be reconciled without creating new “owners” for the same person.

What’s the minimum I need to keep for an audit trail?

Survivor row_id, duplicate row_id, dedupe tier used, reason, source, and timestamp. That’s enough to explain what happened and to debug upstream list quality.

Next steps

If you’re about to enrich: prep and upload your physician list for enrichment after you dedupe.
If you need fast coverage first: run a bulk physician lookup, then dedupe the combined output before outreach.
If you want this running as a repeatable workflow in Heartbeat.ai: create an account and set up your list workflow.

About the Author

Ben Argeband is the Founder and CEO of Swordfish.ai and Heartbeat.ai. With deep expertise in data and SaaS, he has built two successful platforms trusted by over 50,000 sales and recruitment professionals. Ben’s mission is to help teams find direct contact information for hard-to-reach professionals and decision-makers, providing the shortest route to their next win. Connect with Ben on LinkedIn.