$ man deduplication
Deduplication
Checking for and removing duplicate records before processing. If a company was already enriched in a previous run, skip it. If a contact appears twice with slightly different names, merge them.
Duplicate records waste API credits, create confusion in CRMs, and inflate your pipeline numbers. If you run an enrichment script twice without dedup, you get duplicate rows. If you import contacts to HubSpot without dedup, you get duplicate contact objects. If you send emails to duplicates, the same person gets two identical messages and marks you as spam. Deduplication is a gate that should exist at every handoff point in the pipeline.
I build dedup into every script. At the start of a batch run, I load the existing output CSV and build a set of already-processed domains or emails. Before each API call, I check whether the domain is already in the set: if yes, skip it; if no, process it and add it to the set. This makes scripts safe to re-run: if a run fails after record 40 of 73, I restart and it picks up at record 41 automatically. For CRM dedup, Clay has Sculptor (AI-powered fuzzy matching) that catches "Microsoft" vs "Microsoft Corporation" vs "MSFT." For contact-level dedup, I use email as the primary key: if the email already exists in HubSpot, I update the record instead of creating a new one.
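A minimal Python sketch of the skip-if-seen check for batch runs, assuming the output CSV has a `domain` column. The names `normalize_domain`, `load_processed`, and `run_batch`, the `enrich` callable, and the `fieldnames` list are all hypothetical stand-ins for illustration, not taken from a real script or vendor API.

```python
import csv
import os


def normalize_domain(raw):
    """Lowercase, trim, and drop a leading 'www.' so near-identical domains compare equal."""
    domain = raw.strip().lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return domain


def load_processed(output_path, key="domain"):
    """Return the set of keys already in the output CSV (empty set if there is no file yet)."""
    if not os.path.exists(output_path):
        return set()
    with open(output_path, newline="", encoding="utf-8") as f:
        return {normalize_domain(row[key]) for row in csv.DictReader(f) if row.get(key)}


def run_batch(domains, output_path, enrich, fieldnames):
    """Enrich each unseen domain once and append the result to the output CSV.

    `enrich` is a hypothetical callable standing in for the real API call:
    it takes a domain and returns a dict with the remaining columns.
    Returns the number of domains processed in this run.
    """
    seen = load_processed(output_path)
    # Only write a header when the file is new or empty.
    need_header = not os.path.exists(output_path) or os.path.getsize(output_path) == 0
    processed = 0
    with open(output_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if need_header:
            writer.writeheader()
        for raw in domains:
            domain = normalize_domain(raw)
            if not domain or domain in seen:
                continue  # already handled in this run or a previous one
            row = {"domain": domain, **enrich(domain)}
            writer.writerow(row)
            f.flush()  # persist each row so a crash loses at most the current record
            seen.add(domain)
            processed += 1
    return processed
```

Because each row is appended and flushed as soon as it is enriched, the output file doubles as the checkpoint: a restart reloads it into `seen` and only calls `enrich` for domains that are not there yet. The same pattern works with emails as the key if the column and normalizer are swapped.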