Entity Resolution
Business teams depend on trusted data to execute their growth initiatives, but that data is often messy. One core reason is duplicate records for entities such as Users, Organizations, Customers, Products, or other objects.
These duplicate entities lead to chaos in CRM tools, inefficient marketing campaigns, incorrect analysis, and wasted resources.
Entity Resolution helps you de-duplicate and associate records across all your data sources. Some key use cases include:
Removing duplicate records from your CRM applications (Salesforce, HubSpot, etc.).
Creating golden customer records from across various data sources.
Identity Resolution: resolving anonymous users into real users.
Associating different users into a common unit, for example creating a household record from individual users.
Entity Resolution helps you build out your Golden Record — a single source of truth for your business applications.
Some of the most common rules include:
Matching users based on email address, mailing address or customer IDs
Matching companies based on their domain, company name or location
You can combine rules with AND and OR operators to build more complex match logic.
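To make the AND/OR combination concrete, here is a minimal Python sketch. It is not Census's implementation: the helper names (`exact`, `all_of`, `any_of`), the column names, and the case-insensitive comparison are all assumptions made for illustration.

```python
from typing import Callable

Record = dict[str, str]
Rule = Callable[[Record, Record], bool]


def exact(column: str) -> Rule:
    """Hypothetical rule: both records share a non-empty value in `column`."""
    def rule(a: Record, b: Record) -> bool:
        # Assumption: values are compared case-insensitively after trimming.
        left = a.get(column, "").strip().lower()
        right = b.get(column, "").strip().lower()
        return left != "" and left == right
    return rule


def all_of(*rules: Rule) -> Rule:
    """AND: every rule must match."""
    return lambda a, b: all(rule(a, b) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """OR: at least one rule must match."""
    return lambda a, b: any(rule(a, b) for rule in rules)


# email OR (full_name AND phone)
is_same_user = any_of(exact("email"), all_of(exact("full_name"), exact("phone")))
```

With this shape, a pair of records that differ in email can still match when both the name and the phone number agree.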
Fuzzy Match
Jaro-Winkler similarity scores how alike two strings are on a scale from 0 (nothing in common) to 1 (identical). It tolerates variations such as typos, transposed characters, and abbreviations, and gives extra weight to strings that share a common prefix. This algorithm is especially effective in handling messy real-world data and is commonly used in places like the US Census 🙂.
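For readers who want to see how such a score is computed, here is a sketch of the textbook Jaro-Winkler formula in plain Python. The prefix scale of 0.1 and the 4-character prefix cap are the standard defaults; whether Census uses these exact settings or normalizes strings first is not stated here, so treat this as an illustration only.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: shared characters within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count mismatches.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)
```

On the classic reference pair, `jaro_winkler("MARTHA", "MARHTA")` comes out at roughly 0.961, which would clear the High threshold described in this section.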
We provide three customizable thresholds for determining similarity:
Low (0.85)
Example matches: John Smith vs. Jonathon Smythe; ABC Technologies Inc. vs. ACB Tech
Example non-matches: John Smith vs. Johnathan Smithson; ABC Technologies Inc. vs. XYZ Tech
The names have some phonetic similarity, but the spelling and length differences result in low confidence.
Medium (0.9)
Example matches: Acme Corporation vs. Acme Corp; The Baker’s Delight vs. Baker’s Delight
Example non-matches: Acme Corporation vs. Apex Corp; The Baker’s Delight vs. The Delight
The names are similar, but abbreviations and structural differences lower the match confidence to medium.
High (0.95)
Example matches: Census Data Inc. vs. Census Data; Incorporated International Business Machines vs. IBM International Business Machines
Example non-matches: Census Data Inc. vs. Census Analytics Inc.; International Business Machines vs. Global Business Machines
The strings are nearly identical, with minor differences, leading to a high confidence match.
When building your match rules, you can choose which matching method to use for each column separately. You can mix exact match and fuzzy match rules in the same configuration.
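The sketch below shows one way a per-column configuration mixing exact and fuzzy rules could be evaluated. The `MATCH_RULES` layout and function names are hypothetical, and the scorer is `difflib.SequenceMatcher` from the Python standard library, used purely as a stand-in: it is not Jaro-Winkler, so its scores will differ from the ones Census computes.

```python
from difflib import SequenceMatcher

# Threshold names and values as listed in this section.
THRESHOLDS = {"low": 0.85, "medium": 0.90, "high": 0.95}

# Hypothetical per-column configuration: (column, method, threshold name).
MATCH_RULES = [
    ("phone", "exact", None),
    ("full_name", "fuzzy", "low"),
]


def similarity(a: str, b: str) -> float:
    # Stand-in scorer, NOT Jaro-Winkler; swap in a real implementation as needed.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def column_matches(a: dict, b: dict, column: str, method: str, threshold) -> bool:
    left, right = a[column], b[column]
    if method == "exact":
        return left == right
    return similarity(left, right) >= THRESHOLDS[threshold]


def records_match(a: dict, b: dict, rules=MATCH_RULES) -> bool:
    # All column rules must hold (AND) for the pair to count as a match.
    return all(column_matches(a, b, *rule) for rule in rules)
```

Raising the threshold on a fuzzy column makes the same pair harder to match, while the exact rule on another column is unaffected.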
Output Type
Census allows you to either merge duplicate records into one or mark them as duplicates. In marked-as-duplicate mode, Census adds columns that record each record's parent ID and whether it is a duplicate. When fuzzy match rules are applied in this mode, Census also adds a similarity score column, measured against the parent, for every fuzzy match rule.
If your source dataset looks like the one below, the deduplicated datasets would look like the examples that follow.
Source Dataset:
ID | Email | Full Name | Phone
1 | john@example.com | John Doe | 123-356-6891
2 | john@gmail.com | Johnny Doe | 123-356-6891
3 | jannet@example.com | Jannet Smith | 245-891-9012
Merged Dataset: Fuzzy Match on Full Name with Medium Similarity, Resolve to Longest Email
ID | Email | Full Name | Phone
1 | john@example.com | John Doe | 123-356-6891
3 | jannet@example.com | Jannet Smith | 245-891-9012
Marked as Duplicate Dataset: Fuzzy Match on Full Name with Medium Similarity, Resolve to Longest Email
ID | Similarity Score | Is Duplicate | Email | Full Name | Phone
1 | 1.0 | true | john@example.com | John Doe | 123-356-6891
2 | 0.96 | true | john@gmail.com | Johnny Doe | 123-356-6891
3 | 1.0 | false | jannet@example.com | Jannet Smith | 245-891-9012
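The difference between the two output types can be sketched in Python as follows. This is an illustration under assumptions, not Census's code: `parent_of` is a hypothetical mapping from each record ID to its winning record's ID, `is_duplicate` is a made-up column name, and the flag is set for every record in a group of more than one, which mirrors the example rows above but is not defined precisely in this document. Only `_census_parent_id` is a name taken from the docs.

```python
from collections import Counter


def merge_records(records: list[dict], parent_of: dict[int, int]) -> list[dict]:
    """Merged output: keep only the winning record of each group."""
    return [r for r in records if parent_of[r["id"]] == r["id"]]


def mark_records(records: list[dict], parent_of: dict[int, int]) -> list[dict]:
    """Marked output: keep every record and annotate it instead of dropping it."""
    group_sizes = Counter(parent_of.values())
    marked = []
    for record in records:
        parent = parent_of[record["id"]]
        marked.append({
            **record,
            "_census_parent_id": parent,
            # Assumption: flag every member of a group with more than one record.
            "is_duplicate": group_sizes[parent] > 1,
        })
    return marked


records = [
    {"id": 1, "email": "john@example.com", "full_name": "John Doe"},
    {"id": 2, "email": "john@gmail.com", "full_name": "Johnny Doe"},
    {"id": 3, "email": "jannet@example.com", "full_name": "Jannet Smith"},
]
parent_of = {1: 1, 2: 1, 3: 3}  # records 1 and 2 resolve to record 1
```

Here `merge_records` returns records 1 and 3, while `mark_records` returns all three rows with the extra columns attached.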
Merge rules help you identify the winning record among the duplicates. When records are merged, the ID of the winning record becomes the primary ID (the column configured as the unique ID on your source dataset). When records are left unmerged, it becomes the _census_parent_id. This ID is useful when syncing back to your business applications.
Census supports waterfall-style merge rules: the rules are evaluated in order, starting with the first, until a single record emerges as the winner.
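A waterfall of this kind can be sketched as a list of scoring functions applied in order, each one narrowing the set of candidates. The function name `pick_winner` and the two example rules (longest email, then lowest ID as a tiebreaker) are assumptions for illustration; the actual rule types Census offers are not listed here.

```python
from typing import Callable


def pick_winner(records: list[dict], rules: list[Callable[[dict], float]]) -> dict:
    """Apply rules in order; each keeps only the best-scoring candidates."""
    candidates = list(records)
    for rule in rules:
        best = max(rule(r) for r in candidates)
        candidates = [r for r in candidates if rule(r) == best]
        if len(candidates) == 1:
            break  # a single winner emerged, later rules are not needed
    # Assumption: if every rule ties, fall back to the first remaining record.
    return candidates[0]


merge_rules = [
    lambda r: len(r["email"]),  # rule 1: longest email wins
    lambda r: -r["id"],         # rule 2: lowest ID breaks ties
]

duplicates = [
    {"id": 1, "email": "john@example.com"},
    {"id": 2, "email": "john@gmail.com"},
]
winner = pick_winner(duplicates, merge_rules)
```

For the two John records, the first rule alone decides the outcome, so the second rule is never evaluated.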
Column Overrides help you override column values on the winning record. You can conditionally choose values for the final / resolved record.
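As an illustration of the idea, the sketch below replaces selected columns on the winning record with a value chosen from the whole group of duplicates. Everything here is hypothetical (`apply_overrides`, `longest`, `first_non_empty`, and the choice of columns); it only shows the general shape of a conditional override.

```python
from typing import Callable


def longest(values: list[str]) -> str:
    """Pick the longest value (the first one wins on ties)."""
    return max(values, key=len)


def first_non_empty(values: list[str]) -> str:
    """Pick the first value that is not empty, or an empty string if none exists."""
    return next((v for v in values if v), "")


def apply_overrides(
    winner: dict,
    group: list[dict],
    overrides: dict[str, Callable[[list[str]], str]],
) -> dict:
    """Return a copy of the winner with overridden columns chosen from the group."""
    resolved = dict(winner)
    for column, choose in overrides.items():
        resolved[column] = choose([record[column] for record in group])
    return resolved


group = [
    {"id": 1, "email": "john@example.com", "full_name": "John Doe", "phone": ""},
    {"id": 2, "email": "john@gmail.com", "full_name": "Johnny Doe", "phone": "123-356-6891"},
]
resolved = apply_overrides(group[0], group, {"full_name": longest, "phone": first_non_empty})
```

The winner keeps its own ID and email, while the name and phone come from whichever duplicate satisfies the override.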
Several internal parameters determine how Census runs the Entity Resolution algorithm. They are listed below with their default values. Contact Census if you would like to change any of them.
bands: how many slices of the signature are used to determine whether records are likely to be identical; the default is 4
hashes_per_band: the number of hash values from the signature that are used per band; the default is 8
use_first_char_blocking: only compare records that start with the same letter during fuzzy matches; the default is false
use_sorted_neighborhood: enable fuzzy matching by sorting records and comparing them within a small window; turned on by default
use_deletion_key: compare records against variants of other records that have characters deleted, to catch typos or missing letters; turned on by default
number of passes: how many times the algorithm runs on the records; the default is once
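Two of the techniques named in the list above, sorted neighborhood and deletion keys, are general candidate-generation methods that can be sketched briefly. The code below shows the generic textbook versions; the window size, the single-character deletion depth, and the function names are assumptions, and Census's internal behavior may differ.

```python
def sorted_neighborhood_pairs(values: list[str], window: int = 3) -> set[tuple[int, int]]:
    """Sort the values, then pair each one only with the next `window - 1` neighbors."""
    order = sorted(range(len(values)), key=lambda i: values[i].lower())
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def deletion_keys(value: str) -> set[str]:
    """The value itself plus every variant with exactly one character removed."""
    keys = {value}
    keys.update(value[:i] + value[i + 1 :] for i in range(len(value)))
    return keys


def share_deletion_key(a: str, b: str) -> bool:
    """Two values become comparison candidates when their key sets overlap."""
    return not deletion_keys(a).isdisjoint(deletion_keys(b))


names = ["Jannet Smith", "John Doe", "Johnny Doe", "Zed"]
candidates = sorted_neighborhood_pairs(names, window=2)
```

Both methods reduce the number of pairs that need a full fuzzy comparison: sorting keeps similar strings close together, and shared deletion keys catch values that differ by a dropped or substituted character, such as "smith" and "smyth".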
Entity resolution is performed by internal Census syncs, whose statuses you can view on the dataset page. You can also change how often these syncs run and manually trigger refreshes.
Census supports Deterministic Entity Resolution at the column level. Deterministic Entity Resolution uses a human-defined, rules-based approach to identify duplicate records or associated users.
Match Rules are the criteria we use to identify duplicate or associated records. You can define these rules with a number of possible operations, including Exact Match and Fuzzy Match.
Fuzzy matching is a technique used to identify and match similar strings that may not be identical. In Census, we want to ensure you always get a predictable and deterministic match. We use Jaro-Winkler similarity.
use_lsh: enable LSH to find fuzzy matches; turned on by default
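The bands and hashes_per_band parameters suggest a banded signature scheme of the kind used in locality-sensitive hashing. The sketch below assumes a MinHash signature over character trigrams, which this document does not confirm; the shingle size, the SHA-256-based hash family, and all function names are illustrative choices. Only the two default values (4 bands, 8 hashes per band) come from the parameter list.

```python
import hashlib
from collections import defaultdict

BANDS = 4            # default from the parameter list
HASHES_PER_BAND = 8  # default from the parameter list


def shingles(text: str, k: int = 3) -> set[str]:
    """Character k-grams of the lowercased text (assumed tokenization)."""
    text = text.lower()
    return {text[i : i + k] for i in range(max(len(text) - k + 1, 1))}


def seeded_hash(seed: int, token: str) -> int:
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str) -> list[int]:
    """One minimum per seeded hash function, BANDS * HASHES_PER_BAND values in total."""
    grams = shingles(text)
    return [min(seeded_hash(seed, g) for g in grams) for seed in range(BANDS * HASHES_PER_BAND)]


def band_keys(signature: list[int]) -> list[tuple]:
    """Slice the signature into BANDS consecutive chunks of HASHES_PER_BAND values."""
    return [
        (band, tuple(signature[band * HASHES_PER_BAND : (band + 1) * HASHES_PER_BAND]))
        for band in range(BANDS)
    ]


def candidate_pairs(values: list[str]) -> set[tuple[int, int]]:
    """Records that agree on every hash within at least one band become candidates."""
    buckets = defaultdict(list)
    for index, value in enumerate(values):
        for key in band_keys(minhash_signature(value)):
            buckets[key].append(index)
    pairs = set()
    for members in buckets.values():
        for pos, first in enumerate(members):
            for second in members[pos + 1 :]:
                pairs.add((first, second))
    return pairs
```

In banded schemes like this one, more hashes per band makes a band collision stricter, while more bands gives each pair more chances to collide, so the two parameters together trade recall against the number of comparisons.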
Entity Resolution generates a new dataset that is written back to the Census Store, allowing you to query and sync from it. If you want to bring your own S3 bucket for the Census Store, check out our docs.