Entity Resolution (Invite Only)
Business teams depend on trusted data to execute their growth initiatives, but that data is often messy. One core reason is duplicate records for the same entity: Users, Organizations, Customers, Products, or other objects.
These duplicate entities lead to chaos in CRM tools, inefficient marketing campaigns, incorrect analysis, and wasted resources.
Entity Resolution helps you de-duplicate and associate records across all your data sources. Some key use cases include:
Removing duplicate records from your CRM applications (Salesforce, HubSpot, etc.).
Creating golden customer records from across various data sources.
Identity Resolution: resolving anonymous users into real users.
Associating different users into a common unit, for example creating a household record from individual users.
Entity Resolution is a way to structure your data to help create a Golden Record: a single source of truth for your business applications.
Census supports Deterministic Entity Resolution with Fuzzy Match at the column level. Deterministic Entity Resolution uses a human-defined, rules-based approach to identify duplicate records or associated users and merge them into a single record.
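To make "merge them into a single record" concrete, here is a minimal Python sketch of one common way to turn pairwise matches into groups of duplicates, using a union-find structure. It assumes matches are transitive (if A matches B and B matches C, all three land in one group); this page does not say whether Census groups records this way, and `group_duplicates` is a hypothetical helper, not a Census API.

```python
def group_duplicates(record_ids, matched_pairs):
    """Group record IDs into clusters, given pairs that a match rule flagged.

    Assumption: matching is treated as transitive (illustrative only).
    """
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the root of the cluster, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the root so the result is deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for rid in record_ids:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(sorted(group) for group in groups.values())


# Records 1, 2 and 3 are linked through two pairwise matches.
print(group_duplicates([1, 2, 3, 4, 5], [(1, 2), (2, 3)]))
```

Each resulting group is a set of records that would then be resolved into one record.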
Match Rules are the criteria we use to identify duplicate or associated records. You can define these rules with a number of possible operations including Exact Match and Fuzzy Match.
Some of the most common rules include:
Matching users based on email address, mailing address or customer IDs
Matching companies based on their domain, company name or location
You can create complex rules by combining individual rules with AND and OR operators.
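As an illustration of combining rules with AND and OR, the following Python sketch models each match rule as a function over two records. The helper names (`exact`, `all_of`, `any_of`), the column names, and the choice to trim and lowercase values before comparing are all assumptions made for the example; they are not Census configuration or API.

```python
from typing import Callable, Dict

Record = Dict[str, str]
Rule = Callable[[Record, Record], bool]


def exact(column: str) -> Rule:
    """Match when both records have the same non-empty value in `column`.

    Assumption: values are trimmed and lowercased before comparison.
    """
    def rule(a: Record, b: Record) -> bool:
        left = (a.get(column) or "").strip().lower()
        right = (b.get(column) or "").strip().lower()
        return left != "" and left == right
    return rule


def all_of(*rules: Rule) -> Rule:
    """AND: every rule must match."""
    return lambda a, b: all(rule(a, b) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """OR: at least one rule must match."""
    return lambda a, b: any(rule(a, b) for rule in rules)


# Same email, OR (same company domain AND same name).
is_duplicate = any_of(exact("email"), all_of(exact("domain"), exact("name")))

a = {"email": "ada@example.com", "domain": "example.com", "name": "Ada Lovelace"}
b = {"email": "ADA@Example.com ", "domain": "other.com", "name": "A. Lovelace"}
c = {"email": "", "domain": "example.com", "name": "ada lovelace"}
d = {"email": "", "domain": "example.com", "name": "Grace Hopper"}

print(is_duplicate(a, b))  # matched by email
print(is_duplicate(a, c))  # matched by domain AND name
print(is_duplicate(a, d))  # no rule matches
```

Treating empty values as non-matching is a design choice in this sketch: it keeps two records with missing emails from being merged by accident.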
Fuzzy matching is a technique used to identify and match similar strings that may not be identical. To ensure you always get a predictable, deterministic match, Census uses Jaro-Winkler similarity.
Jaro-Winkler similarity scores two strings from 0 (nothing in common) to 1 (identical). It tolerates variations such as typos and transposed characters, and it gives extra weight to strings that share a common prefix, which helps with abbreviations. This algorithm is especially effective in handling messy real-world data and is commonly used in places like the US Census 🙂.
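For readers who want to see how the score is computed, below is a minimal Python sketch of the textbook Jaro-Winkler formula. It assumes the usual Winkler defaults (a prefix scale of 0.1 and a prefix capped at 4 characters) and applies no normalization such as lowercasing; the exact parameters and preprocessing Census applies are not documented on this page, so treat this as an illustration rather than the Census implementation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the shared prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Classic textbook pair: one transposition, shared prefix "MAR".
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Because the function is a pure calculation with no randomness, the same pair of strings always produces the same score, which is what makes threshold-based matching deterministic.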
We provide three customizable thresholds for determining similarity:
Low (0.6)
Matches: John Smith vs. Jonathon Smythe; ABC Technologies Inc. vs. ACB Tech
Does not match: John Smith vs. Johnathan Smithson; ABC Technologies Inc. vs. XYZ Tech
The names have some phonetic similarity, but the spelling and length differences result in low confidence.
Medium (0.8)
Matches: Acme Corporation vs. Acme Corp; The Baker’s Delight vs. Baker’s Delight
Does not match: Acme Corporation vs. Apex Corp; The Baker’s Delight vs. The Delight
The names are similar, but abbreviations and structural differences lower the match confidence to medium.
High (0.9)
Matches: Census Data Inc. vs. Census Data; Incorporated International Business Machines vs. IBM International Business Machines
Does not match: Census Data Inc. vs. Census Analytics Inc.; International Business Machines vs. Global Business Machines
The strings are nearly identical, with minor differences, leading to a high confidence match.
When building your match rules, you can choose which matching method to use for each column separately. You can mix exact match and fuzzy match rules in the same configuration.
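The per-column choice between exact and fuzzy matching can be sketched in Python as follows. The threshold values come from this page; everything else is an assumption for illustration: the `COLUMN_RULES` structure and column names are hypothetical, and the similarity function is a stand-in built on Python's `difflib.SequenceMatcher`, which is not Jaro-Winkler and will produce different scores than Census does.

```python
from difflib import SequenceMatcher

# Threshold names and values as listed on this page.
THRESHOLDS = {"low": 0.6, "medium": 0.8, "high": 0.9}

# Hypothetical configuration: one matching method per column.
COLUMN_RULES = {
    "domain": {"method": "exact"},
    "company_name": {"method": "fuzzy", "threshold": "low"},
}


def similarity(a: str, b: str) -> float:
    # Stand-in metric (difflib ratio), NOT Jaro-Winkler; swap in any
    # function that returns a score between 0 and 1.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def column_matches(column: str, a: dict, b: dict, rules: dict) -> bool:
    """Apply the configured method (exact or fuzzy) to one column."""
    left, right = a.get(column, ""), b.get(column, "")
    if not left or not right:
        return False  # assumption: empty values never match
    rule = rules[column]
    if rule["method"] == "exact":
        return left == right
    return similarity(left, right) >= THRESHOLDS[rule["threshold"]]


def records_match(a: dict, b: dict, rules: dict = COLUMN_RULES) -> bool:
    """AND across all configured columns."""
    return all(column_matches(col, a, b, rules) for col in rules)


a = {"domain": "acme.com", "company_name": "Acme Corporation"}
b = {"domain": "acme.com", "company_name": "Acme Corp"}
print(records_match(a, b))
```

Raising the `company_name` threshold to "high" in this sketch would reject the same pair, which shows how the threshold trades recall for precision on a single column without affecting the exact-match columns.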
Merge rules help you identify the winning record among the duplicates. The ID of the winning record becomes the primary ID and is useful when syncing back to your business applications.
Census evaluates merge rules as a waterfall: the first rule is evaluated first, then the next, and so on until one record becomes the winning record.
When you leave your merge rules empty, Census uses the record with the lowest ID as the winning record.
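The waterfall evaluation can be sketched in Python as a sequence of rules that each narrow the set of candidate records. The helper `prefer_max`, the column `last_activity`, and the decision to fall back to the lowest ID when rules still leave a tie are assumptions for this example; the page only states the lowest-ID behavior for empty merge rules.

```python
def prefer_max(column):
    """Merge rule: keep the candidates with the largest value in `column`."""
    def rule(candidates):
        present = [r for r in candidates if r.get(column) is not None]
        if not present:
            return candidates  # rule cannot decide; pass everything on
        best = max(r[column] for r in present)
        return [r for r in present if r[column] == best]
    return rule


def pick_winner(duplicates, merge_rules=()):
    """Apply rules in order until exactly one record remains."""
    candidates = list(duplicates)
    for rule in merge_rules:
        candidates = rule(candidates)
        if len(candidates) == 1:
            return candidates[0]
    # No rules, or rules left a tie: take the lowest ID.
    # (For ties this fallback is an assumption of the sketch.)
    return min(candidates, key=lambda r: r["id"])


duplicates = [
    {"id": 3, "last_activity": "2024-05-01"},
    {"id": 1, "last_activity": "2024-01-15"},
    {"id": 2, "last_activity": "2024-05-01"},
]

print(pick_winner(duplicates)["id"])                                # 1
print(pick_winner(duplicates, [prefer_max("last_activity")])["id"])  # 2
```

With no rules, the record with ID 1 wins. With the single rule, records 2 and 3 tie on the most recent activity, so the sketch falls back to the lower ID of the two.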
Column Overrides help you override column values on the winning record. You can conditionally choose values for the final / resolved record.
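As a rough illustration of a conditional override, the Python sketch below fills a column on the resolved record with the first non-empty value found across the winning record and its duplicates. The helper names and the "first non-empty" condition are hypothetical; this page does not list the conditions Census actually offers.

```python
def first_non_empty(column):
    """Override: take the first non-empty value, checking the winner first."""
    def choose(winner, others):
        for record in [winner, *others]:
            value = record.get(column)
            if value not in (None, ""):
                return value
        return None
    return choose


def apply_overrides(winner, others, overrides):
    """Return a resolved copy of the winner with overridden columns."""
    resolved = dict(winner)  # leave the original winner untouched
    for column, choose in overrides.items():
        resolved[column] = choose(winner, others)
    return resolved


winner = {"id": 1, "email": "", "phone": "555-0100"}
others = [{"id": 2, "email": "ada@example.com", "phone": ""}]

resolved = apply_overrides(winner, others, {"email": first_non_empty("email")})
print(resolved)
```

The resolved record keeps the winner's ID and phone number but picks up the email address that only the duplicate had.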
Entity Resolution generates a new dataset that is written back to your data warehouse under the Census Schema.
Entity Resolution is supported on Snowflake, BigQuery, Redshift and Postgres with support for other warehouses and data sources coming soon.