Entity Resolution

Business teams depend on trusted data to execute their growth initiatives, but that data is often messy. One core cause is duplicate records for entities such as Users, Organizations, Customers, Products, or other objects.

These duplicate entities lead to chaos in CRM tools, inefficient marketing campaigns, incorrect analysis and wasted resources.

Quickstart — Using Entity Resolution

The Entity Resolution flow can be kicked off from 2 places.

From the Datasets tab, select the dataset you would like to deduplicate from the list, and click the "Deduplicate" button on top of the list.
From a given Dataset, click "Deduplicate" in the top header to kick off the flow for that dataset.

Entity Resolution

Entity Resolution helps you de-duplicate and associate records across all your data sources. Some key use cases include:

Removing duplicate records from your CRM (Salesforce, Hubspot, etc) applications.
Creating golden customer records from across various data sources.
Identity Resolution - Resolving anonymous users into real users
Associating different users into a common unit, for example creating a household record from individual users

Entity Resolution helps you build out your Golden Record — a single source of truth for your business applications.

Core Concepts

Census supports Deterministic Entity Resolution with Fuzzy Match at the column level. Deterministic Entity Resolution uses a human-defined, rules-based approach to identify duplicate records or associated users.

Match Rules

Match Rules are the criteria we use to identify duplicate or associated records. You can define these rules with a number of possible operations including Exact Match and Fuzzy Match.

Some of the most common rules include:

Matching users based on email address, mailing address or customer IDs
Matching companies based on their domain, company name or location

You can create complex rules using composite AND and OR operators across the rules.
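As a rough illustration of how composite rules behave, the sketch below combines exact-match rules with AND and OR. It is not Census's configuration format or implementation; the helper names (exact, all_of, any_of) and the column names are hypothetical and exist only for this example.

```python
from typing import Callable, Optional

Record = dict[str, Optional[str]]
Rule = Callable[[Record, Record], bool]

def exact(column: str) -> Rule:
    """Hypothetical rule: both records hold the same non-NULL value."""
    def rule(a: Record, b: Record) -> bool:
        left, right = a.get(column), b.get(column)
        return left is not None and left == right
    return rule

def all_of(*rules: Rule) -> Rule:
    """AND: every rule must match."""
    return lambda a, b: all(r(a, b) for r in rules)

def any_of(*rules: Rule) -> Rule:
    """OR: at least one rule must match."""
    return lambda a, b: any(r(a, b) for r in rules)

# Match on email, OR on domain AND company name together.
is_duplicate = any_of(
    exact("email"),
    all_of(exact("domain"), exact("company_name")),
)

a = {"email": "john@example.com", "domain": "example.com", "company_name": "Acme"}
b = {"email": "j.doe@example.com", "domain": "example.com", "company_name": "Acme"}
c = {"email": "jannet@other.io", "domain": "other.io", "company_name": "Acme"}

print(is_duplicate(a, b))  # True: emails differ, but domain AND name match
print(is_duplicate(a, c))  # False: neither branch matches
```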

Fuzzy Match

Fuzzy matching is a technique used to identify and match similar strings that may not be identical. To ensure you always get a predictable, deterministic match, Census uses Jaro-Winkler similarity.

Jaro-Winkler similarity compares two strings to determine how similar they are, taking into account variations such as typos, order differences, and abbreviations. This algorithm is especially effective in handling messy real-world data and is commonly used in places like the US Census 🙂.
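For readers who want to see the mechanics, here is a minimal sketch of the textbook Jaro-Winkler similarity in plain Python. It uses the conventional prefix scale of 0.1 and a maximum prefix of 4 characters; whether Census uses these exact parameters, or normalizes case and whitespace first, is not stated here, so treat those details as assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

score = jaro_winkler("MARTHA", "MARHTA")
print(round(score, 4))  # 0.9611
```

A score is then compared against the chosen threshold: a pair scoring 0.9611 would clear a 0.95 cutoff, while a pair scoring 0.88 would only clear a 0.85 cutoff.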

We provide three customizable thresholds for determining similarity:

| Confidence Level | Matched | Not Matched | Explanation |
| --- | --- | --- | --- |
| Low (0.85) | John Smith vs. Jonathon Smythe; ABC Technologies Inc. vs. ACB Tech | John Smith vs. Johnathan Smithson; ABC Technologies Inc. vs. XYZ Tech | The names have some phonetic similarity, but the spelling and length differences result in low confidence. |
| Medium (0.9) | Acme Corporation vs. Acme Corp; The Baker’s Delight vs. Baker’s Delight | Acme Corporation vs. Apex Corp; The Baker’s Delight vs. The Delight | The names are similar, but abbreviations and structural differences lower the match confidence to medium. |
| High (0.95) | Census Data Inc. vs. Census Data; Incorporated International Business Machines vs. IBM International Business Machines | Census Data Inc. vs. Census Analytics Inc.; International Business Machines vs. Global Business Machines | The strings are nearly identical, with minor differences, leading to a high confidence match. |

When building your match rules, you can choose what matching method you want to use for each column separately. You can mix exact match and fuzzy match rules in the same configuration.

Output Type

Census allows you to either merge duplicate records into one or mark them as duplicates. In marked-as-duplicate mode, Census adds columns that note each record's parent ID and whether it is a duplicate. When fuzzy match rules are applied in this mode, Census also adds a similarity score column against the parent for every fuzzy match rule (for example, _census_email_similarity_with_parent).

Given the source dataset below, the examples that follow show how the deduplicated dataset looks in each mode.

Source Dataset:

| EMAIL | FULL_NAME | PHONE |
| --- | --- | --- |
| john@example.com | John Doe | 123-356-6891 |
| john@gmail.com | Johnny Doe | 123-356-6891 |
| jannet@example.com | Jannet Smith | 245-891-9012 |

Merged Dataset: Fuzzy Match on Full Name with Medium Similarity, Resolve to Longest Email

| _census_id | _census_entity_resolution_lineage | EMAIL | FULL_NAME | PHONE |
| --- | --- | --- | --- | --- |
| cid_a209634ab08e48eeab7fd79f | [1, 2] | john@example.com | John Doe | 123-356-6891 |
| cid_b6447e18e2954750a369bb16 | [3] | jannet@example.com | Jannet Smith | 245-891-9012 |

Marked as Duplicate Dataset: Fuzzy Match on Full Name with Medium Similarity, Resolve to Longest Email

| _census_id | _census_parent_id | _census_email_similarity_with_parent | _census_has_duplicates | EMAIL | FULL_NAME | PHONE |
| --- | --- | --- | --- | --- | --- | --- |
| cid_a209634ab08e48eeab7fd79f | | 1.0 | true | john@example.com | John Doe | 123-356-6891 |
| cid_a209634ab08e48eeab7fd79f | | 0.96 | true | john@gmail.com | Johnny Doe | 123-356-6891 |
| cid_b6447e18e2954750a369bb16 | | 1.0 | false | jannet@example.com | Jannet Smith | 245-891-9012 |

Census IDs

All records in a resolved dataset will automatically be assigned a unique ID called the "Census ID" (_census_id). This is a Census-generated ID that represents a single resolved entity in your dataset. For instance, if 2 records in the base dataset get resolved into a single entity, they will both have the same _census_id value.

Census IDs remain "stable" as your dataset changes. For instance, if a group of records makes up the "John Smith" entity in your resolved dataset, that entity retains the same Census ID even as individual records that contribute to the entity enter or leave your base dataset.
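To make the idea of a shared entity ID concrete, the sketch below groups matched record pairs into entities with a union-find structure and gives every record in a group the same label. This is a generic technique and an assumption for illustration, not a description of how Census generates or stabilizes _census_id values; the function name and the "entity_" label format are hypothetical.

```python
def group_records(record_ids, matched_pairs):
    """Assign one shared label per connected group of matched records."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the group's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as root so the label is deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {rid: f"entity_{find(rid)}" for rid in record_ids}

# Records 1 and 2 were matched by some rule; record 3 matched nothing.
ids = group_records([1, 2, 3], [(1, 2)])
print(ids)  # {1: 'entity_1', 2: 'entity_1', 3: 'entity_3'}
```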

Census Lineage column

In Merged Entity Resolution Datasets, the _census_entity_resolution_lineage column is an array of the source dataset record IDs that contribute to the given golden record.

Merge Rules

Merge rules help you identify the winning record among the duplicates. The ID of the winning record becomes either the primary ID (that is, the column configured as the unique ID on your source dataset) when merged, or the _census_parent_id when unmerged, and is useful when syncing back to your business applications.

Census supports waterfall-structured rules: the first rule is evaluated first, then the next, and so on until a single record emerges as the winner.
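One way to picture a waterfall is as a sequence of tie-breakers. The sketch below assumes each rule scores a record and that ties fall through to the next rule; that reading, the function names, and the EMAIL and PHONE column names are assumptions for illustration, not Census's actual rule engine.

```python
from typing import Callable, Optional

Record = dict[str, Optional[str]]
# Each hypothetical rule scores a record; a higher score is preferred.
MergeRule = Callable[[Record], int]

def pick_winner(records: list[Record], rules: list[MergeRule]) -> Record:
    """Apply rules in order, keeping only the top scorers after each rule."""
    candidates = list(records)  # expects at least one record
    for rule in rules:
        if len(candidates) == 1:
            break  # a winner has emerged; later rules are not needed
        best = max(rule(r) for r in candidates)
        candidates = [r for r in candidates if rule(r) == best]
    # If records are still tied after every rule, fall back to the first one.
    return candidates[0]

def has_phone(record: Record) -> int:
    return 1 if record.get("PHONE") else 0

def email_length(record: Record) -> int:
    return len(record.get("EMAIL") or "")

duplicates = [
    {"EMAIL": "john@gmail.com", "FULL_NAME": "Johnny Doe", "PHONE": "123-356-6891"},
    {"EMAIL": "john@example.com", "FULL_NAME": "John Doe", "PHONE": "123-356-6891"},
]

# Rule 1 ties (both have a phone), so rule 2 (longest email) decides.
winner = pick_winner(duplicates, [has_phone, email_length])
print(winner["EMAIL"])  # john@example.com
```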

Column Overrides

Column Overrides help you override column values on the winning record. You can conditionally choose values for the final / resolved record.
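As a sketch of the idea, the example below starts from a winning record and replaces selected columns with values chosen across the whole duplicate group. The chooser functions (first_non_null, longest), the apply_overrides helper, and the column names are hypothetical; Census's own override options may differ.

```python
from typing import Callable, Optional

Record = dict[str, Optional[str]]
Chooser = Callable[[list[Optional[str]]], Optional[str]]

def first_non_null(values: list[Optional[str]]) -> Optional[str]:
    """Return the first value that is not NULL, if any."""
    return next((v for v in values if v is not None), None)

def longest(values: list[Optional[str]]) -> Optional[str]:
    """Return the longest non-NULL value, if any."""
    present = [v for v in values if v is not None]
    return max(present, key=len) if present else None

def apply_overrides(winner: Record, group: list[Record],
                    overrides: dict[str, Chooser]) -> Record:
    """Copy the winner, then override chosen columns using the whole group."""
    resolved = dict(winner)
    for column, chooser in overrides.items():
        resolved[column] = chooser([r.get(column) for r in group])
    return resolved

group = [
    {"EMAIL": "john@example.com", "FULL_NAME": "John Doe", "PHONE": None},
    {"EMAIL": "john@gmail.com", "FULL_NAME": "Johnny Doe", "PHONE": "123-356-6891"},
]

resolved = apply_overrides(
    winner=group[0],
    group=group,
    overrides={"PHONE": first_non_null, "FULL_NAME": longest},
)
print(resolved)
# {'EMAIL': 'john@example.com', 'FULL_NAME': 'Johnny Doe', 'PHONE': '123-356-6891'}
```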

Default Internal Variables

There are multiple internal variables that determine how Census will run the Entity Resolution algorithm. Below is a list of these parameters and their default values. Contact Census if you would like to change any of these.

bands: how many slices of the signature will be used to determine if records are likely to be identical; the default is 4
hashes_per_band: the number of hash values from the signature that will be used per band; the default is 8
use_first_char_blocking: only compare records that start with the same letter during fuzzy matches; default is false
use_lsh: enable Locality Sensitive Hashing to find fuzzy matches; turned on by default
use_sorted_neighborhood: enable fuzzy matching through sorting and comparing within a small window; turned on by default
use_deletion_key: enables comparing records to those with deleted characters from other records to catch variations with typos or missing letters; turned on by default
number of passes: how many times the algorithm will run on the records; default is once
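To show how bands and hashes_per_band interact, here is a minimal sketch of Locality Sensitive Hashing with MinHash signatures. With the defaults above, a signature would have 4 × 8 = 32 hash values, and two records become candidates for comparison only if all 8 values in at least one band agree. The shingle size, the hash function (blake2b), and every function name are assumptions made for this example; Census's internal implementation is not documented here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

BANDS = 4            # default from the list above
HASHES_PER_BAND = 8  # default from the list above

def shingles(text: str, size: int = 2) -> set[str]:
    """Overlapping character n-grams (the size is an assumed choice)."""
    text = text.lower()
    return {text[i:i + size] for i in range(max(len(text) - size + 1, 1))}

def seeded_hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token for a given seed."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(text: str) -> list[int]:
    """One minimum hash per seed over the text's shingles."""
    tokens = shingles(text)
    return [
        min(seeded_hash(seed, token) for token in tokens)
        for seed in range(BANDS * HASHES_PER_BAND)
    ]

def candidate_pairs(values: list[str]) -> set[tuple[int, int]]:
    """Pairs of indexes whose signatures agree on at least one full band."""
    buckets = defaultdict(set)
    for index, value in enumerate(values):
        signature = minhash_signature(value)
        for band in range(BANDS):
            start = band * HASHES_PER_BAND
            chunk = tuple(signature[start:start + HASHES_PER_BAND])
            buckets[(band, chunk)].add(index)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

pairs = candidate_pairs(["John Doe", "john doe", "Jannet Smith"])
print((0, 1) in pairs)  # True: identical shingles give identical signatures
```

More hashes per band make each band stricter (fewer candidates), while more bands give similar records more chances to collide. Only the candidate pairs would then go through the more expensive fuzzy comparison.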

Materialization

Entity Resolution generates a new dataset that is written back to Census Store, allowing you to query and sync from it. If you want to bring your own S3 bucket for Census Store, check out our docs here.

Entity Resolution is performed by internal Census syncs, whose statuses you can view on the dataset page. You can also change how often these syncs run and manually trigger refreshes.

FAQ

How does Census handle NULLs for fields that are used in matching and column overrides?

For matching, records with NULL values for that match rule will be ignored. For instance, if you have a rule to match records when they have the exact same email, we will not match 2 records that have NULL emails.
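The NULL behavior for matching can be sketched as follows: records are grouped by the value of the match column, and NULL values are skipped so they never match each other. The function name and sample records are hypothetical and only illustrate the rule described above.

```python
from collections import defaultdict

def exact_match_groups(records, column):
    """Group record indexes by an exact column value, ignoring NULLs."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        value = record.get(column)
        if value is None:
            continue  # NULL never matches anything, including another NULL
        groups[value].append(index)
    # Only groups with more than one record represent duplicates.
    return [members for members in groups.values() if len(members) > 1]

records = [
    {"email": "john@example.com"},
    {"email": None},
    {"email": "john@example.com"},
    {"email": None},
]
print(exact_match_groups(records, "email"))  # [[0, 2]]
```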

For merging and column overrides, you can be explicit about how you want NULLs to be treated in your configuration. For instance, you can set a column override on the winning record to select non-NULL values for specific columns.

Can I be alerted when an Entity Resolution refresh fails?

Yes, you can be notified if a refresh fails by subscribing to Sync Alerts for the syncs powering Entity Resolution, which are linked from the Deduplicate dataset status bar.
