Entity Resolution


Last updated 17 days ago


Business teams depend on trusted data to execute their growth initiatives, but that data is often messy. One core cause is duplicate records for the same entity: Users, Organizations, Customers, Products, or other objects.

These duplicate entities lead to chaos in CRM tools, inefficient marketing campaigns, incorrect analysis and wasted resources.

Quickstart — Using Entity Resolution

The Entity Resolution flow can be kicked off from two places:

  1. From the Datasets tab, select the dataset you would like to deduplicate from the list, and click the "Deduplicate" button at the top of the list.

  2. From a given Dataset, click "Deduplicate" in the top header to kick off the flow for that dataset.

Entity Resolution

Entity Resolution helps you de-duplicate and associate records across all your data sources. Some key use cases include:

  • Removing duplicate records from your CRM applications (Salesforce, HubSpot, etc.).

  • Creating golden customer records from across various data sources.

  • Identity Resolution: resolving anonymous users into real users.

  • Associating different users into a common unit, for example creating a household record from individual users.

Entity Resolution helps you build out your Golden Record — a single source of truth for your business applications.

Census Entity Resolution - De-duplication and Association

Core Concepts

Match Rules

Some of the most common rules include:

  • Matching users based on email address, mailing address or customer IDs

  • Matching companies based on their domain, company name or location

You can build complex rules by combining individual rules with AND and OR operators.
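To make the AND/OR composition concrete, here is a minimal Python sketch. It is not Census's implementation or API: the helper names (`exact`, `all_of`, `any_of`) and the case-insensitive normalization are assumptions made purely for illustration.

```python
from typing import Callable, Dict

Record = Dict[str, str]
Rule = Callable[[Record, Record], bool]


def exact(column: str) -> Rule:
    """Hypothetical rule: both records hold the same non-empty value.

    Lowercasing and trimming are assumptions for this sketch.
    """
    def rule(a: Record, b: Record) -> bool:
        left = (a.get(column) or "").strip().lower()
        right = (b.get(column) or "").strip().lower()
        return left != "" and left == right
    return rule


def all_of(*rules: Rule) -> Rule:
    """AND: every rule must match."""
    return lambda a, b: all(r(a, b) for r in rules)


def any_of(*rules: Rule) -> Rule:
    """OR: at least one rule must match."""
    return lambda a, b: any(r(a, b) for r in rules)


# email OR (full_name AND phone)
is_duplicate = any_of(exact("email"), all_of(exact("full_name"), exact("phone")))

a = {"email": "john@example.com", "full_name": "John Doe", "phone": "123-356-6891"}
b = {"email": "john@gmail.com", "full_name": "john doe", "phone": "123-356-6891"}
c = {"email": "jannet@example.com", "full_name": "Jannet Smith", "phone": "245-891-9012"}

print(is_duplicate(a, b))  # True: emails differ, but name and phone both match
print(is_duplicate(a, c))  # False: no rule matches
```

Expressing each rule as a function of two records keeps the combinators independent of the matching method, so a fuzzy rule could be dropped in alongside exact ones.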

Fuzzy Match

Jaro-Winkler similarity compares two strings to determine how similar they are, taking into account variations such as typos, order differences, and abbreviations. This algorithm is especially effective in handling messy real-world data and is commonly used in places like the US Census 🙂.

We provide three customizable thresholds for determining similarity:

| Confidence Level | Matched | Not Matched | Explanation |
| --- | --- | --- | --- |
| Low (0.85) | John Smith vs. Jonathon Smythe; ABC Technologies Inc. vs. ACB Tech | John Smith vs. Johnathan Smithson; ABC Technologies Inc. vs. XYZ Tech | The names have some phonetic similarity, but the spelling and length differences result in low confidence. |
| Medium (0.9) | Acme Corporation vs. Acme Corp; The Baker’s Delight vs. Baker’s Delight | Acme Corporation vs. Apex Corp; The Baker’s Delight vs. The Delight | The names are similar, but abbreviations and structural differences lower the match confidence to medium. |
| High (0.95) | Census Data Inc. vs. Census Data; Incorporated International Business Machines vs. IBM International Business Machines | Census Data Inc. vs. Census Analytics Inc.; International Business Machines vs. Global Business Machines | The strings are nearly identical, with minor differences, leading to a high confidence match. |

When building your match rules, you can choose what matching method you want to use for each column separately. You can mix exact match and fuzzy match rules in the same configuration.
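For readers who want to see how a Jaro-Winkler score relates to the thresholds above, the following is a minimal sketch of the textbook algorithm in Python. It is an illustration only: Census's exact normalization and implementation are not documented here, so the lowercasing step, the prefix scale of 0.1, and the helper name `is_fuzzy_match` are all assumptions, and scores may differ from what Census reports.

```python
# Thresholds taken from the table above.
LOW, MEDIUM, HIGH = 0.85, 0.90, 0.95


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def is_fuzzy_match(a: str, b: str, threshold: float) -> bool:
    # Lowercasing and trimming are assumptions for this sketch.
    return jaro_winkler(a.strip().lower(), b.strip().lower()) >= threshold


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(is_fuzzy_match("MARTHA", "MARHTA", HIGH))    # True
print(is_fuzzy_match("DWAYNE", "DUANE", LOW))      # False (score is 0.84)
```

Because the score is a pure function of the two strings, the same pair always produces the same result, which is what makes threshold-based matching predictable.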

Output Type

Census allows you to either merge duplicate records into one or mark them as duplicates. In marked-as-duplicate mode, Census adds columns that record each row's parent ID and whether it has duplicates. In addition, when fuzzy match rules are applied in this mode, a similarity score column (scored against the parent) is added for every fuzzy match rule.

Given the source dataset below, the examples that follow show what the deduplicated dataset looks like in each mode.

Source Dataset:

| ID | EMAIL | FULL_NAME | PHONE |
| --- | --- | --- | --- |
| 1 | john@example.com | John Doe | 123-356-6891 |
| 2 | john@gmail.com | Johnny Doe | 123-356-6891 |
| 3 | jannet@example.com | Jannet Smith | 245-891-9012 |

Merged Dataset: Fuzzy Match on Full Name with Medium Similarity, Resolve to Longest Email

| ID | EMAIL | FULL_NAME | PHONE |
| --- | --- | --- | --- |
| 1 | john@example.com | John Doe | 123-356-6891 |
| 3 | jannet@example.com | Jannet Smith | 245-891-9012 |

Marked as Duplicate Dataset: Fuzzy Match on Full Name with Medium Similarity, Resolve to Longest Email

| ID | _census_parent_id | _census_email_similarity_with_parent | _census_has_duplicates | EMAIL | FULL_NAME | PHONE |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1.0 | true | john@example.com | John Doe | 123-356-6891 |
| 2 | 1 | 0.96 | true | john@gmail.com | Johnny Doe | 123-356-6891 |
| 3 | 3 | 1.0 | false | jannet@example.com | Jannet Smith | 245-891-9012 |
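The difference between the two output types can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Census builds the dataset: the `parent_of` mapping is assumed to be the already-computed result of the match and merge rules, the function names are hypothetical, and the similarity score column is left out for brevity.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]

source: List[Row] = [
    {"ID": 1, "EMAIL": "john@example.com", "FULL_NAME": "John Doe", "PHONE": "123-356-6891"},
    {"ID": 2, "EMAIL": "john@gmail.com", "FULL_NAME": "Johnny Doe", "PHONE": "123-356-6891"},
    {"ID": 3, "EMAIL": "jannet@example.com", "FULL_NAME": "Jannet Smith", "PHONE": "245-891-9012"},
]

# Assumed input: record ID -> ID of the winning (parent) record.
parent_of = {1: 1, 2: 1, 3: 3}


def merged(rows: List[Row], parents: Dict[int, int]) -> List[Row]:
    """Merge mode: keep only the winning record of each group."""
    return [row for row in rows if parents[row["ID"]] == row["ID"]]


def marked(rows: List[Row], parents: Dict[int, int]) -> List[Row]:
    """Mark mode: keep every row and annotate it with its parent."""
    group_sizes = Counter(parents.values())
    result = []
    for row in rows:
        parent = parents[row["ID"]]
        annotated = {
            "ID": row["ID"],
            "_census_parent_id": parent,
            "_census_has_duplicates": group_sizes[parent] > 1,
        }
        annotated.update({k: v for k, v in row.items() if k != "ID"})
        result.append(annotated)
    return result


print([row["ID"] for row in merged(source, parent_of)])  # [1, 3]
for row in marked(source, parent_of):
    print(row["ID"], row["_census_parent_id"], row["_census_has_duplicates"])
```

Merge mode drops the losing rows, while mark mode preserves them, which is useful when the downstream application needs to know which existing records to update or archive.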

Merge Rules

Merge rules help you identify the winning record amongst the duplicates. The ID of the winning record becomes either the primary ID (aka the column configured as the unique ID on your source dataset) when merged, or the _census_parent_id when unmerged, and is useful while syncing back to your business applications.

Census supports waterfall-style merge rules: rules are evaluated in order, starting with the first, and each following rule is applied only until a single winning record emerges.
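One way to picture the waterfall is as a sequence of tie-breakers. The sketch below is a hypothetical Python illustration, not Census's implementation: each rule is modeled as a scoring function where the highest score wins, and the rule names (`longest_email`, `lowest_id`) are assumptions chosen to mirror the "Resolve to Longest Email" example.

```python
from typing import Any, Callable, Dict, List, Sequence

Record = Dict[str, Any]
MergeRule = Callable[[Record], Any]  # higher score wins


def pick_winner(duplicates: Sequence[Record], rules: Sequence[MergeRule]) -> Record:
    """Apply rules in order; later rules only break ties left by earlier ones.

    Assumes `duplicates` is non-empty. If every rule still leaves a tie,
    the first remaining candidate is returned.
    """
    candidates: List[Record] = list(duplicates)
    for rule in rules:
        best = max(rule(r) for r in candidates)
        candidates = [r for r in candidates if rule(r) == best]
        if len(candidates) == 1:
            break
    return candidates[0]


def longest_email(record: Record) -> int:
    return len(record["EMAIL"])


def lowest_id(record: Record) -> int:
    return -record["ID"]  # negate so the smallest ID scores highest


group = [
    {"ID": 1, "EMAIL": "john@example.com"},
    {"ID": 2, "EMAIL": "john@gmail.com"},
]
print(pick_winner(group, [longest_email, lowest_id])["ID"])  # 1

tied = [
    {"ID": 5, "EMAIL": "a@x.com"},
    {"ID": 4, "EMAIL": "b@x.com"},
]
print(pick_winner(tied, [longest_email, lowest_id])["ID"])  # 4
```

In the second group both emails have the same length, so the first rule cannot decide and the second rule picks the record with the lowest ID.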

Column Overrides

Column Overrides help you override column values on the winning record. You can conditionally choose values for the final / resolved record.
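As a rough illustration of a conditional override, the Python sketch below resolves individual columns after the winning record has been chosen. Everything in it is hypothetical: the helper names (`apply_overrides`, `first_non_empty`, `longest`) and the specific conditions are assumptions, and the sample winner has an empty phone number only to make the override visible.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Chooser = Callable[[Record, List[Record]], Any]


def first_non_empty(column: str) -> Chooser:
    """Keep the winner's value unless it is empty; then fall back to a duplicate's."""
    def choose(winner: Record, duplicates: List[Record]) -> Any:
        if winner.get(column):
            return winner[column]
        for record in duplicates:
            if record.get(column):
                return record[column]
        return None
    return choose


def longest(column: str) -> Chooser:
    """Pick the longest string value found anywhere in the duplicate group."""
    def choose(winner: Record, duplicates: List[Record]) -> Any:
        return max((r.get(column) or "" for r in duplicates), key=len)
    return choose


def apply_overrides(winner: Record, duplicates: List[Record],
                    overrides: Dict[str, Chooser]) -> Record:
    resolved = dict(winner)  # leave the original winner untouched
    for column, choose in overrides.items():
        resolved[column] = choose(winner, duplicates)
    return resolved


winner = {"ID": 1, "EMAIL": "john@example.com", "FULL_NAME": "John Doe", "PHONE": ""}
other = {"ID": 2, "EMAIL": "john@gmail.com", "FULL_NAME": "Johnny Doe", "PHONE": "123-356-6891"}

resolved = apply_overrides(
    winner,
    [winner, other],
    {"PHONE": first_non_empty("PHONE"), "FULL_NAME": longest("FULL_NAME")},
)
print(resolved)
```

The resolved record keeps the winner's ID and email but takes the phone number and the longer name from the duplicate.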

Default Internal Variables

There are multiple internal variables that determine how Census will run the Entity Resolution algorithm. Below is a list of these parameters and their default values. Contact Census if you would like to change any of these.

  • bands: how many slices of the signature are used to determine whether records are likely to be identical; the default is 4

  • hashes_per_band: the number of hash values from the signature used per band; the default is 8

  • use_first_char_blocking: only compare records that start with the same letter during fuzzy matches; default is false

  • use_sorted_neighborhood: enable fuzzy matching through sorting and comparing within a small window; turned on by default

  • use_deletion_key: enables comparing records to those with deleted characters from other records to catch variations with typos or missing letters; turned on by default

  • number of passes: how many times the algorithm runs over the records; default is once
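The bands and hashes_per_band parameters read like the standard banding scheme used with MinHash signatures in Locality Sensitive Hashing. The Python sketch below shows that general technique with the default values (4 bands of 8 hashes, so a 32-value signature). It is an assumption-laden illustration, not Census's implementation: the character-bigram shingles, the SHA-256-based hash family, and all function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

BANDS = 4              # default from the list above
HASHES_PER_BAND = 8    # default from the list above
NUM_HASHES = BANDS * HASHES_PER_BAND


def shingles(text: str, size: int = 2) -> Set[str]:
    """Lowercased character n-grams (an assumption for this sketch)."""
    text = text.lower()
    return {text[i:i + size] for i in range(max(len(text) - size + 1, 1))}


def seeded_hash(seed: int, token: str) -> int:
    """One member of a hash family, derived from SHA-256 for determinism."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str) -> List[int]:
    tokens = shingles(text)
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES)]


def candidate_pairs(values: List[str]) -> Set[Tuple[int, int]]:
    """Pairs of indexes whose signatures agree on at least one whole band."""
    buckets: Dict[Tuple[int, Tuple[int, ...]], Set[int]] = defaultdict(set)
    for index, value in enumerate(values):
        signature = minhash_signature(value)
        for band in range(BANDS):
            start = band * HASHES_PER_BAND
            key = (band, tuple(signature[start:start + HASHES_PER_BAND]))
            buckets[key].add(index)
    pairs: Set[Tuple[int, int]] = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


values = ["Acme Corporation", "acme corporation", "Zendesk"]
print(candidate_pairs(values))
```

Only candidate pairs would then go through the more expensive similarity comparison. In this scheme, more hashes per band makes each band stricter (fewer candidates), while more bands gives each pair more chances to collide (more candidates).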

Materialization

Entity resolution is performed by internal Census syncs, whose statuses you can view on the dataset page. You can also change how often these syncs run and trigger refreshes manually.

Census supports Deterministic Entity Resolution with Fuzzy Match at the column level. Deterministic Entity Resolution uses a human-defined, rules-based approach to identify duplicate records or associated users.

Match Rules are the criteria we use to identify duplicate or associated records. You can define these rules with a number of possible operations, including Exact Match and Fuzzy Match.

Fuzzy matching is a technique used to identify and match similar strings that may not be identical. In Census, we want to ensure you always get a predictable and deterministic match, so we use Jaro-Winkler similarity.

  • use_lsh: enable Locality Sensitive Hashing to find fuzzy matches; turned on by default

Entity Resolution generates a new dataset that is written back to Census Store, allowing you to query and sync from it. If you want to bring your own S3 bucket for Census Store, check out our docs here.