Amazon S3

This page describes how to add AWS S3 as a source to Census.

The Amazon S3 source lets Census treat CSV files like tables in other databases, without having to load them anywhere first. To support this, the S3 source includes a basic indexing process that manages the data available to Census.

Supported File Structure

Census will automatically scan and index your S3 bucket starting at the provided prefix. CSV files must:

  • Be uncompressed, UTF-8 encoded, comma-delimited, and use the .csv file extension

  • Include a header row (Census uses the headers as source column names)

  • Represent null values with the string #N/A

  • Wrap values that may contain commas in double quotes, and escape double quotes inside a value with a second double quote. For example, to send the value Hello, "World", the row in the CSV would look like othercolumn,"Hello, ""World""",othercolumn (a complete example file follows this list).
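
For instance, a file that follows all of these rules might look like the following (the column names and values are purely illustrative):

    id,name,company,plan
    1,Alice,"Acme, Inc.",pro
    2,Bob,"Widgets ""R"" Us",#N/A
    3,Carla,Initech,free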

Census treats each folder path as a unique data source and considers the file name to be a version of the data within that path. When showing available data sources, the Census UI will only list folder paths under the provided prefix that contain CSV files. During syncs, Census will always select the newest data in the selected folder path based on timestamp. S3 sources use the Basic Sync Engine to store an external snapshot of source data.
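
Here's an illustrative layout for the example bucket and prefix used on this page (the folder and file names are hypothetical). Each folder under the prefix is a separate data source, and the newest file within each folder is the version Census reads:

    s3://census-docs-example/data/          <- configured prefix
      customers/                            <- data source "customers"
        customers-2024-01-01.csv
        customers-2024-02-01.csv            <- newest file; used on the next sync
      orders/                               <- data source "orders"
        orders.csv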

When updating data in S3, you can either replace the existing file or add a newer version. Census will use the new data and perform an incremental sync if possible (keeping the previous version in S3 is not necessary to enable incremental updates).
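
Either update style is just a normal S3 upload. A minimal sketch using the AWS CLI, reusing the hypothetical bucket, prefix, and file names from above:

    # Option 1: overwrite the existing file in place
    aws s3 cp customers.csv s3://census-docs-example/data/customers/customers.csv

    # Option 2: upload a newer file alongside the old one; Census reads the newest by timestamp
    aws s3 cp customers-2024-03-01.csv s3://census-docs-example/data/customers/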

Creating an S3 Connection

Census uses role-based authentication to connect to your bucket, as recommended by AWS. This involves a three-step "handshake" between your AWS account and Census.

These examples use the AWS CLI, but you can use the AWS Console, API, Terraform, or other tools instead to accomplish the same setup tasks.

  1. Create an IAM role in your AWS account that provides read-only access to your S3 bucket and prefix for Census's AWS Account ID (341876425553). Throughout this example, we'll assume that your bucket name is census-docs-example, your region is us-east-1, and your prefix is data/:

    aws iam create-role \
      --role-name CensusReadOnlyToS3 \
      --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Principal": {"AWS": "arn:aws:iam::341876425553:root"},
          "Action": "sts:AssumeRole"
        }]
      }'

    This will return a JSON document containing an Arn for the role you just created - something like "arn:aws:iam::<your-account-id>:role/CensusReadOnlyToS3" (note that it contains your own AWS account ID, not Census's). Keep track of this role ARN because you will need it throughout the rest of the setup process.

  2. Grant your newly-created role read-only access to your S3 bucket. We'll do this using an inline policy, but there are many ways to manage permissions in AWS IAM, so choose the appropriate technique for your organization's needs.

    aws iam put-role-policy \
      --role-name CensusReadOnlyToS3 \
      --policy-name CensusReadOnlyToS3 \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": [
            "s3:ListBucket",
            "s3:GetObject",
            "s3:GetObjectVersion",
            "s3:GetBucketLocation"
          ],
          "Resource": [
            "arn:aws:s3:::census-docs-example",
            "arn:aws:s3:::census-docs-example/*"
          ]
        }]
      }'

    It's possible to further restrict this role to limit Census's access by object prefix (a prefix-scoped policy sketch follows these steps), though we recommend using a dedicated bucket for sharing data with Census to keep roles and permissions simple.

  3. Navigate to the Census Sources tab and click "New Source", then select "S3" in the menu.

  4. Provide the Region, Bucket Name, Role ARN, and Prefix to Census, and click "Connect".

  5. Census will generate a unique external ID (${CENSUS_EXTERNAL_ID}) and display it on the next screen. This ID helps secure the connection between your role and the Census AWS account. Update your role's trust policy so that it continues to trust the Census AWS account but now requires this external ID on every AssumeRole call:

    aws iam update-assume-role-policy \
      --role-name CensusReadOnlyToS3 \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Principal": {"AWS": "arn:aws:iam::341876425553:root"},
          "Action": "sts:AssumeRole",
          "Condition": {"StringEquals": {"sts:ExternalId": "${CENSUS_EXTERNAL_ID}"}}
        }]
      }'
  6. Click the "Test" button in the Census UI to verify that the connection has been configured successfully.
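
As noted in step 2, the role can be scoped down to an object prefix rather than the whole bucket. A sketch of what that inline policy could look like, assuming the census-docs-example bucket and data/ prefix from these examples (adjust the names to match your own setup):

    aws iam put-role-policy \
      --role-name CensusReadOnlyToS3 \
      --policy-name CensusReadOnlyToS3 \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::census-docs-example",
            "Condition": {"StringLike": {"s3:prefix": ["data/*"]}}
          },
          {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectVersion"],
            "Resource": "arn:aws:s3:::census-docs-example/data/*"
          },
          {
            "Effect": "Allow",
            "Action": "s3:GetBucketLocation",
            "Resource": "arn:aws:s3:::census-docs-example"
          }
        ]
      }'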
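
If you'd like to double-check the role from the AWS side before (or after) clicking "Test", the following AWS CLI commands print the trust policy, which should name the Census account as principal and include the external ID condition, and the inline read-only policy attached in step 2:

    # Show the trust (assume-role) policy on the role
    aws iam get-role \
      --role-name CensusReadOnlyToS3 \
      --query 'Role.AssumeRolePolicyDocument'

    # Show the inline policy granting read-only access to the bucket
    aws iam get-role-policy \
      --role-name CensusReadOnlyToS3 \
      --policy-name CensusReadOnlyToS3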

CSV Processing Modes

We currently support two modes of operation when reading CSVs from your S3 bucket:

  • Most Recent (default) - We only consider the most recent file in the configured S3 prefix & folder group. That single file is interpreted as the entire dataset. This supports S3 use cases where a single file is being overwritten, or where a new version of a file is being added over time.

  • Merge All - We take every file in the configured S3 prefix & folder group and merge them together into a single dataset. The dataset is interpreted as every row from every file found in the configured prefix. This supports S3 use cases where the source dataset has been split across multiple files; as additional files are added, they are merged in as well. To delete data from the dataset, the row must be removed from a file or the file deleted entirely. Note: In this mode, all CSVs must have exactly the same column names, in the same order (see the illustration below).
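
To make the difference concrete, here's a hypothetical folder containing two CSVs with identical headers, and how each mode would interpret it:

    data/events/
      events-2024-01.csv   (older; contains rows A and B)
      events-2024-02.csv   (newer; contains row C)

    Most Recent -> dataset is row C only (the newest file wins)
    Merge All   -> dataset is rows A, B, and C (every row from every file)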

Notes

  • Because S3 does not support SQL queries, you cannot create SQL, dbt, or Looker models on S3 sources, nor can you use S3 sources with Census Entities or Segments.

  • S3 Sources do not currently support the sync tester.

Need help connecting to S3?

Contact us via support@getcensus.com or start a conversation with us via the in-app chat.
