Streaming Datasets
Streaming Datasets let you define the data you work with in streaming sources, which organize their data into topics where new messages are published over time.
Streaming datasets enable real-time data activation in Census, allowing you to sync data with low latency to your business applications. Unlike traditional batch-based datasets, streaming datasets continuously process data as it arrives, making them ideal for time-sensitive use cases.
Getting Started
Connect a Streaming Source
Before you create a Streaming Dataset, you'll need to connect a streaming source. See the documentation for your individual source for connection instructions.
Once your streaming source is connected, open Datasets and click New Dataset. Select Streaming Dataset and click Next.
Create a Streaming Dataset
To create a new streaming dataset:
Navigate to the Datasets section in Census
Click "New Dataset" and select "Streaming Dataset"
Choose your streaming connection (Kafka, Confluent Cloud, etc.)
Add a Sample Message (see below)
Add a Sample Message
In the code editor on the right side of the new dataset dialog, enter a sample JSON message for your new Streaming Dataset. Census will use the sample message you provide to infer the fields and data types of the messages in your dataset.
Census does not treat sample messages as customer data, and they are stored with the rest of your organization’s metadata in Census’s US-based control plane. Do not include real customer data or PII in your sample message.
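For example, you might paste a fictitious order event like the one below (every name and value here is a made-up placeholder, not real data):

```json
{
  "event_id": "evt_12345",
  "event_type": "order_completed",
  "occurred_at": "2024-01-15T09:30:00Z",
  "user": {
    "id": "user_001",
    "email": "jane.doe@example.com"
  },
  "order": {
    "total": 49.99,
    "currency": "USD"
  }
}
```

From a sample like this, Census would infer string fields (such as `event_type`), a numeric `order.total`, and the nested `user` and `order` object structures.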
When you are finished configuring your new Streaming Dataset, click Create Dataset.
Key Features of Streaming Datasets
Real-time Processing
Streaming datasets process data continuously as it arrives, enabling:
Ultra-low latency from data creation to activation
Immediate reactions to customer behavior or system events
Real-time personalization and engagement opportunities
Event-based Architecture
Unlike traditional datasets that operate on tables or views, streaming datasets work with events: each message published to a source topic is treated as an individual record and is processed as soon as it arrives.
On-the-fly Transformations with Liquid Templates
When activating streaming datasets through syncs, you can apply real-time transformations using Liquid Templates. This powerful feature allows you to:
Transform data structure and format on-the-fly without modifying the source
Apply conditional logic to determine what data gets sent to destinations
Format dates, numbers, and strings to match destination requirements
Combine multiple fields or extract specific parts of fields
Create dynamic content based on event properties
For example, you can use a Liquid Template to create a personalized message from a streaming event:
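The following is a minimal sketch, assuming the event exposes `event_type`, `first_name`, `cart_total`, and `occurred_at` fields, and that your sync exposes event fields on a `record` variable (adjust the field references to match your own events):

```liquid
{% if record['event_type'] == "cart_abandoned" %}
Hi {{ record['first_name'] | default: "there" }}, you left {{ record['cart_total'] }} in your cart on {{ record['occurred_at'] | date: "%B %d" }}. Come back and finish checking out!
{% endif %}
```

Here `default` supplies a fallback when a field is missing and `date` reformats the timestamp; both are standard Liquid filters.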
These transformations happen in real-time as events flow through Census, allowing you to customize and enrich your data without adding latency to your streaming pipeline.
Use Cases for Streaming Datasets
Live Syncs
Streaming datasets are designed to work with Census Live Syncs, which provide continuous data activation:
Always On - Census monitors your streaming source 24/7
Immediate Processing - Data is processed and synced as soon as it arrives
Efficient Resource Usage - Only changed data is processed and synced
Best Practices for Streaming Datasets
Working with streaming data is different from batch processing. Here are some tips to help you get the most out of your streaming datasets:
Include timestamps in your events to ensure proper ordering and enable time-based processing
Add unique identifiers to each event to prevent duplicate processing and enable idempotent operations (see the sketch after this list)
Filter events at the source when possible to reduce unnecessary data transfer and processing
Keep your events focused by including only the data you need for your use cases
Set up monitoring for stream lag and processing delays to catch issues early
Create alerts within Census to detect anomalies
Plan for schema evolution by designing flexible event structures that can accommodate new fields
Use Liquid Templates for on-the-fly transformations rather than modifying your event sources
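As a sketch that pulls several of these tips together, an event shaped for streaming might look like this (all field names are illustrative):

```json
{
  "event_id": "3f7c9a52-1b2e-4f0d-9c8a-7d6e5f4a3b2c",
  "occurred_at": "2024-01-15T09:30:00Z",
  "event_type": "subscription_upgraded",
  "user_id": "user_001",
  "properties": {
    "plan": "pro"
  }
}
```

The unique `event_id` enables deduplication and idempotent processing, `occurred_at` supports ordering and time-based logic, and the nested `properties` object keeps the payload focused while leaving room for new fields as your schema evolves.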
Remember that streaming datasets are designed for real-time use cases. If you don't need sub-minute latency, basic datasets with scheduled syncs may be a simpler and more cost-effective choice.
For more information on using streaming datasets with Live Syncs, see our Live Syncs documentation.