Embedding Columns

Use LLMs to enrich your dataset with vector embeddings, unlocking use cases like semantic search and recommendations

Embedding Columns enable you to generate vector embeddings for each row in your dataset using an LLM provider (currently OpenAI). Select the column in your dataset you'd like to generate embeddings for, or use liquid templating to combine multiple columns. Census generates a text input for each row and automatically writes the resulting embedding back to your Embedding Column.
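To make the per-row flow concrete, here is a minimal sketch of the idea: a template is rendered against each row's columns to produce the text input that gets embedded. The `render_template` helper below is purely illustrative (a crude stand-in for real liquid rendering, which also supports filters, conditionals, and loops); it is not Census's implementation.

```python
def render_template(template: str, row: dict) -> str:
    """Crude stand-in for liquid rendering: substitutes {{ column }} tokens
    with the row's values. Real liquid templating is far more capable."""
    text = template
    for column, value in row.items():
        text = text.replace("{{ " + column + " }}", str(value))
    return text

# One row of a hypothetical products dataset.
row = {"title": "Trail Runner 2", "description": "Lightweight running shoe"}
template = "{{ title }}: {{ description }}"

# This rendered string is what would be sent to the embedding model.
text_input = render_template(template, row)
# text_input == "Trail Runner 2: Lightweight running shoe"
```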

At this time, Census only supports generating Embedding Columns using OpenAI. Reach out to us if you'd like to use other LLM providers, or check out our HTTP Request Enrichments to enrich your dataset with embeddings from any provider you have access to.

Example Use Cases for Embedding Columns

Once you've created Embedding Columns, use Census to sync the results into vector database destinations like Pinecone or turbopuffer to unlock:

  1. Semantic search - Find documents or products based on meaning rather than exact keyword matches, so searching "comfortable shoes for running" surfaces relevant results even if they don't contain those exact words.

  2. Recommendation systems - Suggest similar items by finding products, movies, or content with nearby embeddings in vector space, enabling "customers who liked this also liked" features.

  3. Duplicate detection - Identify near-duplicate records in databases by comparing embedding similarity, useful for deduplication and data cleaning.

  4. Clustering and categorization - Automatically group similar documents, support tickets, or customer feedback without manual tagging.

  5. Question answering - Power RAG (Retrieval Augmented Generation) systems that find relevant context from knowledge bases to answer questions accurately.

  6. Anomaly detection - Spot outliers by identifying data points whose embeddings are far from normal patterns, useful for fraud detection or quality control.

  7. Cross-lingual search - Enable searches across multiple languages since embeddings can capture semantic meaning independent of language.

  8. Personalization - Create user preference embeddings to match people with relevant content, products, or experiences based on their behavior patterns.
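Most of the use cases above reduce to the same primitive: comparing embeddings by cosine similarity and picking the nearest neighbors. Here is a self-contained sketch of semantic search over toy 2-d vectors (real embeddings have hundreds or thousands of dimensions, and a vector database like Pinecone or turbopuffer would do this lookup at scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-d "embeddings" standing in for real model output.
documents = {
    "comfortable running shoes": [0.9, 0.1],
    "ergonomic office chair": [0.1, 0.9],
}
query_embedding = [0.8, 0.2]  # embedding of the user's search query

# Semantic search: return the document whose embedding is closest in angle.
best_match = max(
    documents, key=lambda doc: cosine_similarity(query_embedding, documents[doc])
)
# best_match == "comfortable running shoes"
```

The same similarity function underpins recommendations (nearest items), duplicate detection (pairs above a threshold), and anomaly detection (points far from everything else).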

Prerequisites

  • Your dataset must have a Unique ID column

  • An API key for an LLM provider (OpenAI)

  • To create a new OpenAI API key, log into OpenAI and navigate to Dashboard / API keys and generate a new Project API Key.

How to create an Embedding Column

Step 1: Log into your Census account.

Step 2: Navigate to the Datasets tab by clicking on Datasets in the left navigation panel.

Step 3: Choose a dataset where you want to add a new AI-based column. Make sure the dataset has a Unique ID column assigned.

Step 4: Select Enrich & Enhance in the top-right corner, then choose Embedding and your preferred LLM provider.

Step 5: Connect to OpenAI using your API Key and click Next.

Step 6: Define the Input to generate embeddings for based on the column names in your dataset.

Step 7: Configure your embedding parameters in the Advanced Options section:

  • Model Type - you can select from the provided list of models for the selected LLM provider.

  • Dimensions - The number of dimensions the resulting output embeddings should have, if the chosen model supports configuring the number of dimensions.
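As context for the Dimensions option: OpenAI's text-embedding-3 models can natively return shortened embeddings via the API's `dimensions` parameter. If you ever shorten stored embeddings yourself instead, OpenAI's guidance is to truncate and then renormalize to unit length, since a bare prefix of a unit vector is no longer unit length. A minimal sketch of that post-processing (the example vector is made up):

```python
import math

def truncate_and_normalize(embedding, dims):
    """Keep the first `dims` components, then rescale to unit length so
    cosine/dot-product comparisons remain meaningful."""
    shortened = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in shortened))
    return [x / norm for x in shortened]

vec = [3.0, 4.0, 1.0]           # toy "embedding"
short = truncate_and_normalize(vec, 2)
# short == [0.6, 0.8], a unit-length 2-d vector
```

When the model supports the `dimensions` parameter directly, prefer it over manual truncation; the provider handles the normalization for you.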

Step 8: Hit the Create button and that's it! Census will generate an Embedding Column in your dataset.

This step can take several minutes. Behind the scenes, Census sets up OpenAI as a destination and runs a sync across all rows in the selected dataset.

Embedding Columns refresh every 6 hours and only process new rows.
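The incremental refresh can be pictured as a simple filter: on each run, only rows whose Unique ID has no embedding yet are sent to the provider. The sketch below is illustrative only (field names are hypothetical, and this is not Census's actual implementation):

```python
def rows_to_process(dataset_rows, embedded_ids):
    """Return only rows whose Unique ID has not been embedded yet."""
    return [row for row in dataset_rows if row["id"] not in embedded_ids]

# Hypothetical dataset rows and the set of IDs already embedded.
dataset = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
already_done = {1}

pending = rows_to_process(dataset, already_done)
# pending == [{"id": 2, "text": "b"}] -> only the new row is embedded
```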

Warehouse Writeback

The results generated by Embedding Columns are stored directly in your source warehouse. Census creates a new table within the Census schema, prefixed with DATASET_COLUMN_EMBED_, containing the Embedding Column.

This allows you to not only sync these Embedding Columns to your vector database destinations like Pinecone and turbopuffer via Census, but also build your own SQL queries on them within your warehouse to do things like mark similar records or detect outliers.
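For example, once the embedding table is queryable in your warehouse, you could fetch `(unique_id, embedding)` pairs and flag near-duplicate records by pairwise similarity. This is a sketch under assumed shapes (the record IDs and vectors below are made up; the `DATASET_COLUMN_EMBED_*` table name comes from the writeback convention above):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    )

def find_near_duplicates(rows, threshold=0.95):
    """rows: (unique_id, embedding) pairs, e.g. fetched from a
    DATASET_COLUMN_EMBED_* table. Returns ID pairs whose embeddings
    exceed the similarity threshold."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if cosine(rows[i][1], rows[j][1]) >= threshold:
                pairs.append((rows[i][0], rows[j][0]))
    return pairs

records = [
    ("row-1", [1.0, 0.0]),
    ("row-2", [0.99, 0.14]),   # nearly the same direction as row-1
    ("row-3", [0.0, 1.0]),
]
duplicates = find_near_duplicates(records)
# duplicates == [("row-1", "row-2")]
```

In practice you would do this directly in SQL with your warehouse's vector or array functions rather than pulling rows into Python; the O(n²) pairwise loop here is only for illustration.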

Rate Limits

Requests made by Census to the LLM provider (e.g., OpenAI) are subject to daily rate limits, which may cause the underlying sync to stall. Rate limits can typically be increased by upgrading your organization's tier with the LLM provider.

For more information, please see the rate limit policies for your specific LLM provider.
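When rate limits do bite, clients typically retry with exponential backoff. The sketch below shows the general pattern only; it is not how Census handles retries, and `RuntimeError` is a stand-in for whatever rate-limit exception your provider's SDK actually raises:

```python
import time

def call_with_backoff(request, max_retries=5, base_delay=1.0):
    """Retry `request` with exponential backoff on rate-limit errors.
    RuntimeError is a placeholder for the provider's real exception type."""
    for attempt in range(max_retries):
        try:
            return request()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Backing off exponentially gives the provider's per-minute or per-day quota time to recover instead of hammering it with immediate retries.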

Privacy and Security

Census only sends your prompt to the LLM provider. If your prompt includes specific dataset columns via liquid templates, these columns will be included as part of the prompt sent to the LLM provider. No other data is shared with the LLM.

Data sent via Census to the LLM provider is not used for training models. For more information, please refer to each LLM provider's data usage policies.

All requests made to the LLM provider are made through secure HTTPS channels, and only successful responses are saved to your dataset.
