Embedding Columns
Use LLMs to enrich your dataset with vector embeddings, unlocking use cases like semantic search and recommendations
Embedding Columns enable you to generate vector embeddings for each row in your dataset using an LLM provider (currently OpenAI). Select the column in your dataset you'd like to generate embeddings for, or use liquid templating to combine several columns. This produces a text input for each row, and the resulting embedding is automatically written back to your Embedding Column.
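For intuition, the sketch below shows roughly what happens for a single row, using the openai Python SDK. The column names, the combined input string, and the model name are illustrative assumptions; in Census you express the combination with a liquid template and the writeback happens automatically.

```python
# Illustrative sketch of what an Embedding Column does for one row (assumed
# column names and model; Census handles this for you via the liquid template).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

row = {"title": "Trail running shoes", "description": "Lightweight and cushioned"}

# Equivalent of a liquid template that combines two columns into one text input
input_text = f"{row['title']} - {row['description']}"

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=input_text,
)
vector = response.data[0].embedding  # list of floats written back to the Embedding Column
print(len(vector))
```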
Example Use Cases for Embedding Columns
Once you've created Embedding Columns, use Census to sync the results into vector database destinations like Pinecone or turbopuffer to unlock:
Semantic search - Find documents or products based on meaning rather than exact keyword matches, so searching "comfortable shoes for running" surfaces relevant results even if they don't contain those exact words (see the similarity sketch after this list).
Recommendation systems - Suggest similar items by finding products, movies, or content with nearby embeddings in vector space, enabling "customers who liked this also liked" features.
Duplicate detection - Identify near-duplicate records in databases by comparing embedding similarity, useful for deduplication and data cleaning.
Clustering and categorization - Automatically group similar documents, support tickets, or customer feedback without manual tagging.
Question answering - Power RAG (Retrieval Augmented Generation) systems that find relevant context from knowledge bases to answer questions accurately.
Anomaly detection - Spot outliers by identifying data points whose embeddings are far from normal patterns, useful for fraud detection or quality control.
Cross-lingual search - Enable searches across multiple languages since embeddings can capture semantic meaning independent of language.
Personalization - Create user preference embeddings to match people with relevant content, products, or experiences based on their behavior patterns.
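To make the similarity idea behind several of these use cases concrete, here is a minimal sketch comparing a query embedding against document embeddings with cosine similarity. The vectors are toy three-dimensional examples; real embeddings from your Embedding Column have hundreds or thousands of dimensions and would typically be queried through your vector database.

```python
# Toy sketch of cosine similarity, the core operation behind semantic search,
# recommendations, and duplicate detection. Vectors here are made up; real
# embeddings come from your Embedding Column or vector database.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.10, 0.30, 0.90])
documents = {
    "comfortable running shoes": np.array([0.12, 0.28, 0.88]),
    "cast iron skillet": np.array([0.90, 0.10, 0.05]),
}

# Rank documents by similarity to the query; the closest match wins even if
# it shares no keywords with the query text.
ranked = sorted(documents.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
print(ranked[0][0])  # -> "comfortable running shoes"
```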
Prerequisites
Your dataset should have a Unique ID column
An API key to connect an LLM provider (OpenAI)
To create a new OpenAI API key, log into OpenAI, navigate to Dashboard / API keys, and generate a new Project API Key.
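If you want to confirm the key works before entering it in Census, a quick check like the sketch below (using the openai Python SDK; the key placeholder and model name are only examples) will fail fast on an invalid key.

```python
# Optional sanity check for a new OpenAI Project API Key before connecting it
# in Census. The key placeholder and model name are illustrative.
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # paste your Project API Key here

response = client.embeddings.create(model="text-embedding-3-small", input="hello world")
print(len(response.data[0].embedding))  # 1536 for text-embedding-3-small by default
```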
How to create an Embedding Column
Step 1: Log into your Census account.
Step 2: Navigate to the Datasets tab by clicking on Datasets in the left navigation panel.
Step 3: Choose the dataset where you want to add a new AI-based column. Make sure the dataset has a Unique ID column assigned.
Step 4: Select Enrich & Enhance in the top-right corner, then choose Embedding and your preferred LLM provider.

Step 5: Connect to OpenAI using your API Key and click Next.

Step 6: Define the Input to generate embeddings for, using the column names in your dataset (combine multiple columns with liquid templating if needed).

Step 7: Configure your Embedding parameters in the Advanced Options section:
Model Type - select from the list of models available for the chosen LLM provider.
Dimensions - the number of dimensions the output embeddings should have, if the chosen model supports a configurable dimension count (see the sketch below).
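As a rough illustration of what the Dimensions option controls, OpenAI's text-embedding-3 models accept a dimensions parameter that shortens the returned vector. The sketch below calls the openai Python SDK directly; the model and the value of 256 are just examples, and in Census you simply set the option in Advanced Options.

```python
# Illustration of a configurable embedding size: OpenAI's text-embedding-3
# models accept a `dimensions` parameter (256 here is only an example).
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="comfortable shoes for running",
    dimensions=256,  # smaller vectors are cheaper to store and faster to search
)
print(len(response.data[0].embedding))  # -> 256
```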
Step 8: Hit the Create button and that's it! Census will generate an Embedding Column in your dataset.
This step can take several minutes. Behind the scenes, Census sets up OpenAI as a destination and runs a sync across all rows in the selected dataset.
Embedding Columns refresh every 6 hours and only process new rows.
Warehouse Writeback
The results generated by Embedding Columns are stored directly in your source warehouse. Census creates a new table within the Census schema, prefixed with DATASET_COLUMN_EMBED_, containing the Embedding Column.
This lets you not only sync these Embedding Columns to vector database destinations like Pinecone and turbopuffer via Census, but also run your own SQL queries against them in your warehouse to do things like flag similar records or detect outliers.
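For example, the sketch below reads the writeback table and flags likely near-duplicates by cosine similarity. The schema, table, and column names, the connection string, and the assumption that embeddings are stored as JSON arrays are all placeholders; adapt them to your warehouse (you could express the same logic directly in SQL).

```python
# Sketch: load an Embedding Column writeback table and flag near-duplicate rows.
# Connection string, schema/table/column names, and the JSON storage format of
# the embedding are assumptions; adjust for your warehouse.
import json

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/analytics")  # placeholder
df = pd.read_sql(
    'SELECT id, embedding FROM census."DATASET_COLUMN_EMBED_products_description"',
    engine,
)

# Parse and L2-normalize the vectors so a dot product equals cosine similarity.
vectors = np.array([json.loads(e) for e in df["embedding"]])
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

similarity = vectors @ vectors.T
np.fill_diagonal(similarity, 0.0)

# Report each pair of rows whose embeddings are more than 95% similar.
pairs = np.argwhere(similarity > 0.95)
for i, j in pairs[pairs[:, 0] < pairs[:, 1]]:
    print(df["id"].iloc[i], "~", df["id"].iloc[j], round(float(similarity[i, j]), 3))
```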
Embedding Columns are currently supported on Snowflake, Redshift, BigQuery, Databricks, and Postgres with more warehouses coming soon.
Rate Limits
For more information, please see the rate limit policies for your specific LLM provider.
Privacy and Security
Census sends only your prompt to the LLM provider. If your prompt references specific dataset columns via liquid templates, those columns are included in the prompt sent to the provider. No other data is shared with the LLM.
Data sent via Census to the LLM provider is not used for training models. For more information, please refer to each LLM provider's data usage policies.
All requests made to the LLM provider are made through secure HTTPS channels, and only successful responses are saved to your dataset.