Embedding Columns

Use LLMs to enrich your dataset with vector embeddings, unlocking use cases like semantic search and recommendations

Embedding Columns enable you to generate vector embeddings for each row in your dataset using an LLM provider (currently OpenAI). Select the column in your dataset you'd like to generate embeddings for, or use liquid templating to combine multiple columns. Census generates a text input for each row and automatically writes the resulting embedding back to your Embedding Column.
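To make the per-row flow concrete, here is a minimal sketch of the idea: a template is rendered against each row's columns to produce the text input that gets embedded. The `render_template` helper below is purely illustrative (a crude stand-in for real liquid rendering, which also supports filters, conditionals, and loops); it is not Census's implementation.

```python
def render_template(template: str, row: dict) -> str:
    """Crude stand-in for liquid rendering: substitutes {{ column }} tokens
    with the row's values. Real liquid templating is far more capable."""
    text = template
    for column, value in row.items():
        text = text.replace("{{ " + column + " }}", str(value))
    return text

# One row of a hypothetical products dataset.
row = {"title": "Trail Runner 2", "description": "Lightweight running shoe"}
template = "{{ title }}: {{ description }}"

# This rendered string is what would be sent to the embedding model.
text_input = render_template(template, row)
# text_input == "Trail Runner 2: Lightweight running shoe"
```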

At this time, Census only supports generating Embedding Columns using OpenAI. Reach out to us if you'd like to use other LLM providers, or check out our HTTP Request Enrichments to enrich your dataset with embeddings from any provider you have access to.

Example Use Cases for Embedding Columns

Once you've created Embedding Columns, use Census to sync the results into vector database destinations like Pinecone or turbopuffer to unlock:

  1. Semantic search - Find documents or products based on meaning rather than exact keyword matches, so searching "comfortable shoes for running" surfaces relevant results even if they don't contain those exact words.

  2. Recommendation systems - Suggest similar items by finding products, movies, or content with nearby embeddings in vector space, enabling "customers who liked this also liked" features.

  3. Duplicate detection - Identify near-duplicate records in databases by comparing embedding similarity, useful for deduplication and data cleaning.

  4. Clustering and categorization - Automatically group similar documents, support tickets, or customer feedback without manual tagging.

  5. Question answering - Power RAG (Retrieval Augmented Generation) systems that find relevant context from knowledge bases to answer questions accurately.

  6. Anomaly detection - Spot outliers by identifying data points whose embeddings are far from normal patterns, useful for fraud detection or quality control.

  7. Cross-lingual search - Enable searches across multiple languages since embeddings can capture semantic meaning independent of language.

  8. Personalization - Create user preference embeddings to match people with relevant content, products, or experiences based on their behavior patterns.
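Most of the use cases above reduce to the same primitive: comparing embeddings by cosine similarity and picking the nearest neighbors. Here is a self-contained sketch of semantic search over toy 2-d vectors (real embeddings have hundreds or thousands of dimensions, and a vector database like Pinecone or turbopuffer would do this lookup at scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-d "embeddings" standing in for real model output.
documents = {
    "comfortable running shoes": [0.9, 0.1],
    "ergonomic office chair": [0.1, 0.9],
}
query_embedding = [0.8, 0.2]  # embedding of the user's search query

# Semantic search: return the document whose embedding is closest in angle.
best_match = max(
    documents, key=lambda doc: cosine_similarity(query_embedding, documents[doc])
)
# best_match == "comfortable running shoes"
```

The same similarity function underpins recommendations (nearest items), duplicate detection (pairs above a threshold), and anomaly detection (points far from everything else).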

Prerequisites

  • Your dataset must have a Unique ID column

  • An API key for an LLM provider (OpenAI)

  • To create a new OpenAI API key, log into OpenAI and navigate to Dashboard / API keys and generate a new Project API Key.

How to create an Embedding Column

Step 1: Log into your Census account.

Step 2: Navigate to the Datasets tab by clicking on Datasets in the left navigation panel.

Step 3: Choose a dataset where you want to add a new AI-based column. Make sure the dataset has a Unique ID column assigned.

Step 4: Select Enrich & Enhance in the top-right corner, then choose Embedding and your preferred LLM provider.

Step 5: Connect to OpenAI using your API Key and click Next.

Step 6: Define the Input to generate embeddings for based on the column names in your dataset.

Step 7: Configure your embedding parameters in the Advanced Options section:

  • Model Type - you can select from the provided list of models for the selected LLM provider.

  • Dimensions - The number of dimensions the resulting output embeddings should have, if the chosen model supports configuring the number of dimensions.
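As context for the Dimensions option: OpenAI's text-embedding-3 models can natively return shortened embeddings via the API's `dimensions` parameter. If you ever shorten stored embeddings yourself instead, OpenAI's guidance is to truncate and then renormalize to unit length, since a bare prefix of a unit vector is no longer unit length. A minimal sketch of that post-processing (the example vector is made up):

```python
import math

def truncate_and_normalize(embedding, dims):
    """Keep the first `dims` components, then rescale to unit length so
    cosine/dot-product comparisons remain meaningful."""
    shortened = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in shortened))
    return [x / norm for x in shortened]

vec = [3.0, 4.0, 1.0]           # toy "embedding"
short = truncate_and_normalize(vec, 2)
# short == [0.6, 0.8], a unit-length 2-d vector
```

When the model supports the `dimensions` parameter directly, prefer it over manual truncation; the provider handles the normalization for you.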

Step 8: Hit the Create button and that's it! Census will generate an Embedding Column in your dataset.

This step can take several minutes. Behind the scenes, Census sets up OpenAI as a destination and runs a sync across all rows in the selected dataset.

Embedding Columns refresh every 6 hours and only process new rows.
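The incremental refresh can be pictured as a simple filter: on each run, only rows whose Unique ID has no embedding yet are sent to the provider. The sketch below is illustrative only (field names are hypothetical, and this is not Census's actual implementation):

```python
def rows_to_process(dataset_rows, embedded_ids):
    """Return only rows whose Unique ID has not been embedded yet."""
    return [row for row in dataset_rows if row["id"] not in embedded_ids]

# Hypothetical dataset rows and the set of IDs already embedded.
dataset = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
already_done = {1}

pending = rows_to_process(dataset, already_done)
# pending == [{"id": 2, "text": "b"}] -> only the new row is embedded
```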

Warehouse Writeback

The results generated by Embedding Columns are stored directly in your source warehouse. Census creates a new table within the Census schema, prefixed with DATASET_COLUMN_EMBED_, containing the Embedding Column.

This allows you to not only sync these Embedding Columns to your vector database destinations like Pinecone and turbopuffer via Census, but also build your own SQL queries on them within your warehouse to do things like mark similar records or detect outliers.
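For example, once the embedding table is queryable in your warehouse, you could fetch `(unique_id, embedding)` pairs and flag near-duplicate records by pairwise similarity. This is a sketch under assumed shapes (the record IDs and vectors below are made up; the `DATASET_COLUMN_EMBED_*` table name comes from the writeback convention above):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    )

def find_near_duplicates(rows, threshold=0.95):
    """rows: (unique_id, embedding) pairs, e.g. fetched from a
    DATASET_COLUMN_EMBED_* table. Returns ID pairs whose embeddings
    exceed the similarity threshold."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if cosine(rows[i][1], rows[j][1]) >= threshold:
                pairs.append((rows[i][0], rows[j][0]))
    return pairs

records = [
    ("row-1", [1.0, 0.0]),
    ("row-2", [0.99, 0.14]),   # nearly the same direction as row-1
    ("row-3", [0.0, 1.0]),
]
duplicates = find_near_duplicates(records)
# duplicates == [("row-1", "row-2")]
```

In practice you would do this directly in SQL with your warehouse's vector or array functions rather than pulling rows into Python; the O(n²) pairwise loop here is only for illustration.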

Rate Limits

Requests made by Census to the LLM provider (e.g., OpenAI) are subject to daily rate limits, which may cause the underlying sync to stall. Rate limits can typically be increased by upgrading your organization's tier with the LLM provider.

For more information, please see the rate limit policies for your specific LLM provider.
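When rate limits do bite, clients typically retry with exponential backoff. The sketch below shows the general pattern only; it is not how Census handles retries, and `RuntimeError` is a stand-in for whatever rate-limit exception your provider's SDK actually raises:

```python
import time

def call_with_backoff(request, max_retries=5, base_delay=1.0):
    """Retry `request` with exponential backoff on rate-limit errors.
    RuntimeError is a placeholder for the provider's real exception type."""
    for attempt in range(max_retries):
        try:
            return request()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Backing off exponentially gives the provider's per-minute or per-day quota time to recover instead of hammering it with immediate retries.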

Privacy and Security

Census only sends your prompt to the LLM provider. If your prompt includes specific dataset columns via liquid templates, these columns will be included as part of the prompt sent to the LLM provider. No other data is shared with the LLM.

Data sent via Census to the LLM provider is not used for training models. For more information, please refer to each LLM provider's data usage policies.

All requests made to the LLM provider are made through secure HTTPS channels, and only successful responses are saved to your dataset.
