
What It Does

The Enrichment Pipeline transforms raw external market data into structured, scored, provenanced product attributes. It solves the cold-start problem: even before users vote or interact, agents have meaningful product intelligence to work with. The pipeline ingests from multiple data source types, extracts attributes via AI-powered methods, resolves entities against your product catalog, normalizes everything into a canonical schema, and aggregates signals into queryable baselines at three different levels.

Why It Matters

Agents need more than product titles and prices to make informed recommendations. The enrichment pipeline provides:
  • Rich product attributes — texture, effectiveness, sentiment, ingredient analysis, certifications, and more — extracted from real-world data
  • Provenance on every signal — agents and users always know where a data point came from and how confident it is
  • Cold-start intelligence — new platforms have useful product data from day one, without waiting for user interactions
  • Multi-level baselines — understand a product’s attributes in context: how it compares to its category and the broader market
  • Continuous discovery — Podium automatically discovers and enriches new products from multiple sources, keeping your catalog fresh
As users interact (vote in campaigns, rank products, purchase), USER_DECLARED signals naturally dominate the intent score. Market-derived enrichment data gracefully steps back to a supplementary validation role.

Schema Overview

The pipeline uses five core data models. These form a reusable blueprint — you can adapt this schema for any product vertical.

EnrichmentSource

Defines where data comes from. Each source has a type and a JSON configuration:
| Field | Type | Purpose |
| --- | --- | --- |
| name | string | Human-readable source name |
| sourceType | enum | STRUCTURED_API, WEB_PAGE, COMMUNITY, DISCOVERY |
| config | JSON | Source-specific configuration |
| lastRunAt | datetime | Last ingestion timestamp |
| lastRunStatus | string | SUCCESS or FAILED with error details |

RawEnrichmentRecord

Immutable audit log of every piece of ingested data. Raw records are never modified after creation — they serve as the provenance trail.
| Field | Type | Purpose |
| --- | --- | --- |
| sourceId | FK | Which source produced this record |
| externalId | string | Barcode, URL, or post ID for deduplication |
| rawData | JSON | Exact response as received |
| processed | boolean | Whether signals have been extracted |

EnrichmentSignal

The core output — individual attribute-value pairs extracted from raw records, resolved to products, and tagged with confidence and provenance.
| Field | Type | Purpose |
| --- | --- | --- |
| sourceId | FK | Source that produced this signal |
| rawRecordId | FK | Raw record it was extracted from |
| productId | FK (nullable) | Resolved product, or null if unresolved |
| externalName | string | Product name as seen in the source |
| attributeType | string | Canonical attribute type (see taxonomy below) |
| attributeValue | string | Normalized value |
| confidence | float | 0–1 confidence score |
| signalSource | enum | MARKET_DERIVED for enrichment data |
| metadata | JSON | Evidence: text snippets, URLs, ratings, context |
A unique constraint on [sourceId, externalName, attributeType, attributeValue] ensures idempotent re-processing.

AttributeBaseline

Aggregated top values per product (or category, or market) per attribute type. This is what the agentic product feed queries.
| Field | Type | Purpose |
| --- | --- | --- |
| productId | FK (nullable) | Product this baseline describes (null for category/market baselines) |
| scope | enum | PRODUCT, CATEGORY, or MARKET |
| attributeType | string | Attribute type (e.g., texture, sentiment) |
| topValues | JSON | Array of { value, score, signalCount } |
| signalCount | int | Total signals contributing to this baseline |
| signalSource | enum | BASELINE_INFERRED |
| computedAt | datetime | When this baseline was computed |
| validUntil | datetime | TTL — recompute after this timestamp |

ProductEmbedding

Vector embeddings for semantic product matching during entity resolution.
| Field | Type | Purpose |
| --- | --- | --- |
| productId | FK | Product this embedding represents |
| embedding | vector(1536) | AI-generated embedding for similarity matching |

Multi-Level Baselines

Baselines are computed at three scopes, giving agents and developers graduated context for every product attribute:
| Scope | What It Tells You | Example |
| --- | --- | --- |
| Product | This specific product's attributes | "CeraVe Moisturizing Cream has texture: creamy (0.92), sentiment: positive (0.88)" |
| Category | Norms for this product category | "Moisturizers typically have texture: lightweight (0.75), key_ingredient: hyaluronic acid (0.68)" |
| Market | Broad market trends | "Across all skincare, sentiment: positive (0.71), certification: cruelty-free (0.54)" |
This lets agents make contextual comparisons: “This serum’s effectiveness score is 0.91, which is significantly above the category average of 0.67.”
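A comparison like this can be sketched against the AttributeBaseline shape from the schema table above; the helper itself is illustrative, not an SDK function:

```typescript
// Illustrative: compare a product-scope baseline against its category
// baseline for the same attribute type. Field names follow the
// AttributeBaseline model; compareToCategory is a hypothetical helper.
interface Baseline {
  scope: 'PRODUCT' | 'CATEGORY' | 'MARKET';
  attributeType: string;
  topValues: { value: string; score: number; signalCount: number }[];
}

function compareToCategory(product: Baseline, category: Baseline): string {
  const p = product.topValues[0];
  const c = category.topValues[0];
  const delta = p.score - c.score;
  const direction = delta >= 0 ? 'above' : 'below';
  return (
    `${product.attributeType}: ${p.score.toFixed(2)} ` +
    `(${Math.abs(delta).toFixed(2)} ${direction} category average ${c.score.toFixed(2)})`
  );
}
```

With a product score of 0.91 against a category average of 0.67, this yields the kind of contextual statement shown above.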

Scoring Formula

score = average_confidence × log1p(signal_count)
This balances signal quality (confidence) with signal volume (count), using logarithmic scaling to prevent high-volume sources from dominating.
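A minimal TypeScript sketch of this formula (the Signal shape is illustrative; only its confidence field matters here):

```typescript
// score = average_confidence × log1p(signal_count)
interface Signal {
  confidence: number; // 0–1
}

function baselineScore(signals: Signal[]): number {
  if (signals.length === 0) return 0;
  const avgConfidence =
    signals.reduce((sum, s) => sum + s.confidence, 0) / signals.length;
  // log1p dampens volume: ~10 signals scale by ~2.4, ~1000 by ~6.9,
  // so a high-volume source cannot drown out higher-confidence signals.
  return avgConfidence * Math.log1p(signals.length);
}
```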

Recomputation Modes

| Mode | Trigger | Scope |
| --- | --- | --- |
| Single product | New enrichment data ingested | One product's baselines |
| Category | Scheduled recompute | All products in a category |
| Stale baselines | Scheduled cron job | All baselines past their validUntil |
| Full recompute | Admin action | Every product with any enrichment signals |

Attribute Taxonomy

The pipeline tracks 14 attribute types: 12 extracted from source data, plus brand_tier and price_tier, which are computed from brand positioning and price. This taxonomy is extensible — add new types by updating the canonical type mapping in the normalization layer.

| Attribute Type | Example Values | Typical Sources |
| --- | --- | --- |
| category | face, serums, lip products | APIs, web data |
| effectiveness | highly effective, moderate results | Reviews, community |
| texture | lightweight, creamy, gel | Reviews, descriptions |
| sentiment | positive, negative, mixed | Reviews, community |
| key_ingredient | hyaluronic acid, retinol, niacinamide | APIs, descriptions |
| skin_type_suitability | oily, dry, sensitive, combination | Reviews, APIs |
| finish | matte, dewy, natural | Reviews, descriptions |
| scent | fragrance-free, floral, citrus | Reviews, APIs |
| longevity | 8+ hours, all day, fades quickly | Reviews |
| certification | organic, cruelty-free, vegan | APIs, labels |
| packaging | pump bottle, tube, jar | Reviews, descriptions |
| color | sheer, full coverage, tinted | Reviews, descriptions |
| brand_tier | 1 (prestige), 2 (enthusiast), 3 (mainstream), 4 (mass market) | Computed from brand positioning |
| price_tier | budget, mid-range, luxury | Computed from price |

Domain Taxonomy

Products are organized into a two-level domain:subcategory system. This taxonomy drives domain-specific scoring weights, expert personas, and brand tier calibration across the platform.
| Domain | Subcategories |
| --- | --- |
| beauty | moisturizer, serum, cleanser, sunscreen, toner, mask, exfoliant, eye cream, lip, makeup, hair care, body care, tools, fragrance |
| wellness | supplements, probiotics, protein, greens, vitamins, minerals, adaptogens, nootropics, collagen, sleep, longevity, fitness |
| fashion | clothing, shoes, accessories, bags, jewelry, watches |
| home | decor, kitchen, bedding, bath, storage, lighting, furniture |
Each product is tagged with a domain and subcategory at ingestion time. The domain determines which scoring weights, expert personas, and brand tier thresholds are applied during recommendation ranking.

Data Sources

Structured APIs

Structured product databases provide high-quality, deterministic data. Extraction is rule-based (no AI cost):
  • Certifications from label tags (organic, cruelty-free, vegan) — confidence 0.95
  • Categories from category tags — confidence 0.9
  • Key ingredients from ingredient text via pattern matching — confidence 0.9
  • Scent profile from fragrance-free/unscented patterns — confidence 0.85
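The rule-based path can be sketched like this. The input field names (labels, ingredientText) and the ingredient list are illustrative assumptions; the patterns and confidence values mirror the list above:

```typescript
// Deterministic extraction from structured API fields — no AI call.
interface RuleSignal {
  attributeType: string;
  attributeValue: string;
  confidence: number;
}

function extractFromStructured(record: {
  labels: string[];
  ingredientText: string;
}): RuleSignal[] {
  const signals: RuleSignal[] = [];

  // Certifications from label tags — confidence 0.95
  for (const label of record.labels) {
    if (/^(organic|cruelty-free|vegan)$/i.test(label)) {
      signals.push({ attributeType: 'certification', attributeValue: label.toLowerCase(), confidence: 0.95 });
    }
  }

  // Key ingredients via pattern matching — confidence 0.9
  // (a real pattern list would be far longer than this sample)
  for (const ingredient of ['hyaluronic acid', 'retinol', 'niacinamide']) {
    if (record.ingredientText.toLowerCase().includes(ingredient)) {
      signals.push({ attributeType: 'key_ingredient', attributeValue: ingredient, confidence: 0.9 });
    }
  }

  // Scent profile from fragrance-free/unscented patterns — confidence 0.85
  if (/fragrance[- ]free|unscented/i.test(record.ingredientText)) {
    signals.push({ attributeType: 'scent', attributeValue: 'fragrance-free', confidence: 0.85 });
  }

  return signals;
}
```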
URL Resolution. The enrichment pipeline resolves product URLs through multiple strategies:
  1. Structured product APIs — retailer URL resolution via direct API lookups
  2. Web search — discovering retailer URLs for products without direct API coverage
  3. Affiliate platform resolution — resolving intermediate redirect URLs (e.g., affiliate tracking links) to real retailer pages
This ensures every product in the catalog has a valid, purchasable URL regardless of how it was originally discovered.

Web Intelligence

Podium ingests product data from the web, extracting structured attributes via configurable schemas. The pipeline processes:
  • Product descriptions → AI-powered attribute extraction
  • Individual reviews → AI extraction with per-review confidence
  • Product images → stored as image signals for catalog display

Community Sources

Community forums and discussion platforms provide unfiltered consumer sentiment:
  • Searches configured communities with product-name queries
  • Fetches qualifying posts and comments
  • Extracts attributes from real user discussions
  • Provides a sentiment layer that complements structured data

Product Discovery

Podium continuously discovers new products relevant to your vertical from multiple sources. Discovered products are automatically:
  1. Matched against your existing catalog via entity resolution
  2. Added as new ProductCatalogItem records if they’re novel
  3. Enriched with the full attribute extraction pipeline
This means your companion’s product catalog grows organically without manual curation.

Creator Storefronts

Source type: CREATOR_STOREFRONT. Podium discovers creators from affiliate platforms, ingests their curated product collections, and enriches the catalog with their selections. Each discovered creator becomes a CreatorPersona with:
  • Resolved product IDs linked to your catalog
  • Creator metadata (handle, display name, platform, avatar)
  • Product count and resolution status
Creator storefronts are a high-signal data source — a creator’s curation reflects expert taste and audience trust, which feeds into recommendation quality.

Wellness Vertical

The pipeline covers wellness products (supplements, probiotics, adaptogens, etc.) with a specialized category taxonomy and domain-specific enrichment. Wellness brand tiers are calibrated differently from beauty:
| Tier | Examples | Positioning |
| --- | --- | --- |
| 1 (prestige) | Thorne, Pure Encapsulations | Clinical-grade, practitioner-recommended |
| 2 (enthusiast) | Seed, AG1 | Science-forward, premium DTC |
| 3 (mainstream) | Nature Made | Widely available, trusted |
| 4 (mass market) | Kirkland | Value-focused, bulk |
Wellness enrichment emphasizes formulation data — bioavailability, clinical evidence, third-party testing, and ingredient sourcing — rather than the texture and finish attributes that dominate beauty.

Attribute Extraction

Rule-Based Path

For structured sources with well-defined fields, extraction uses deterministic pattern matching. Zero AI cost, high consistency.

AI-Powered Path

For unstructured text (reviews, posts, descriptions), extraction uses AI with a structured prompt:
  • Target: 12 attribute types with expected value formats
  • Input: product name + text content + text type
  • Output: JSON array of { attributeType, attributeValue, confidence, evidence }
  • Hard confidence floor: signals below 0.6 are discarded
  • Batch processing available for multiple texts per raw record
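The post-extraction step can be sketched as follows. Obtaining rawJson from the model call is out of scope here; the signal shape simply mirrors the output format above:

```typescript
// Parse the model's JSON array output and apply the hard confidence
// floor: signals below 0.6 are discarded before storage.
interface ExtractedSignal {
  attributeType: string;
  attributeValue: string;
  confidence: number;
  evidence: string;
}

const CONFIDENCE_FLOOR = 0.6;

function parseExtraction(rawJson: string): ExtractedSignal[] {
  const parsed = JSON.parse(rawJson) as ExtractedSignal[];
  return parsed.filter((s) => s.confidence >= CONFIDENCE_FLOOR);
}
```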

Entity Resolution

External product names (“CeraVe Moisturizing Cream 16oz”) must be matched to products in your catalog. The resolver uses a two-tier strategy:

Tier 1: Fuzzy Text Match

  • Compares external names against all product names using trigram similarity
  • Similarity > 0.75 → immediate match
  • Similarity 0.4–0.75 → candidate for Tier 2
  • Cost: zero external API calls

Tier 2: Semantic Embedding Match

  • Triggered only when Tier 1 doesn’t find a strong match
  • Cosine similarity search against pre-computed product embeddings
  • Threshold: 0.82 cosine similarity for acceptance
When resolution fails, the EnrichmentSignal is stored with productId: null. Unresolved signals aren’t lost — they can be retroactively linked when entity resolution improves or when the matching product is added to the catalog.
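The two-tier flow can be sketched in TypeScript. This uses a plain Jaccard-over-trigrams similarity (a production system would more likely use a database trigram extension) and stubs out the Tier 2 embedding search; thresholds match the ones above:

```typescript
// Tier 1: trigram similarity between an external name and catalog names.
function trigrams(s: string): Set<string> {
  const padded = `  ${s.toLowerCase()} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) grams.add(padded.slice(i, i + 3));
  return grams;
}

function trigramSimilarity(a: string, b: string): number {
  const ga = trigrams(a);
  const gb = trigrams(b);
  let shared = 0;
  for (const g of ga) if (gb.has(g)) shared++;
  return shared / (ga.size + gb.size - shared); // Jaccard over trigram sets
}

function resolve(
  externalName: string,
  catalog: { id: string; name: string }[],
): { productId: string | null; needsTier2: boolean } {
  let best: { id: string | null; sim: number } = { id: null, sim: 0 };
  for (const p of catalog) {
    const sim = trigramSimilarity(externalName, p.name);
    if (sim > best.sim) best = { id: p.id, sim };
  }
  if (best.sim > 0.75) return { productId: best.id, needsTier2: false }; // immediate match
  if (best.sim >= 0.4) return { productId: null, needsTier2: true };     // escalate to embedding search
  return { productId: null, needsTier2: false };                          // store unresolved
}
```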

Normalization

All extracted attributes pass through a multi-phase normalization:
  1. Type canonicalization: Maps variant names to canonical types (e.g., fragrance → scent, skin_type → skin_type_suitability, shade → color)
  2. Value canonicalization: Maps common aliases to standard values (e.g., light weight → lightweight, cruelty free → cruelty-free)
  3. Confidence clamping: Forces all values to [0, 1]
  4. Deduplication: Removes exact type:value duplicates within a batch
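The four phases above can be sketched as one pass; the alias maps here contain only the examples from this page, not the full production mapping:

```typescript
const TYPE_ALIASES: Record<string, string> = {
  fragrance: 'scent',
  skin_type: 'skin_type_suitability',
  shade: 'color',
};

const VALUE_ALIASES: Record<string, string> = {
  'light weight': 'lightweight',
  'cruelty free': 'cruelty-free',
};

interface RawAttr { attributeType: string; attributeValue: string; confidence: number }

function normalize(batch: RawAttr[]): RawAttr[] {
  const seen = new Set<string>();
  const out: RawAttr[] = [];
  for (const attr of batch) {
    const type = TYPE_ALIASES[attr.attributeType] ?? attr.attributeType;     // 1. type canonicalization
    const value = VALUE_ALIASES[attr.attributeValue] ?? attr.attributeValue; // 2. value canonicalization
    const confidence = Math.min(1, Math.max(0, attr.confidence));            // 3. clamp to [0, 1]
    const key = `${type}:${value}`;                                          // 4. dedupe within batch
    if (seen.has(key)) continue;
    seen.add(key);
    out.push({ attributeType: type, attributeValue: value, confidence });
  }
  return out;
}
```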

Consuming Enrichment Data

After the pipeline processes enrichment data, agents query the agentic product feed to access enriched product attributes. The feed includes baselines for each product with attribute type, top values, and confidence scores.
import { createPodiumClient } from '@podium-sdk/node-sdk';
const client = createPodiumClient({ apiKey: process.env.PODIUM_API_KEY });

const feed = await client.agentic.listProductsFeed({
  limit: 20,
  categories: 'skincare',
});

for (const product of feed.products) {
  console.log(product.name, product.intentScore);
  if (product.attributes?.baselines) {
    for (const baseline of product.attributes.baselines) {
      console.log(`  ${baseline.attributeType}: ${baseline.topValues[0]?.value} (${baseline.scope})`);
    }
  }
}
The companion recommendations API leverages enrichment data to filter and rank products:
const recs = await client.companion.listRecommendations({
  userId: 'user_123',
  count: 5,
  category: 'skincare',
});

Async Pipeline Orchestration

The pipeline runs asynchronously via Podium’s event system:
| Queue | Trigger | Action |
| --- | --- | --- |
| enrichment-crawl | Admin endpoint or scheduled job | Runs data source ingestion, stores raw records |
| enrichment-extract | Published after ingestion completes | AI extraction from raw records, creates signals |
| enrichment-baseline | Admin endpoint or scheduled cron | Recomputes baselines from accumulated signals |
All handlers are idempotent — safe to retry on failure. Raw records track a processed flag to prevent duplicate extraction.
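The processed-flag guard can be sketched like this; the Store interface and its method names are hypothetical stand-ins, not Podium internals:

```typescript
// An idempotent extract handler: retrying after a failure never
// produces duplicate signals because processed records are skipped.
interface RawRecord { id: string; processed: boolean; rawData: unknown }

interface Store {
  getRawRecord(id: string): Promise<RawRecord | null>;
  saveSignals(recordId: string, signals: unknown[]): Promise<void>;
  markProcessed(id: string): Promise<void>;
}

async function handleExtract(
  store: Store,
  recordId: string,
  extract: (raw: unknown) => Promise<unknown[]>,
): Promise<'skipped' | 'extracted'> {
  const record = await store.getRawRecord(recordId);
  if (!record || record.processed) return 'skipped'; // retry-safe guard
  const signals = await extract(record.rawData);
  await store.saveSignals(record.id, signals);
  await store.markProcessed(record.id);
  return 'extracted';
}
```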

Admin Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /admin/enrichment/run | Trigger ingestion for a specific source |
| POST | /admin/enrichment/baseline/recompute | Trigger baseline computation |
| GET | /admin/enrichment/status | Dashboard: all sources with last run status and counts |
| POST | /admin/enrichment/sources | Register a new enrichment source |
| GET | /admin/enrichment/sources | List all configured sources |
| POST | /admin/enrichment/crawl-domain | Ingest products from a domain |
| POST | /admin/enrichment/map-domain | Discover product URLs on a domain |
| POST | /admin/enrichment/discover | Trigger product discovery |
Trigger ingestion for a specific source:
curl -X POST https://api.podium.build/api/v1/admin/enrichment/run \
  -H "Authorization: Bearer $PODIUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "sourceId": "clsource_sephora" }'
Check pipeline status:
curl https://api.podium.build/api/v1/admin/enrichment/status \
  -H "Authorization: Bearer $PODIUM_API_KEY"

Building Your Own Pipeline

The enrichment architecture is designed to be extensible across verticals. To adapt it for your domain:
  1. Define your attribute taxonomy — what product attributes matter in your vertical?
  2. Register data sources — structured APIs for your domain, review platforms, community forums
  3. Configure extraction — rule-based for structured sources, AI-powered for unstructured text
  4. Generate product embeddings — pre-compute embeddings for your catalog to enable entity resolution
  5. Set baseline thresholds — minimum signal count, confidence floor, and TTL for your data freshness requirements
The schema patterns (source → raw record → signal → baseline) and the two-lane provenance model (user-declared vs market-derived) are vertical-agnostic. The taxonomy and data sources are where you customize. A database schema that follows the enrichment pattern:
CREATE TABLE enrichment_source (
  id         TEXT PRIMARY KEY,
  name       TEXT NOT NULL,
  source_type TEXT NOT NULL,  -- STRUCTURED_API, WEB_PAGE, COMMUNITY, DISCOVERY
  config     JSONB,
  last_run_at TIMESTAMPTZ
);

CREATE TABLE enrichment_signal (
  id              TEXT PRIMARY KEY,
  source_id       TEXT REFERENCES enrichment_source(id),
  product_id      TEXT,  -- nullable for unresolved signals
  external_name   TEXT NOT NULL,  -- product name as seen in the source
  attribute_type  TEXT NOT NULL,
  attribute_value TEXT NOT NULL,
  confidence      REAL NOT NULL CHECK (confidence BETWEEN 0 AND 1),
  signal_source   TEXT DEFAULT 'MARKET_DERIVED',
  metadata        JSONB,
  -- matches the idempotency key described for EnrichmentSignal above
  UNIQUE (source_id, external_name, attribute_type, attribute_value)
);

CREATE TABLE attribute_baseline (
  id             TEXT PRIMARY KEY,
  product_id     TEXT,  -- nullable for category/market baselines
  scope          TEXT NOT NULL,  -- PRODUCT, CATEGORY, MARKET
  attribute_type TEXT NOT NULL,
  top_values     JSONB NOT NULL,
  signal_count   INTEGER NOT NULL,
  computed_at    TIMESTAMPTZ DEFAULT NOW(),
  valid_until    TIMESTAMPTZ NOT NULL,
  UNIQUE (product_id, scope, attribute_type)
);