Semantic Mapping
The ESFC Glossary uses AI-powered semantic mapping to connect terms across 10 food and LCA vocabularies, enabling cross-source term matching and relationship discovery.
Overview
Semantic mapping creates relationships between terms from different sources (FoodEx2, Hestia, Ecoinvent, etc.) using a 4-stage matching cascade that combines contextual matching, exact matching, synonym detection, and AI embeddings.
Key Features:
- AI-Powered Matching - OpenAI and Google AI embeddings
- 4-Stage Cascade - Multiple matching strategies with fallbacks
- Quality Validation - Confidence scoring and match quality analysis
- Interactive Debugging - Real-time match visualization
- Zero Configuration - Falls back to mock mode without API keys
Matching Strategy
4-Stage Cascade
The semantic matching system uses a cascading approach, trying increasingly sophisticated methods:
Stage 1: Contextual Matching (Highest Confidence)
↓ (if no match)
Stage 2: Exact Name Matching
↓ (if no match)
Stage 3: Synonym Matching
↓ (if no match)
Stage 4: Semantic Embedding Search
Each stage has different confidence levels and use cases.
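Taken together, the cascade can be driven by a small orchestrator that tries each stage in order and returns the first match. The sketch below is illustrative: the Term and Match shapes are assumptions consistent with the examples on this page, and the stage functions are the ones defined in the sections that follow.

```typescript
// Shapes assumed throughout the examples on this page (illustrative, not the
// project's actual type definitions)
interface Term {
  id: string
  name: string
  category: string
}

interface Match {
  sourceId: string
  targetId: string
  confidence: number
  method: 'contextual' | 'exact' | 'synonym' | 'semantic'
  matchedSynonym?: string
  similarity?: number
}

// Try each stage in order; stop at the first stage that produces a match
async function findBestMatch(
  term: Term,
  targetTerms: Term[],
  provider: 'openai' | 'google' = 'openai'
): Promise<Match | null> {
  return (
    contextualMatch(term, targetTerms) ??
    exactMatch(term, targetTerms) ??
    synonymMatch(term, targetTerms) ??
    (await semanticMatch(term, targetTerms, provider))
  )
}
```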
Stage 1: Contextual Matching
Method: Combines name + category for matching
Confidence: 0.95 - 1.0 (Very High)
Use Case: When both name and category context align
Algorithm:
function contextualMatch(term: Term, targetTerms: Term[]): Match | null {
const searchKey = `${term.name} ${term.category}`.toLowerCase()
for (const target of targetTerms) {
const targetKey = `${target.name} ${target.category}`.toLowerCase()
if (searchKey === targetKey) {
return {
sourceId: term.id,
targetId: target.id,
confidence: 1.0,
method: 'contextual'
}
}
}
return null
}
Example:
Source: FoodEx2 "Apple" (category: "Fruits")
Target: Hestia "Apple" (category: "Inputs & Products")
Match: ✅ Contextual (confidence: 1.0)
Stage 2: Exact Name Matching
Method: Exact string match on normalized names
Confidence: 0.85 - 0.95 (High)
Use Case: Identical names across different sources
Normalization:
- Lowercase conversion
- Trim whitespace
- Remove special characters
- Handle plurals (optional)
Algorithm:
function exactMatch(term: Term, targetTerms: Term[]): Match | null {
const normalized = normalizeName(term.name)
for (const target of targetTerms) {
if (normalized === normalizeName(target.name)) {
return {
sourceId: term.id,
targetId: target.id,
confidence: 0.9,
method: 'exact'
}
}
}
return null
}
function normalizeName(name: string): string {
return name.toLowerCase().trim().replace(/[^a-z0-9\s]/g, '')
}
Example:
Source: "Wheat grain"
Target: "wheat grain"
Match: ✅ Exact (confidence: 0.9)
Stage 3: Synonym Matching
Method: Built-in synonym dictionary
Confidence: 0.70 - 0.85 (Medium-High)
Use Case: Known alternative names and common variations
Synonym Dictionary:
const SYNONYMS = {
'beef': ['cattle meat', 'bovine meat'],
'pork': ['pig meat', 'swine meat'],
'milk': ['dairy milk', 'cow milk'],
'wheat': ['common wheat', 'bread wheat'],
'rice': ['paddy rice', 'rice grain'],
'CO2': ['carbon dioxide', 'co2 emission'],
'CH4': ['methane', 'methane emission'],
'N2O': ['nitrous oxide', 'n2o emission']
}
Algorithm:
function synonymMatch(term: Term, targetTerms: Term[]): Match | null {
const termSynonyms = getSynonyms(term.name)
for (const target of targetTerms) {
const targetNormalized = normalizeName(target.name)
for (const synonym of termSynonyms) {
if (normalizeName(synonym) === targetNormalized) {
return {
sourceId: term.id,
targetId: target.id,
confidence: 0.8,
method: 'synonym',
matchedSynonym: synonym
}
}
}
}
return null
}
Example:
Source: "Beef"
Target: "Cattle meat"
Match: ✅ Synonym (confidence: 0.8, via "cattle meat")
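The getSynonyms helper used in the algorithm above is not shown; a minimal version (an assumption, not the project's actual implementation) could look up the dictionary in both directions:

```typescript
// Look the term up in the dictionary in both directions: as a canonical key
// and as a listed synonym. The term's own name is always included.
function getSynonyms(name: string): string[] {
  const normalized = normalizeName(name)
  const results = new Set<string>([name])
  for (const [key, synonyms] of Object.entries(SYNONYMS)) {
    const hit =
      normalizeName(key) === normalized ||
      synonyms.some(s => normalizeName(s) === normalized)
    if (hit) {
      results.add(key)
      synonyms.forEach(s => results.add(s))
    }
  }
  return [...results]
}
```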
Stage 4: Semantic Embedding Search
Method: AI-powered vector similarity using embeddings
Confidence: 0.50 - 0.70 (Medium)
Use Case: Semantic similarity when exact matches fail
AI Providers:
- OpenAI (Recommended)
  - Model: text-embedding-3-small
  - Dimensions: 1536
  - Cost-effective and accurate
- Google Generative AI (Alternative)
  - Model: text-embedding-004
  - Dimensions: 768
  - Good alternative to OpenAI
- Mock Mode (Fallback)
  - Deterministic string-based embeddings
  - No API key required
  - Suitable for testing
Algorithm:
async function semanticMatch(
term: Term,
targetTerms: Term[],
provider: 'openai' | 'google'
): Promise<Match | null> {
// Generate embedding for source term
const sourceEmbedding = await generateEmbedding(
`${term.name} ${term.category}`,
provider
)
let bestMatch: Match | null = null
let highestSimilarity = 0
for (const target of targetTerms) {
// Generate embedding for target term
const targetEmbedding = await generateEmbedding(
`${target.name} ${target.category}`,
provider
)
// Calculate cosine similarity
const similarity = cosineSimilarity(sourceEmbedding, targetEmbedding)
if (similarity > highestSimilarity && similarity > 0.5) {
highestSimilarity = similarity
bestMatch = {
sourceId: term.id,
targetId: target.id,
confidence: similarity,
method: 'semantic',
similarity: similarity
}
}
}
return bestMatch
}
function cosineSimilarity(a: number[], b: number[]): number {
const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0)
const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0))
const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0))
return dotProduct / (magnitudeA * magnitudeB)
}
Example:
Source: "Grass-fed beef cattle"
Target: "Extensive pasture cattle production"
Embedding Similarity: 0.68
Match: ✅ Semantic (confidence: 0.68)
AI Provider Integration
OpenAI Configuration
# Set API key
export OPENAI_API_KEY="sk-..."
# Run semantic matching
npm run match-glossaries
OpenAI Features:
- Latest embedding models
- High accuracy for food/LCA terminology
- Reasonable API costs
- Fast response times
Example API Call:
import OpenAI from 'openai'
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
})
async function getEmbedding(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
dimensions: 1536
})
return response.data[0].embedding
}
Google AI Configuration
# Set API key
export GOOGLE_API_KEY="AIza..."
# Run semantic matching with Google
npm run match-glossaries
Google AI Features:
- Alternative to OpenAI
- Good multilingual support
- Competitive pricing
- Reliable embeddings
Example API Call:
import { GoogleGenerativeAI } from '@google/generative-ai'
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!)
async function getEmbedding(text: string): Promise<number[]> {
const model = genAI.getGenerativeModel({ model: 'text-embedding-004' })
const result = await model.embedContent(text)
return result.embedding.values
}
Mock Mode (No API Keys)
# Run without API keys (mock embeddings)
npm run match-glossaries:mock
Mock Mode Features:
- Deterministic string-based embeddings
- No API costs
- Suitable for testing
- Reproducible results
Mock Algorithm:
function mockEmbedding(text: string, dimensions: number = 1536): number[] {
  const embedding = new Array(dimensions).fill(0)
  // Fold character codes into the vector so identical strings always
  // produce identical embeddings
  for (let i = 0; i < text.length; i++) {
    const charCode = text.charCodeAt(i)
    embedding[i % dimensions] += charCode / 1000
  }
  // Normalize to unit length (guard against empty input)
  const magnitude = Math.sqrt(
    embedding.reduce((sum, val) => sum + val * val, 0)
  )
  return magnitude > 0 ? embedding.map(val => val / magnitude) : embedding
}
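Because mock embeddings are deterministic, two runs over the same glossary produce identical similarity scores, which keeps test output reproducible. A quick illustrative check:

```typescript
// Mock embeddings are pure functions of their input, so results are stable
// across runs and machines
const a = mockEmbedding('wheat grain')
const b = mockEmbedding('wheat grain')
const c = mockEmbedding('cattle meat')

console.log(cosineSimilarity(a, b)) // ≈ 1.0 – identical input, identical vector
console.log(cosineSimilarity(a, c)) // some fixed value, reproducible every run
```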
Database Integration
Semantic mappings can be stored in PostgreSQL with pgvector for efficient similarity search:
pgvector Setup
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create embeddings table
CREATE TABLE term_embeddings (
id SERIAL PRIMARY KEY,
term_id VARCHAR(255) NOT NULL,
term_name TEXT NOT NULL,
term_source VARCHAR(50) NOT NULL,
embedding vector(1536),
created_at TIMESTAMP DEFAULT NOW()
);
-- Create index for similarity search
CREATE INDEX ON term_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Find similar terms using cosine distance
SELECT
term_id,
term_name,
term_source,
1 - (embedding <=> query_embedding) as similarity
FROM term_embeddings
WHERE term_source != 'source_to_exclude'
ORDER BY embedding <=> query_embedding
LIMIT 10;
Embedding Storage
import { Pool } from 'pg'

// Reuse one connection pool for the whole run instead of creating one per call
const pool = new Pool({
  connectionString: process.env.DATABASE_URL
})

async function storeEmbedding(
  termId: string,
  termName: string,
  source: string,
  embedding: number[]
): Promise<void> {
  // pgvector accepts the text form '[v1,v2,...]' for vector parameters
  await pool.query(
    `INSERT INTO term_embeddings (term_id, term_name, term_source, embedding)
     VALUES ($1, $2, $3, $4)`,
    [termId, termName, source, `[${embedding.join(',')}]`]
  )
}
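A similarity lookup against the same table can then be issued from TypeScript. The sketch below assumes the pool and table defined above; findSimilarTerms is a hypothetical helper name:

```typescript
interface SimilarTerm {
  term_id: string
  term_name: string
  term_source: string
  similarity: number
}

// Nearest-neighbour lookup via pgvector's cosine distance operator (<=>),
// excluding terms from the query's own vocabulary
async function findSimilarTerms(
  embedding: number[],
  excludeSource: string
): Promise<SimilarTerm[]> {
  const result = await pool.query(
    `SELECT term_id, term_name, term_source,
            1 - (embedding <=> $1::vector) AS similarity
     FROM term_embeddings
     WHERE term_source != $2
     ORDER BY embedding <=> $1::vector
     LIMIT 10`,
    [`[${embedding.join(',')}]`, excludeSource]
  )
  return result.rows
}
```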
Quality Validation
Confidence Scoring
Each match is assigned a confidence score based on the matching method:
| Method | Confidence Range | Quality |
|---|---|---|
| Contextual | 0.95 - 1.0 | Excellent |
| Exact | 0.85 - 0.95 | Very Good |
| Synonym | 0.70 - 0.85 | Good |
| Semantic | 0.50 - 0.70 | Fair |
Confidence Thresholds:
- High confidence (≥ 0.85): Automatic acceptance
- Medium confidence (0.70 - 0.84): Review recommended
- Low confidence (< 0.70): Manual review required
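These thresholds can be encoded directly when post-processing matches; a minimal sketch using the cut-offs listed above:

```typescript
type Recommendation = 'accept' | 'review-recommended' | 'manual-review'

// Map a confidence score to the workflow tiers listed above
function triage(confidence: number): Recommendation {
  if (confidence >= 0.85) return 'accept'
  if (confidence >= 0.7) return 'review-recommended'
  return 'manual-review'
}
```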
Match Quality Indicators
Good Matches:
- ✅ High confidence score (≥ 0.85)
- ✅ Similar categories
- ✅ Same domain (food, LCA, packaging)
- ✅ Consistent descriptions
Questionable Matches:
- ⚠️ Medium confidence (0.70 - 0.84)
- ⚠️ Different categories
- ⚠️ Cross-domain mapping
- ⚠️ Partial name overlap
Poor Matches:
- ❌ Low confidence (< 0.70)
- ❌ Unrelated categories
- ❌ Different domains
- ❌ Semantic mismatch
Validation Methods
Automated Validation:
function validateMatch(match: Match): ValidationResult {
const issues: string[] = []
// Check confidence threshold
if (match.confidence < 0.5) {
issues.push('Confidence below threshold')
}
// Check category consistency
if (match.sourceCategory !== match.targetCategory) {
issues.push('Category mismatch')
}
// Check domain alignment
if (!domainsAlign(match.sourceDomain, match.targetDomain)) {
issues.push('Domain mismatch')
}
return {
valid: issues.length === 0,
confidence: match.confidence,
issues: issues,
recommendation: issues.length === 0 ? 'accept' : 'review'
}
}
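The domainsAlign check and the sourceCategory/sourceDomain fields are not defined in the snippet above; they assume the match has been enriched with metadata from both terms before validation. One possible, purely illustrative shape:

```typescript
// Fields such as sourceCategory/sourceDomain assume the match has been
// enriched with metadata from both terms before validation (illustrative shape)
interface EnrichedMatch extends Match {
  sourceCategory: string
  targetCategory: string
  sourceDomain: 'food' | 'lca' | 'packaging'
  targetDomain: 'food' | 'lca' | 'packaging'
}

// Simplest policy: only identical domains align. A whitelist of accepted
// cross-domain pairs (e.g. food ↔ lca) could be added if the project allows them.
function domainsAlign(source: string, target: string): boolean {
  return source === target
}
```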
Manual Review:
- Export matches to CSV/Excel
- Review low-confidence matches
- Verify cross-domain mappings
- Document validation decisions
Performance Optimization
Batch Processing
Process terms in batches to keep concurrency and memory use bounded (a batched-request variant that also reduces the number of API calls is sketched after the example):
async function batchEmbeddings(
texts: string[],
batchSize: number = 100
): Promise<number[][]> {
const batches: number[][][] = []
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize)
const embeddings = await Promise.all(
batch.map(text => generateEmbedding(text))
)
batches.push(embeddings)
}
return batches.flat()
}
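To genuinely reduce the number of requests (rather than just bounding concurrency), providers such as OpenAI accept an array of inputs per embeddings call. A sketch using the client from the OpenAI example above (batchEmbeddingsOpenAI is an illustrative name):

```typescript
// One request per batch; the API accepts an array of inputs and returns the
// embeddings in the same order. Reuses the `openai` client configured earlier.
async function batchEmbeddingsOpenAI(
  texts: string[],
  batchSize: number = 100
): Promise<number[][]> {
  const all: number[][] = []
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize)
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch
    })
    all.push(...response.data.map(d => d.embedding))
  }
  return all
}
```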
Caching Strategy
Cache embeddings to avoid redundant API calls:
import fs from 'fs'
class EmbeddingCache {
private cache: Map<string, number[]>
private cacheFile: string
constructor(cacheFile: string) {
this.cacheFile = cacheFile
this.cache = this.loadCache()
}
async getEmbedding(text: string, provider: string): Promise<number[]> {
const cacheKey = `${provider}:${text}`
if (this.cache.has(cacheKey)) {
return this.cache.get(cacheKey)!
}
const embedding = await generateEmbedding(text, provider)
this.cache.set(cacheKey, embedding)
this.saveCache()
return embedding
}
private loadCache(): Map<string, number[]> {
if (fs.existsSync(this.cacheFile)) {
const data = JSON.parse(fs.readFileSync(this.cacheFile, 'utf8'))
return new Map(Object.entries(data))
}
return new Map()
}
private saveCache(): void {
const data = Object.fromEntries(this.cache)
fs.writeFileSync(this.cacheFile, JSON.stringify(data, null, 2))
}
}
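Usage is then a single call per term; repeated runs only hit the API for texts not yet in the cache:

```typescript
// Typical usage: embeddings are fetched once, then served from the JSON cache
const cache = new EmbeddingCache('./cache/embeddings.json')
const embedding = await cache.getEmbedding('Common wheat', 'openai')
```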
Rate Limiting
Respect API rate limits:
import pLimit from 'p-limit'
const limit = pLimit(10) // Max 10 concurrent requests
async function matchWithRateLimit(
sourceTerms: Term[],
targetTerms: Term[]
): Promise<Match[]> {
const matches = await Promise.all(
sourceTerms.map(term =>
limit(() => findBestMatch(term, targetTerms))
)
)
return matches.filter(m => m !== null) as Match[]
}
Use Cases
Cross-Vocabulary Mapping
FoodEx2 to Hestia:
FoodEx2: A010101 (Common wheat)
↓ semantic matching
Hestia: term/crop-wheat
↓ provides
Environmental impact data
Ecoinvent to Eaternity:
Ecoinvent: market for wheat grain | GLO
↓ semantic matching
Eaternity: FlowNode.product_name = "Wheat grain"
↓ enables
EOS carbon footprint calculation
User Data Import
CSV Column Header Matching:
User Header: "Produktname" (German)
↓ semantic matching
Eaternity Property: eaternity-property-productname
↓ maps to
EOS API Field: FlowNode.product_name
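Header matching can reuse the Stage 4 embedding search by treating each header as a term. The sketch below is a hypothetical helper, not the importer's actual code:

```typescript
// Treat the CSV header as a one-off "term" and reuse the Stage 4 search.
// matchHeader and the 'csv-header' category are illustrative names.
async function matchHeader(
  header: string,
  propertyTerms: Term[],
  provider: 'openai' | 'google' = 'openai'
): Promise<Match | null> {
  return semanticMatch(
    { id: `csv-header-${header}`, name: header, category: 'csv-header' },
    propertyTerms,
    provider
  )
}

// e.g. matchHeader('Produktname', eaternityProperties) could resolve to the
// eaternity-property-productname term, as in the diagram above
```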
Research Applications
Multi-Source LCA:
Research Question: "Carbon footprint of organic beef"
↓ semantic matching
FoodEx2: F010101 (Beef) + Organic facet
Hestia: term/livestock-cattle-organic
Ecoinvent: cattle for slaughtering, organic | CH
↓ combines
Comprehensive LCA with multiple data sources
Interactive Debugging
Web Interface Features
The ESFC Glossary website provides interactive debugging tools:
Match Visualization:
- Real-time match quality display
- Confidence score visualization
- Method indicator (contextual, exact, synonym, semantic)
- Source/target term comparison
Debugging Tools:
- Match explanation (why this match was selected)
- Alternative matches (other potential matches)
- Similarity scores for semantic matches
- Category and domain comparison
Export Options:
- Export matches to CSV
- Download relationship graph
- Generate mapping report
- Save for manual review
Running Semantic Matching
Command Line
# Full semantic matching (production)
npm run match-glossaries
# Test mode (sample data)
npm run match-glossaries:test
# Mock mode (no API keys)
npm run match-glossaries:mock
# Specific source pairs
node scripts/glossary-matcher.js \
--source foodex2 \
--target hestia \
--output mappings.json
Configuration
// scripts/glossary-matcher.js configuration
const config = {
provider: 'openai', // or 'google'
verbose: true,
mockMode: false,
confidenceThreshold: 0.5,
maxResults: 10,
batchSize: 100,
cacheFile: './cache/embeddings.json'
}
Output Formats
JSON:
{
"matches": [
{
"sourceId": "foodex2-A010101",
"sourceName": "Common wheat",
"targetId": "hestia-term-crop-wheat",
"targetName": "Wheat crop",
"confidence": 0.95,
"method": "contextual",
"validated": true
}
],
"statistics": {
"totalSourceTerms": 31601,
"totalTargetTerms": 36044,
"matchesFound": 15823,
"averageConfidence": 0.82,
"methodBreakdown": {
"contextual": 8934,
"exact": 4521,
"synonym": 1876,
"semantic": 492
}
}
}
Relationship Graph (JSON-LD):
{
"@context": "http://www.w3.org/2004/02/skos/core#",
"@graph": [
{
"@id": "foodex2:A010101",
"@type": "Concept",
"prefLabel": "Common wheat",
"exactMatch": "hestia:term-crop-wheat",
"relatedMatch": "ecoinvent:market-wheat-grain"
}
]
}
Best Practices
Matching Strategy
- Start with High-Confidence Methods
  - Rely on contextual and exact matches when possible
  - Use semantic search as fallback
  - Validate low-confidence matches manually
- Domain-Specific Matching
  - Match food terms to food sources (FoodEx2, Hestia)
  - Match LCA processes to ecoinvent
  - Match properties to Eaternity schema
- Quality Over Quantity
  - Prefer fewer high-quality matches
  - Review and validate questionable matches
  - Document mapping decisions
Performance Optimization
- Batch Processing
  - Process terms in batches
  - Use concurrent requests (with limits)
  - Cache embeddings
- Incremental Updates (see the sketch after this list)
  - Only re-match changed terms
  - Maintain match history
  - Track validation status
- Resource Management
  - Monitor API usage and costs
  - Use mock mode for development
  - Cache embeddings locally
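One way to implement incremental updates (an assumption, not necessarily the project's mechanism) is to fingerprint each term's matching-relevant fields and re-match only the terms whose fingerprint changed since the previous run:

```typescript
import { createHash } from 'crypto'

// Stable fingerprint of the fields that influence matching
function termFingerprint(term: Term): string {
  return createHash('sha256')
    .update(`${term.name}|${term.category}`)
    .digest('hex')
}

// Only terms whose fingerprint changed since the previous run need re-matching
function changedTerms(
  terms: Term[],
  previousFingerprints: Map<string, string>
): Term[] {
  return terms.filter(
    term => previousFingerprints.get(term.id) !== termFingerprint(term)
  )
}
```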
Related Documentation
- Data Sources - Overview of all 10 sources
- Eaternity Schema - Property matching
- Data Formats - Export relationship data
- FoodEx2 Reference - Food classification mapping
- Hestia Reference - LCA data mapping