Semantic Mapping
The ESFC Glossary uses AI-powered semantic mapping to connect terms across 10 food and LCA vocabularies, enabling cross-source term matching and relationship discovery.
Overview
Semantic mapping creates relationships between terms from different sources (FoodEx2, Hestia, Ecoinvent, etc.) using a 4-stage matching cascade that combines contextual matching, exact matching, synonym detection, and AI embeddings.
Key Features:
- AI-Powered Matching - OpenAI and Google AI embeddings
- 4-Stage Cascade - Multiple matching strategies with fallbacks
- Quality Validation - Confidence scoring and match quality analysis
- Interactive Debugging - Real-time match visualization
- Zero Configuration - Falls back to mock mode without API keys
Matching Strategy
4-Stage Cascade
The semantic matching system uses a cascading approach, trying increasingly sophisticated methods:
Stage 1: Contextual Matching (Highest Confidence)
↓ (if no match)
Stage 2: Exact Name Matching
↓ (if no match)
Stage 3: Synonym Matching
↓ (if no match)
Stage 4: Semantic Embedding Search
Each stage has different confidence levels and use cases.
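Taken together, the cascade can be driven by a small orchestrator that tries each stage in order and returns the first match. The sketch below is illustrative: the Term and Match shapes are assumptions consistent with the examples on this page, and the stage functions are the ones defined in the sections that follow.

```typescript
// Shapes assumed throughout the examples on this page (illustrative, not the
// project's actual type definitions)
interface Term {
  id: string
  name: string
  category: string
}

interface Match {
  sourceId: string
  targetId: string
  confidence: number
  method: 'contextual' | 'exact' | 'synonym' | 'semantic'
  matchedSynonym?: string
  similarity?: number
}

// Try each stage in order; stop at the first stage that produces a match
async function findBestMatch(
  term: Term,
  targetTerms: Term[],
  provider: 'openai' | 'google' = 'openai'
): Promise<Match | null> {
  return (
    contextualMatch(term, targetTerms) ??
    exactMatch(term, targetTerms) ??
    synonymMatch(term, targetTerms) ??
    (await semanticMatch(term, targetTerms, provider))
  )
}
```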
Stage 1: Contextual Matching
Method: Combines name + category for matching
Confidence: 0.95 - 1.0 (Very High)
Use Case: When both name and category context align
Algorithm:
function contextualMatch(term: Term, targetTerms: Term[]): Match | null {
const searchKey = `${term.name} ${term.category}`.toLowerCase()
for (const target of targetTerms) {
const targetKey = `${target.name} ${target.category}`.toLowerCase()
if (searchKey === targetKey) {
return {
sourceId: term.id,
targetId: target.id,
confidence: 1.0,
method: 'contextual'
}
}
}
return null
}
Example:
Source: FoodEx2 "Apple" (category: "Fruits")
Target: Hestia "Apple" (category: "Inputs & Products")
Match: ✅ Contextual (confidence: 1.0)
Stage 2: Exact Name Matching
Method: Exact string match on normalized names
Confidence: 0.85 - 0.95 (High)
Use Case: Identical names across different sources
Normalization:
- Lowercase conversion
- Trim whitespace
- Remove special characters
- Handle plurals (optional)
Algorithm:
function exactMatch(term: Term, targetTerms: Term[]): Match | null {
const normalized = normalizeName(term.name)
for (const target of targetTerms) {
if (normalized === normalizeName(target.name)) {
return {
sourceId: term.id,
targetId: target.id,
confidence: 0.9,
method: 'exact'
}
}
}
return null
}
function normalizeName(name: string): string {
return name.toLowerCase().trim().replace(/[^a-z0-9\s]/g, '')
}
Example:
Source: "Wheat grain"
Target: "wheat grain"
Match: ✅ Exact (confidence: 0.9)
Stage 3: Synonym Matching
Method: Built-in synonym dictionary
Confidence: 0.70 - 0.85 (Medium-High)
Use Case: Known alternative names and common variations
Synonym Dictionary:
const SYNONYMS = {
'beef': ['cattle meat', 'bovine meat'],
'pork': ['pig meat', 'swine meat'],
'milk': ['dairy milk', 'cow milk'],
'wheat': ['common wheat', 'bread wheat'],
'rice': ['paddy rice', 'rice grain'],
'CO2': ['carbon dioxide', 'co2 emission'],
'CH4': ['methane', 'methane emission'],
'N2O': ['nitrous oxide', 'n2o emission']
}
Algorithm:
function synonymMatch(term: Term, targetTerms: Term[]): Match | null {
const termSynonyms = getSynonyms(term.name)
for (const target of targetTerms) {
const targetNormalized = normalizeName(target.name)
for (const synonym of termSynonyms) {
if (normalizeName(synonym) === targetNormalized) {
return {
sourceId: term.id,
targetId: target.id,
confidence: 0.8,
method: 'synonym',
matchedSynonym: synonym
}
}
}
}
return null
}
Example:
Source: "Beef"
Target: "Cattle meat"
Match: ✅ Synonym (confidence: 0.8, via "cattle meat")
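The getSynonyms helper used in the algorithm above is not shown; a minimal version (an assumption, not the project's actual implementation) could look up the dictionary in both directions:

```typescript
// Look the term up in the dictionary in both directions: as a canonical key
// and as a listed synonym. The term's own name is always included.
function getSynonyms(name: string): string[] {
  const normalized = normalizeName(name)
  const results = new Set<string>([name])
  for (const [key, synonyms] of Object.entries(SYNONYMS)) {
    const hit =
      normalizeName(key) === normalized ||
      synonyms.some(s => normalizeName(s) === normalized)
    if (hit) {
      results.add(key)
      synonyms.forEach(s => results.add(s))
    }
  }
  return [...results]
}
```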
Stage 4: Semantic Embedding Search
Method: AI-powered vector similarity using embeddings
Confidence: 0.50 - 0.70 (Medium)
Use Case: Semantic similarity when exact matches fail
AI Providers:
- OpenAI (Recommended)
  - Model: text-embedding-3-small
  - Dimensions: 1536
  - Cost-effective and accurate
- Google Generative AI (Alternative)
  - Model: text-embedding-004
  - Dimensions: 768
  - Good alternative to OpenAI
- Mock Mode (Fallback)
  - Deterministic string-based embeddings
  - No API key required
  - Suitable for testing
Algorithm:
async function semanticMatch(
term: Term,
targetTerms: Term[],
provider: 'openai' | 'google'
): Promise<Match | null> {
// Generate embedding for source term
const sourceEmbedding = await generateEmbedding(
`${term.name} ${term.category}`,
provider
)
let bestMatch: Match | null = null
let highestSimilarity = 0
for (const target of targetTerms) {
// Generate embedding for target term
const targetEmbedding = await generateEmbedding(
`${target.name} ${target.category}`,
provider
)
// Calculate cosine similarity
const similarity = cosineSimilarity(sourceEmbedding, targetEmbedding)
if (similarity > highestSimilarity && similarity > 0.5) {
highestSimilarity = similarity
bestMatch = {
sourceId: term.id,
targetId: target.id,
confidence: similarity,
method: 'semantic',
similarity: similarity
}
}
}
return bestMatch
}
function cosineSimilarity(a: number[], b: number[]): number {
const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0)
const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0))
const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0))
return dotProduct / (magnitudeA * magnitudeB)
}
Example:
Source: "Grass-fed beef cattle"
Target: "Extensive pasture cattle production"
Embedding Similarity: 0.68
Match: ✅ Semantic (confidence: 0.68)
AI Provider Integration
OpenAI Configuration
# Set API key
export OPENAI_API_KEY="sk-..."
# Run semantic matching
npm run match-glossaries
OpenAI Features:
- Latest embedding models
- High accuracy for food/LCA terminology
- Reasonable API costs
- Fast response times
Example API Call:
import OpenAI from 'openai'
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
})
async function getEmbedding(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
dimensions: 1536
})
return response.data[0].embedding
}
Google AI Configuration
# Set API key
export GOOGLE_API_KEY="AIza..."
# Run semantic matching with Google
npm run match-glossaries
Google AI Features:
- Alternative to OpenAI
- Good multilingual support
- Competitive pricing
- Reliable embeddings
Example API Call:
import { GoogleGenerativeAI } from '@google/generative-ai'
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!)
async function getEmbedding(text: string): Promise<number[]> {
const model = genAI.getGenerativeModel({ model: 'text-embedding-004' })
const result = await model.embedContent(text)
return result.embedding.values
}
Mock Mode (No API Keys)
# Run without API keys (mock embeddings)
npm run match-glossaries:mock
Mock Mode Features:
- Deterministic string-based embeddings
- No API costs
- Suitable for testing
- Reproducible results
Mock Algorithm:
function mockEmbedding(text: string, dimensions: number = 1536): number[] {
  const embedding = new Array(dimensions).fill(0)
  // Fold character codes into the vector so identical strings always
  // produce identical embeddings
  for (let i = 0; i < text.length; i++) {
    const charCode = text.charCodeAt(i)
    embedding[i % dimensions] += charCode / 1000
  }
  // Normalize to unit length (guard against empty input)
  const magnitude = Math.sqrt(
    embedding.reduce((sum, val) => sum + val * val, 0)
  )
  return magnitude > 0 ? embedding.map(val => val / magnitude) : embedding
}
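Because mock embeddings are deterministic, two runs over the same glossary produce identical similarity scores, which keeps test output reproducible. A quick illustrative check:

```typescript
// Mock embeddings are pure functions of their input, so results are stable
// across runs and machines
const a = mockEmbedding('wheat grain')
const b = mockEmbedding('wheat grain')
const c = mockEmbedding('cattle meat')

console.log(cosineSimilarity(a, b)) // ≈ 1.0 – identical input, identical vector
console.log(cosineSimilarity(a, c)) // some fixed value, reproducible every run
```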
Database Integration
Semantic mappings can be stored in PostgreSQL with pgvector for efficient similarity search:
pgvector Setup
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create embeddings table
CREATE TABLE term_embeddings (
id SERIAL PRIMARY KEY,
term_id VARCHAR(255) NOT NULL,
term_name TEXT NOT NULL,
term_source VARCHAR(50) NOT NULL,
embedding vector(1536),
created_at TIMESTAMP DEFAULT NOW()
);
-- Create index for similarity search
CREATE INDEX ON term_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Find similar terms using cosine distance
SELECT
term_id,
term_name,
term_source,
1 - (embedding <=> query_embedding) as similarity
FROM term_embeddings
WHERE term_source != 'source_to_exclude'
ORDER BY embedding <=> query_embedding
LIMIT 10;
Embedding Storage
import { Pool } from 'pg'

// Reuse one connection pool for the whole run instead of creating one per call
const pool = new Pool({
  connectionString: process.env.DATABASE_URL
})

async function storeEmbedding(
  termId: string,
  termName: string,
  source: string,
  embedding: number[]
): Promise<void> {
  // pgvector accepts the text form '[v1,v2,...]' for vector parameters
  await pool.query(
    `INSERT INTO term_embeddings (term_id, term_name, term_source, embedding)
     VALUES ($1, $2, $3, $4)`,
    [termId, termName, source, `[${embedding.join(',')}]`]
  )
}
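A similarity lookup against the same table can then be issued from TypeScript. The sketch below assumes the pool and table defined above; findSimilarTerms is a hypothetical helper name:

```typescript
interface SimilarTerm {
  term_id: string
  term_name: string
  term_source: string
  similarity: number
}

// Nearest-neighbour lookup via pgvector's cosine distance operator (<=>),
// excluding terms from the query's own vocabulary
async function findSimilarTerms(
  embedding: number[],
  excludeSource: string
): Promise<SimilarTerm[]> {
  const result = await pool.query(
    `SELECT term_id, term_name, term_source,
            1 - (embedding <=> $1::vector) AS similarity
     FROM term_embeddings
     WHERE term_source != $2
     ORDER BY embedding <=> $1::vector
     LIMIT 10`,
    [`[${embedding.join(',')}]`, excludeSource]
  )
  return result.rows
}
```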
Quality Validation
Confidence Scoring
Each match is assigned a confidence score based on the matching method:
| Method | Confidence Range | Quality |
|---|---|---|
| Contextual | 0.95 - 1.0 | Excellent |
| Exact | 0.85 - 0.95 | Very Good |
| Synonym | 0.70 - 0.85 | Good |
| Semantic | 0.50 - 0.70 | Fair |
Confidence Thresholds:
- High confidence (≥ 0.85): Automatic acceptance
- Medium confidence (0.70 - 0.84): Review recommended
- Low confidence (< 0.70): Manual review required
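These thresholds can be encoded directly when post-processing matches; a minimal sketch using the cut-offs listed above:

```typescript
type Recommendation = 'accept' | 'review-recommended' | 'manual-review'

// Map a confidence score to the workflow tiers listed above
function triage(confidence: number): Recommendation {
  if (confidence >= 0.85) return 'accept'
  if (confidence >= 0.7) return 'review-recommended'
  return 'manual-review'
}
```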
Match Quality Indicators
Good Matches:
- ✅ High confidence score (≥ 0.85)
- ✅ Similar categories
- ✅ Same domain (food, LCA, packaging)
- ✅ Consistent descriptions
Questionable Matches:
- ⚠️ Medium confidence (0.70 - 0.84)
- ⚠️ Different categories
- ⚠️ Cross-domain mapping
- ⚠️ Partial name overlap
Poor Matches:
- ❌ Low confidence (< 0.70)
- ❌ Unrelated categories
- ❌ Different domains
- ❌ Semantic mismatch
Validation Methods
Automated Validation:
function validateMatch(match: Match): ValidationResult {
const issues: string[] = []
// Check confidence threshold
if (match.confidence < 0.5) {
issues.push('Confidence below threshold')
}
// Check category consistency
if (match.sourceCategory !== match.targetCategory) {
issues.push('Category mismatch')
}
// Check domain alignment
if (!domainsAlign(match.sourceDomain, match.targetDomain)) {
issues.push('Domain mismatch')
}
return {
valid: issues.length === 0,
confidence: match.confidence,
issues: issues,
recommendation: issues.length === 0 ? 'accept' : 'review'
}
}
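The domainsAlign check and the sourceCategory/sourceDomain fields are not defined in the snippet above; they assume the match has been enriched with metadata from both terms before validation. One possible, purely illustrative shape:

```typescript
// Fields such as sourceCategory/sourceDomain assume the match has been
// enriched with metadata from both terms before validation (illustrative shape)
interface EnrichedMatch extends Match {
  sourceCategory: string
  targetCategory: string
  sourceDomain: 'food' | 'lca' | 'packaging'
  targetDomain: 'food' | 'lca' | 'packaging'
}

// Simplest policy: only identical domains align. A whitelist of accepted
// cross-domain pairs (e.g. food ↔ lca) could be added if the project allows them.
function domainsAlign(source: string, target: string): boolean {
  return source === target
}
```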
Manual Review:
- Export matches to CSV/Excel
- Review low-confidence matches
- Verify cross-domain mappings
- Document validation decisions
Performance Optimization
Batch Processing
Process terms in batches to keep concurrency and memory use bounded (a batched-request variant that also reduces the number of API calls is sketched after the example):
async function batchEmbeddings(
texts: string[],
batchSize: number = 100
): Promise<number[][]> {
const batches: number[][][] = []
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize)
const embeddings = await Promise.all(
batch.map(text => generateEmbedding(text))
)
batches.push(embeddings)
}
return batches.flat()
}
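To genuinely reduce the number of requests (rather than just bounding concurrency), providers such as OpenAI accept an array of inputs per embeddings call. A sketch using the client from the OpenAI example above (batchEmbeddingsOpenAI is an illustrative name):

```typescript
// One request per batch; the API accepts an array of inputs and returns the
// embeddings in the same order. Reuses the `openai` client configured earlier.
async function batchEmbeddingsOpenAI(
  texts: string[],
  batchSize: number = 100
): Promise<number[][]> {
  const all: number[][] = []
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize)
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch
    })
    all.push(...response.data.map(d => d.embedding))
  }
  return all
}
```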
Caching Strategy
Cache embeddings to avoid redundant API calls:
import fs from 'fs'
class EmbeddingCache {
private cache: Map<string, number[]>
private cacheFile: string
constructor(cacheFile: string) {
this.cacheFile = cacheFile
this.cache = this.loadCache()
}
async getEmbedding(text: string, provider: string): Promise<number[]> {
const cacheKey = `${provider}:${text}`
if (this.cache.has(cacheKey)) {
return this.cache.get(cacheKey)!
}
const embedding = await generateEmbedding(text, provider)
this.cache.set(cacheKey, embedding)
this.saveCache()
return embedding
}
private loadCache(): Map<string, number[]> {
if (fs.existsSync(this.cacheFile)) {
const data = JSON.parse(fs.readFileSync(this.cacheFile, 'utf8'))
return new Map(Object.entries(data))
}
return new Map()
}
private saveCache(): void {
const data = Object.fromEntries(this.cache)
fs.writeFileSync(this.cacheFile, JSON.stringify(data, null, 2))
}
}
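Usage is then a single call per term; repeated runs only hit the API for texts not yet in the cache:

```typescript
// Typical usage: embeddings are fetched once, then served from the JSON cache
const cache = new EmbeddingCache('./cache/embeddings.json')
const embedding = await cache.getEmbedding('Common wheat', 'openai')
```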
Rate Limiting
Respect API rate limits:
import pLimit from 'p-limit'
const limit = pLimit(10) // Max 10 concurrent requests
async function matchWithRateLimit(
sourceTerms: Term[],
targetTerms: Term[]
): Promise<Match[]> {
const matches = await Promise.all(
sourceTerms.map(term =>
limit(() => findBestMatch(term, targetTerms))
)
)
return matches.filter(m => m !== null) as Match[]
}
Use Cases
Cross-Vocabulary Mapping
FoodEx2 to Hestia:
FoodEx2: A010101 (Common wheat)
↓ semantic matching
Hestia: term/crop-wheat
↓ provides
Environmental impact data
Ecoinvent to Eaternity:
Ecoinvent: market for wheat grain | GLO
↓ semantic matching
Eaternity: FlowNode.product_name = "Wheat grain"
↓ enables
EOS carbon footprint calculation
User Data Import
CSV Column Header Matching:
User Header: "Produktname" (German)
↓ semantic matching
Eaternity Property: eaternity-property-productname
↓ maps to
EOS API Field: FlowNode.product_name
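Header matching can reuse the Stage 4 embedding search by treating each header as a term. The sketch below is a hypothetical helper, not the importer's actual code:

```typescript
// Treat the CSV header as a one-off "term" and reuse the Stage 4 search.
// matchHeader and the 'csv-header' category are illustrative names.
async function matchHeader(
  header: string,
  propertyTerms: Term[],
  provider: 'openai' | 'google' = 'openai'
): Promise<Match | null> {
  return semanticMatch(
    { id: `csv-header-${header}`, name: header, category: 'csv-header' },
    propertyTerms,
    provider
  )
}

// e.g. matchHeader('Produktname', eaternityProperties) could resolve to the
// eaternity-property-productname term, as in the diagram above
```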
Research Applications
Multi-Source LCA:
Research Question: "Carbon footprint of organic beef"
↓ semantic matching
FoodEx2: F010101 (Beef) + Organic facet
Hestia: term/livestock-cattle-organic
Ecoinvent: cattle for slaughtering, organic | CH
↓ combines
Comprehensive LCA with multiple data sources
Interactive Debugging
Web Interface Features
The ESFC Glossary website provides interactive debugging tools:
Match Visualization:
- Real-time match quality display
- Confidence score visualization
- Method indicator (contextual, exact, synonym, semantic)
- Source/target term comparison
Debugging Tools:
- Match explanation (why this match was selected)
- Alternative matches (other potential matches)
- Similarity scores for semantic matches
- Category and domain comparison
Export Options:
- Export matches to CSV
- Download relationship graph
- Generate mapping report
- Save for manual review
Running Semantic Matching
Command Line
# Full semantic matching (production)
npm run match-glossaries
# Test mode (sample data)
npm run match-glossaries:test
# Mock mode (no API keys)
npm run match-glossaries:mock
# Specific source pairs
node scripts/glossary-matcher.js \
--source foodex2 \
--target hestia \
--output mappings.json
Configuration
// scripts/glossary-matcher.js configuration
const config = {
provider: 'openai', // or 'google'
verbose: true,
mockMode: false,
confidenceThreshold: 0.5,
maxResults: 10,
batchSize: 100,
cacheFile: './cache/embeddings.json'
}
Output Formats
JSON:
{
"matches": [
{
"sourceId": "foodex2-A010101",
"sourceName": "Common wheat",
"targetId": "hestia-term-crop-wheat",
"targetName": "Wheat crop",
"confidence": 0.95,
"method": "contextual",
"validated": true
}
],
"statistics": {
"totalSourceTerms": 31601,
"totalTargetTerms": 36044,
"matchesFound": 15823,
"averageConfidence": 0.82,
"methodBreakdown": {
"contextual": 8934,
"exact": 4521,
"synonym": 1876,
"semantic": 492
}
}
}
Relationship Graph (JSON-LD):
{
"@context": "http://www.w3.org/2004/02/skos/core#",
"@graph": [
{
"@id": "foodex2:A010101",
"@type": "Concept",
"prefLabel": "Common wheat",
"exactMatch": "hestia:term-crop-wheat",
"relatedMatch": "ecoinvent:market-wheat-grain"
}
]
}
Best Practices
Matching Strategy
- Start with High-Confidence Methods
  - Rely on contextual and exact matches when possible
  - Use semantic search as fallback
  - Validate low-confidence matches manually
- Domain-Specific Matching
  - Match food terms to food sources (FoodEx2, Hestia)
  - Match LCA processes to ecoinvent
  - Match properties to Eaternity schema
- Quality Over Quantity
  - Prefer fewer high-quality matches
  - Review and validate questionable matches
  - Document mapping decisions
Performance Optimization
- Batch Processing
  - Process terms in batches
  - Use concurrent requests (with limits)
  - Cache embeddings
- Incremental Updates (see the sketch after this list)
  - Only re-match changed terms
  - Maintain match history
  - Track validation status
- Resource Management
  - Monitor API usage and costs
  - Use mock mode for development
  - Cache embeddings locally
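One way to implement incremental updates (an assumption, not necessarily the project's mechanism) is to fingerprint each term's matching-relevant fields and re-match only the terms whose fingerprint changed since the previous run:

```typescript
import { createHash } from 'crypto'

// Stable fingerprint of the fields that influence matching
function termFingerprint(term: Term): string {
  return createHash('sha256')
    .update(`${term.name}|${term.category}`)
    .digest('hex')
}

// Only terms whose fingerprint changed since the previous run need re-matching
function changedTerms(
  terms: Term[],
  previousFingerprints: Map<string, string>
): Term[] {
  return terms.filter(
    term => previousFingerprints.get(term.id) !== termFingerprint(term)
  )
}
```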
Related Documentation
- Data Sources - Overview of all 10 sources
- Eaternity Schema - Property matching
- Data Formats - Export relationship data
- FoodEx2 Reference - Food classification mapping
- Hestia Reference - LCA data mapping