Origin GFM
The Origin Gap Filling Module determines the geographic origin of food products when not explicitly specified. Using FAO (Food and Agriculture Organization) trade statistics, it calculates import shares and domestic production ratios to estimate where ingredients most likely originate based on the "kitchen location" (consumption country).
Quick Reference
| Property | Description |
|---|---|
| Runs on | FoodProductFlowNode (excluding subdivisions) |
| Dependencies | LocationGapFillingWorker, AddClientNodesGapFillingWorker, InventoryConnectorGapFillingWorker, MatchProductNameGapFillingWorker, LinkTermToActivityNodeGapFillingWorker, IngredientAmountEstimatorGapFillingWorker |
| Key Input | Kitchen location (country), matched product FoodEx2 terms, FAO trade data |
| Output | Origin split with country-specific percentages, location properties |
| Data Source | FAO STAT Detailed Trade Matrix, FAO Production Statistics |
When It Runs
The module triggers when:
- The node is a FoodProductFlowNode (not a subdivision)
- No origin is already specified, OR multiple origins need processing
- All dependent GFMs have completed
- Sub-activity nodes have production amounts calculated
Key Output
The module modifies the calculation graph by:
- Origin Split: Creating multiple product flow nodes, each with a specific origin country
- Percentage Allocation: Distributing quantities based on import statistics and domestic production
- Location Properties: Setting country codes, coordinates, and GADM terms
Scientific Methodology
The origin model calculates the probability distribution of where a product originates based on:
Total Supply = Domestic Production + Imports - Exports
For each country of origin, the share is calculated as:
Domestic Share = (Domestic Production - Exports) / Total Supply
Foreign Share = 1 - Domestic Share
Import Share per Country = Foreign Share * (Country Import / Total Import)
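The formulas above can be sketched in Python as follows. This is a minimal illustration rather than the module's actual code: the function name `origin_shares` and its arguments are hypothetical, and it assumes positive total supply and positive total imports.

```python
def origin_shares(
    domestic_production: float,
    exports: float,
    imports_by_country: dict[str, float],
) -> tuple[float, dict[str, float]]:
    """Return (domestic share, import share per country). Hypothetical helper."""
    total_import = sum(imports_by_country.values())
    total_supply = domestic_production + total_import - exports
    if total_supply <= 0 or total_import <= 0:
        raise ValueError("shares are undefined for non-positive supply or imports")
    # Domestic Share = (Domestic Production - Exports) / Total Supply
    domestic_share = (domestic_production - exports) / total_supply
    foreign_share = 1 - domestic_share
    # Import Share per Country = Foreign Share * (Country Import / Total Import)
    import_shares = {
        country: foreign_share * quantity / total_import
        for country, quantity in imports_by_country.items()
    }
    return domestic_share, import_shares


domestic, per_country = origin_shares(60.0, 10.0, {"A": 30.0, "B": 20.0})
print(domestic, per_country)
```

With a production of 60, exports of 10, and imports of 30 and 20, total supply is 100, so the domestic share is 0.5 and the two import shares are 0.3 and 0.2. The domestic share and all import shares always sum to 1.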
Trade Balance Model
The model uses FAO STAT data to compute origin distributions:
Domestic Production Share
domestic_import_export_total = domestic_production + import_value - export_value
domestic_origin_share = (domestic_production - export_value) / domestic_import_export_total
If domestic production exceeds exports, the surplus (production minus exports) is assumed to be consumed domestically. The remaining share is allocated to the countries the product is imported from.
Import Country Distribution
For foreign origins, the module identifies the top contributing countries:
# Get the relative shares of the countries covering 90% of total imports
m49_country_share = import_export_cache.import_countries_top90percent(
    kitchen_country_m49_code, fao_code, year_column
)
# Scale each country's relative import share by the overall foreign share
for country, value in m49_country_share.items():
    country_total_percentage_share[country] = foreign_origin_share * value
The top 90% threshold reduces computational complexity while maintaining accuracy. For products without specific FAO codes (default statistics), only the top 10 countries are used.
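One way to picture the 90% selection is the sketch below. It is not the cache's real implementation: `top_share_countries` and its parameters are hypothetical names, and re-normalizing the selected shares so that they sum to 1 is an assumption about how the relative shares are defined.

```python
from typing import Optional


def top_share_countries(
    imports_by_country: dict[str, float],
    coverage: float = 0.9,
    max_countries: Optional[int] = None,
) -> dict[str, float]:
    """Relative shares of the largest import sources that reach `coverage`."""
    total = sum(imports_by_country.values())
    if total <= 0:
        return {}
    ranked = sorted(imports_by_country.items(), key=lambda kv: kv[1], reverse=True)
    selected: dict[str, float] = {}
    covered = 0.0
    for country, quantity in ranked:
        if covered >= coverage * total:
            break  # coverage threshold reached
        if max_countries is not None and len(selected) >= max_countries:
            break  # hard cap, e.g. 10 for default statistics
        selected[country] = quantity
        covered += quantity
    # Assumption: re-normalize so the selected shares sum to 1
    return {country: quantity / covered for country, quantity in selected.items()}


shares = top_share_countries({"ES": 35000, "IT": 20000, "NL": 10000, "MA": 5000, "XX": 5000})
print(sorted(shares))
```

In this example the three largest sources cover only about 87% of imports, so a fourth is added before the loop stops. Passing `max_countries=10` would mimic the cap described for default statistics.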
Inconsistency Handling
FAO trade data can have inconsistencies. The module handles these cases:
| Scenario | Handling |
|---|---|
| Export >= Domestic + Import | Set "unknown origin" |
| No domestic production data | Use import statistics only |
| No trade data for product | Fall back to default FAO statistics |
| Negative total supply | Set "unknown origin" |
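The table can be read as a small decision procedure. The sketch below is only an illustration: the enum, the function name, the use of `None` for missing data, and the order of the checks are all assumptions, not the module's code.

```python
from enum import Enum
from typing import Optional


class OriginStrategy(Enum):
    """Hypothetical labels for the handling column of the table above."""
    UNKNOWN_ORIGIN = "unknown origin"
    IMPORT_ONLY = "import statistics only"
    DEFAULT_STATISTICS = "default FAO statistics"
    TRADE_BALANCE = "trade balance model"


def choose_strategy(
    domestic_production: Optional[float],
    import_quantity: Optional[float],
    export_quantity: Optional[float],
) -> OriginStrategy:
    """Pick a handling strategy; None stands for a missing data point."""
    if import_quantity is None or export_quantity is None:
        return OriginStrategy.DEFAULT_STATISTICS  # no trade data for the product
    if domestic_production is None:
        return OriginStrategy.IMPORT_ONLY
    total_supply = domestic_production + import_quantity - export_quantity
    if total_supply <= 0:
        # Covers "export >= domestic + import" and negative total supply
        return OriginStrategy.UNKNOWN_ORIGIN
    return OriginStrategy.TRADE_BALANCE


print(choose_strategy(45_000, 75_000, 5_000).value)
```

Note that "Export >= Domestic + Import" and a non-positive total supply describe the same condition, which is why one check covers both rows.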
Implementation Details
FAO Data Configuration
The module uses two primary FAO datasets:
FAO_IMPORT_EXPORT_ZIP_URL = "https://bulks-faostat.fao.org/production/Trade_DetailedTradeMatrix_E_All_Data.zip"
FAO_DOMESTIC_ZIP_URL = "https://bulks-faostat.fao.org/production/Production_Crops_Livestock_E_All_Data.zip"
Data files extracted:
- Trade_DetailedTradeMatrix_E_All_Data_NOFLAG.csv - Import/export quantities by country pair
- Production_Crops_Livestock_E_All_Data_NOFLAG.csv - Domestic production by country
- Trade_DetailedTradeMatrix_E_ItemCodes.csv - FAO product code definitions
Year Selection
The module supports primary and secondary years for data availability:
# If primary year data is missing, fall back to secondary year
if pd.isnull(domestic_production_value.values[0]):
    stats_data_year_column = self.gfm_factory.secondary_year
    domestic_production_value = domestic_production_row[stats_data_year_column]
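The same fallback can be shown without pandas. The sketch below is a simplified stand-in: the helper name, the dictionary-based row, and the year column labels are placeholders chosen for illustration.

```python
import math
from typing import Mapping, Optional


def value_with_year_fallback(
    row: Mapping[str, Optional[float]],
    primary_year: str,
    secondary_year: str,
) -> tuple[Optional[float], str]:
    """Return (value, year column used), preferring the primary year."""

    def is_missing(value: Optional[float]) -> bool:
        return value is None or (isinstance(value, float) and math.isnan(value))

    value = row.get(primary_year)
    if not is_missing(value):
        return value, primary_year
    # Primary year is missing: report the secondary year, even if it is empty too
    fallback = row.get(secondary_year)
    return (None if is_missing(fallback) else fallback), secondary_year


# Placeholder column names and quantities
row = {"Y2021": float("nan"), "Y2020": 12.0}
print(value_with_year_fallback(row, "Y2021", "Y2020"))
```

Returning the column name alongside the value keeps later lookups (for example, import and export sums) on the same year as the production figure.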
Special FAO Codes
Custom FAO codes handle edge cases:
| FAO Code | Name | Description |
|---|---|---|
| 100000 | Local Production | Products like water that are locally sourced |
| 200000 | Default Production | Products without specific FAO mappings |
| 300000 | Unknown Fish Production | Fish products without trade data |
| 400000 | Failed Production | Fallback when origin estimation fails |
Processing Modes
Monoproduct Processing
For single-ingredient products:
async def monoproduct_origin_processing(self, calc_graph: "CalcGraph") -> None:
    # 1. Check if origin already specified
    origins_list = await self.parse_current_origins()
    if len(origins_list) == 1:
        return  # Already has origin
    # 2. Get kitchen location from inherited country code
    kitchen_country_code = get_inherited_country_code(self.node)
    kitchen_country_m49_code = iso_to_m49_mapping.get(kitchen_country_code)
    # 3. Find FAO code for product via FoodEx2 mapping
    fao_code_term = linked_fao_code_terms.get(foodex2_term_uids)
    # 4. Calculate domestic + import shares
    # 5. Apply origin split to graph
Combined Product Processing
For products with multiple ingredients:
async def combined_product_origin_processing(self, calc_graph: "CalcGraph") -> None:
    origins_list = await self.parse_current_origins(combined_product_origin_processing=True)
    # Equal distribution among specified origins
    country_total_percentage_share = {
        origin.country_code: 1 / len(origins_list)
        for origin in origins_list
    }
    await self.apply_origin_split_to_graph(calc_graph, country_total_percentage_share, is_combined_product=True)
Origin Split Graph Mutation
When multiple origins are determined, the graph is modified:
async def apply_origin_split_to_graph(
    self,
    calc_graph: "CalcGraph",
    country_total_percentage_share: dict,
    is_origin_from_fao: bool = False,
    ...
) -> None:
    # 1. Remove edge between product and sub-activity
    remove_edge_mutation = RemoveEdgeMutation(...)
    # 2. Add new Origin-Split-Activity node
    activity_node = FoodProcessingActivityNode.model_construct(...)
    calc_graph.apply_mutation(AddNodeMutation(...))
    # 3. Create origin-specific flow nodes
    for country_iso_code, percentage in country_total_percentage_share.items():
        # Duplicate original node with new amount
        calc_graph.apply_mutation(DuplicateNodeMutation(...))
        # Set proportional amount
        new_amount = percentage * activity_node.production_amount.value
        # Set location property
        origin_location = LocationProp.unvalidated_construct(
            address=gadm_term.name,
            country_code=country_iso_code,
            term_uid=gadm_term.uid,
            source=LocationSourceEnum.fao_stat if is_origin_from_fao else LocationSourceEnum.gadm,
        )
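Stripped of the graph machinery, step 3 amounts to multiplying the production amount by each share. The sketch below uses a hypothetical `OriginFlow` dataclass in place of the real flow nodes and mutations, so it only illustrates the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class OriginFlow:
    """Hypothetical stand-in for an origin-specific product flow node."""
    country_code: str
    amount: float


def split_amount_by_origin(
    production_amount: float,
    country_total_percentage_share: dict[str, float],
) -> list[OriginFlow]:
    """Create one flow per origin with a proportional amount."""
    return [
        OriginFlow(country_code=code, amount=share * production_amount)
        for code, share in country_total_percentage_share.items()
    ]


flows = split_amount_by_origin(2.0, {"CH": 0.5, "ES": 0.25, "IT": 0.25})
for flow in flows:
    print(flow.country_code, flow.amount)
```

The amounts add up to the original production amount only when the shares sum to 1, so any share dropped upstream shows up as missing quantity here.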
Regional Origin Handling
When a regional origin (like "Europe") is specified, it expands to component countries:
# Handle regional terms
if term_xid in location_to_regional_term_xid_map.values():
    for region in regional_term_xid_to_region_gadm_codes_map.get(term_xid, []):
        if region not in already_listed_regions:
            country_code = iso_3166_map_3_to_2_letter.get(region.split(".")[0])
            locations.append(LocationProp.unvalidated_construct(
                country_code=country_code,
                term_uid=gadm_term.uid,
                location_qualifier=location_qualifier,
            ))
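The core of the expansion is the conversion from a GADM code to a two-letter country code. The sketch below shows that step with plain dictionaries; the region identifier, the mapping contents, and the function name are made up for illustration, and skipping unmapped codes is an added assumption.

```python
# Illustrative subset of an ISO 3166-1 alpha-3 to alpha-2 mapping
ISO3_TO_ISO2 = {"CHE": "CH", "DEU": "DE", "AUT": "AT"}

# Hypothetical regional term mapped to the GADM codes it contains
REGION_TO_GADM_CODES = {"EXAMPLE-REGION": ["CHE", "DEU", "AUT.1", "XXX"]}


def expand_region(term_xid: str, already_listed: set[str]) -> list[str]:
    """Return alpha-2 country codes for a regional term, skipping known regions."""
    country_codes = []
    for gadm_code in REGION_TO_GADM_CODES.get(term_xid, []):
        if gadm_code in already_listed:
            continue
        # A GADM code starts with the alpha-3 country code, e.g. "AUT.1" -> "AUT"
        iso3 = gadm_code.split(".")[0]
        iso2 = ISO3_TO_ISO2.get(iso3)
        if iso2 is not None:  # assumption: silently skip unmapped codes
            country_codes.append(iso2)
    return country_codes


print(expand_region("EXAMPLE-REGION", already_listed={"DEU"}))
```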
Data Sources
FAO STAT Database
The primary data source is the FAO Statistics Division:
Trade Data: FAO Detailed Trade Matrix
- Contains bilateral trade flows between countries
- Quantities in tonnes
- Elements: Import Quantity, Export Quantity, Import Value, Export Value
Production Data: FAO Production Statistics
- Contains domestic production by country and product
- Quantities in tonnes
Country Code Mappings
The module uses multiple country code systems:
| System | Description | Example |
|---|---|---|
| M49 | UN numeric codes | 756 (Switzerland) |
| ISO 3166-1 alpha-2 | Two-letter codes | CH |
| ISO 3166-1 alpha-3 | Three-letter codes | CHE |
| GADM | Geographic database codes | CHE.1.2 (sub-regions) |
FoodEx2 to FAO Mapping
Products are linked via glossary service:
fao_glossary_links = await glossary_link_service.get_glossary_links_by_gfm(
    gap_filling_module="FAO"
)
# Maps FoodEx2 term UIDs -> FAO code term UID
for link in fao_glossary_links:
    linked_fao_code_terms[frozenset(link.term_uids)] = link.linked_term_uid
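Keying the mapping by `frozenset` means a lookup must also use a `frozenset`, and the order of the term UIDs does not matter. A small sketch with placeholder UIDs (not real glossary identifiers):

```python
from typing import Iterable, Optional

# Placeholder UIDs for illustration only
linked_fao_code_terms: dict[frozenset, str] = {
    frozenset({"foodex2-uid-1"}): "fao-term-uid-a",
    frozenset({"foodex2-uid-2", "foodex2-uid-3"}): "fao-term-uid-b",
}


def find_fao_code_term(term_uids: Iterable[str]) -> Optional[str]:
    """Look up the linked FAO term for a set of FoodEx2 term UIDs."""
    return linked_fao_code_terms.get(frozenset(term_uids))


# The same terms in a different order resolve to the same FAO term
print(find_fao_code_term(["foodex2-uid-3", "foodex2-uid-2"]))
# A partial match does not resolve: the full set of UIDs is the key
print(find_fao_code_term(["foodex2-uid-2"]))
```

Because the whole set is the key, a product matched to two FoodEx2 terms only resolves if a link exists for exactly that combination.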
Calculation Example
Scenario: Determining origin for 1 kg of tomatoes consumed in Switzerland (CH)
Step 1: Identify Product
- FoodEx2 term matched: A0DMX (Tomatoes)
- FAO code via glossary link: 388 (Tomatoes)
- Kitchen country: Switzerland (M49: 756)
Step 2: Query Trade Statistics
FAO data for Switzerland and tomatoes (FAO code 388):
| Data Point | Value (tonnes) |
|---|---|
| Domestic Production | 45,000 |
| Total Imports | 75,000 |
| Total Exports | 5,000 |
Step 3: Calculate Shares
total_supply = 45,000 + 75,000 - 5,000 = 115,000
domestic_share = (45,000 - 5,000) / 115,000 = 0.348 (34.8%)
foreign_share = 1 - 0.348 = 0.652 (65.2%)
Step 4: Distribute Import Share
Top source countries for Swiss tomato imports:
| Country | Import (tonnes) | % of Import | Final Share |
|---|---|---|---|
| Spain | 35,000 | 46.7% | 30.4% |
| Italy | 20,000 | 26.7% | 17.4% |
| Netherlands | 10,000 | 13.3% | 8.7% |
| Morocco | 5,000 | 6.7% | 4.4% |
| Other | 5,000 | 6.7% | 4.4% |
| Switzerland (domestic) | - | - | 34.8% |
Step 5: Graph Modification
The module creates origin split nodes:
Original:
    Tomatoes (1 kg) -> Activity -> ...
After Origin GFM:
    Tomatoes (1 kg) -> Origin-Split-Activity
        |-> Tomatoes (0.348 kg, CH) -> Activity (copy) -> ...
        |-> Tomatoes (0.304 kg, ES) -> Activity (copy) -> ...
        |-> Tomatoes (0.174 kg, IT) -> Activity (copy) -> ...
        |-> Tomatoes (0.087 kg, NL) -> Activity (copy) -> ...
        |-> Tomatoes (0.044 kg, MA) -> Activity (copy) -> ...
        |-> Tomatoes (0.044 kg, OTHER) -> Activity (original) -> ...
Each origin-specific flow then receives appropriate transport calculations downstream.
Caching System
Import/Export Cache
Trade data is cached in MessagePack format for performance:
class ImportExportCache:
    def import_export_value(self, m49_code: int, fao_code: int, year_column: str) -> tuple[int, int]:
        """Returns import and export sums from cache."""
        return self.import_export_data[(m49_code, fao_code)][year_column]
    def import_countries_top90percent(self, m49_code: int, fao_code: int, year_column: str) -> dict:
        """Returns import relative share between countries covering top 90 percent."""
        return self.import_export_data["top90"][(m49_code, fao_code)][year_column]
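To illustrate what pre-aggregation means here, the sketch below sums bilateral trade rows into one import and one export total per reporter and product. The row layout and function name are assumptions, and the per-year level of the real cache is left out to keep the example short.

```python
from collections import defaultdict
from typing import Iterable


def aggregate_trade_rows(
    rows: Iterable[tuple[int, int, int, str, float]],
) -> dict[tuple[int, int], tuple[float, float]]:
    """Sum rows of (reporter_m49, partner_m49, fao_code, element, quantity).

    Returns {(reporter_m49, fao_code): (import_sum, export_sum)}.
    """
    totals: dict[tuple[int, int], list[float]] = defaultdict(lambda: [0.0, 0.0])
    for reporter, _partner, fao_code, element, quantity in rows:
        if element == "Import Quantity":
            totals[(reporter, fao_code)][0] += quantity
        elif element == "Export Quantity":
            totals[(reporter, fao_code)][1] += quantity
        # Value elements (Import Value, Export Value) are ignored in this sketch
    return {key: (sums[0], sums[1]) for key, sums in totals.items()}


# Placeholder partner codes and quantities
rows = [
    (756, 724, 388, "Import Quantity", 35_000.0),
    (756, 380, 388, "Import Quantity", 20_000.0),
    (756, 276, 388, "Export Quantity", 5_000.0),
    (756, 724, 388, "Import Value", 99.0),
]
print(aggregate_trade_rows(rows))
```

Doing this aggregation once when the cache is built lets each later lookup be a plain dictionary access instead of a scan over the full trade matrix.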
Cache File Structure
temp_data/origin_gfm/
├── domestic_df_{suffix}.hdf5 # Domestic production (HDF5)
├── import_export_cache_{suffix}.msgpack # Pre-aggregated trade data
├── item_codes_{suffix}.csv # FAO product codes
├── import_export_{version}.zip # Source archive
└── domestic_{version}.zip # Source archive
Google Drive Integration
Cache files are synchronized with Google Drive for shared access:
def download_file_from_google_drive(self, drive_service: Resource, filename: str, file_id: str):
    """Download cache file from Google Drive if not available locally."""
def upload_cache_file_to_google_drive(self, drive_service: Resource, filename: str):
    """Upload regenerated cache file to Google Drive."""
Known Limitations
Data Quality
- Rotterdam Effect: Goods re-exported through trade hubs (Netherlands, Belgium) may appear as originating from those countries rather than true origins
- FAO Data Gaps: Some products lack trade statistics, falling back to default origin estimation
- Inconsistent Balances: Some country/product combinations have exports exceeding production + imports
- Temporal Lag: FAO data publication typically lags 1-2 years behind current date
Coverage Gaps
- Fish Products: Limited FAO trade data for fish; uses "unknown origin" with default transport
- Processed Products: Complex processed foods may not have direct FAO mappings
- Regional Products: Products like local water or specialty items use "local production" code
Model Assumptions
- Equal Distribution: When multiple origins are user-specified without percentages, they receive equal shares
- Kitchen Location Required: Origin estimation requires a known consumption country
- Top 10 Country Limit: Default FAO statistics only use top 10 import countries to limit graph complexity
- 90% Coverage Threshold: Import sources are kept in descending order of volume until they cover 90% of total imports; the remaining minor sources are excluded
Processing Constraints
- Non-Food Exclusion: Products matched to non-food FoodEx2 terms (EAT-0002, EAT-0000) are skipped
- Subdivision Handling: Subdivision nodes are processed differently (skipped by Origin GFM)
- Combined Products: Regional origins on combined products are not expanded to avoid graph explosion
References
- FAO Statistics Division. FAOSTAT Database. Food and Agriculture Organization of the United Nations.
- FAO Detailed Trade Matrix. Trade Data. Bilateral trade flows for agricultural commodities.
- BACI Database (Alternative). CEPII BACI. International trade database with improved origin tracking.
- GADM Database. Global Administrative Areas. Geographic boundaries and administrative regions.
- FoodEx2 Classification. EFSA FoodEx2. European Food Safety Authority food classification system.