Origin GFM

The Origin Gap Filling Module determines the geographic origin of food products when not explicitly specified. Using FAO (Food and Agriculture Organization) trade statistics, it calculates import shares and domestic production ratios to estimate where ingredients most likely originate based on the "kitchen location" (consumption country).

Quick Reference

Property | Description
Runs on | FoodProductFlowNode (excluding subdivisions)
Dependencies | LocationGapFillingWorker, AddClientNodesGapFillingWorker, InventoryConnectorGapFillingWorker, MatchProductNameGapFillingWorker, LinkTermToActivityNodeGapFillingWorker, IngredientAmountEstimatorGapFillingWorker
Key Input | Kitchen location (country), matched product FoodEx2 terms, FAO trade data
Output | Origin split with country-specific percentages, location properties
Data Source | FAO STAT Detailed Trade Matrix, FAO Production Statistics

When It Runs

The module triggers when:

  1. The node is a FoodProductFlowNode (not a subdivision)
  2. No origin is already specified, OR multiple origins need processing
  3. All dependent GFMs have completed
  4. Sub-activity nodes have production amounts calculated

Key Output

The module modifies the calculation graph by:

  • Origin Split: Creating multiple product flow nodes, each with a specific origin country
  • Percentage Allocation: Distributing quantities based on import statistics and domestic production
  • Location Properties: Setting country codes, coordinates, and GADM terms

Scientific Methodology

The origin model calculates the probability distribution of where a product originates based on:

Total Supply = Domestic Production + Imports - Exports

For each country of origin, the share is calculated as:

Domestic Share = (Domestic Production - Exports) / Total Supply
Foreign Share = 1 - Domestic Share
Import Share per Country = Foreign Share * (Country Import / Total Import)
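The formulas above can be sketched as a small function. This is a minimal illustration, not the module's code: the function name `origin_shares`, the `"DOMESTIC"` placeholder key, and the input numbers are assumptions made for this example.

```python
def origin_shares(domestic_production, exports, imports_by_country):
    """Return {origin: share} for one product in one consumption country.

    imports_by_country maps an origin country code to imported tonnes.
    The "DOMESTIC" key is a placeholder chosen for this sketch.
    """
    total_import = sum(imports_by_country.values())
    total_supply = domestic_production + total_import - exports
    if total_supply <= 0:
        raise ValueError("non-positive total supply; origin cannot be estimated")

    domestic_share = (domestic_production - exports) / total_supply
    foreign_share = 1 - domestic_share

    shares = {"DOMESTIC": domestic_share}
    if total_import > 0:
        # Distribute the foreign share in proportion to each country's imports.
        for country, tonnes in imports_by_country.items():
            shares[country] = foreign_share * tonnes / total_import
    return shares


# Illustrative numbers only (tonnes).
shares = origin_shares(
    domestic_production=60.0,
    exports=10.0,
    imports_by_country={"ES": 30.0, "IT": 20.0},
)
print(shares)
```

Because the foreign share is defined as the complement of the domestic share, the returned shares sum to 1 whenever imports are present.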

Trade Balance Model

The model uses FAO STAT data to compute origin distributions:

Domestic Production Share

domestic_import_export_total = domestic_production + import_value - export_value

domestic_origin_share = (domestic_production - export_value) / domestic_import_export_total

If domestic production exceeds exports, the surplus is assumed to be consumed domestically, which yields a positive domestic share. The remaining share is allocated to the countries from which the kitchen country imports.

Import Country Distribution

For foreign origins, the module identifies the top contributing countries:

# Get countries covering 90% of total imports
m49_country_share = import_export_cache.import_countries_top90percent(
    kitchen_country_m49_code, fao_code, year_column
)

# Scale each country's relative import share by the overall foreign share
for country, value in m49_country_share.items():
    country_total_percentage_share[country] = foreign_origin_share * value

Restricting the split to the countries that together cover 90% of imports keeps the number of origin nodes manageable while retaining most of the import volume. For products without specific FAO codes (default statistics), only the top 10 countries are used.
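One way such a selection could work is sketched below. It is an assumption-laden illustration: the function name `top_share_countries`, the descending-order cumulative cut, and the renormalization of the kept shares to sum to 1 are choices made for this example and are not confirmed by the cache implementation.

```python
def top_share_countries(imports_by_country, coverage=0.9, max_countries=None):
    """Keep the largest import sources until `coverage` of the total is reached.

    Returns relative shares of the kept countries, renormalized to sum to 1
    (the renormalization is an assumption of this sketch).
    """
    total = sum(imports_by_country.values())
    if total <= 0:
        return {}

    ranked = sorted(imports_by_country.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    covered = 0.0
    for country, tonnes in ranked:
        if max_countries is not None and len(selected) >= max_countries:
            break
        selected.append((country, tonnes))
        covered += tonnes
        if covered >= coverage * total:
            break

    kept_total = sum(tonnes for _, tonnes in selected)
    return {country: tonnes / kept_total for country, tonnes in selected}


# Illustrative import quantities in tonnes.
relative_shares = top_share_countries({"ES": 50.0, "IT": 30.0, "NL": 15.0, "MA": 5.0})
print(relative_shares)
```

Passing `max_countries=10` would mimic the top-10 limit described for default statistics.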

Inconsistency Handling

FAO trade data can have inconsistencies. The module handles these cases:

Scenario | Handling
Export >= Domestic + Import | Set "unknown origin"
No domestic production data | Use import statistics only
No trade data for product | Fall back to default FAO statistics
Negative total supply | Set "unknown origin"
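The scenarios in the table can be expressed as a small decision function. This is a sketch under assumptions: the `Handling` enum, the function `choose_handling`, the use of `None` for missing data, and the order of the checks are all hypothetical and chosen for illustration.

```python
from enum import Enum
from typing import Optional


class Handling(Enum):
    UNKNOWN_ORIGIN = "unknown origin"
    IMPORTS_ONLY = "import statistics only"
    DEFAULT_STATISTICS = "default FAO statistics"
    TRADE_BALANCE = "regular trade balance model"


def choose_handling(
    domestic_production: Optional[float],
    imports: Optional[float],
    exports: Optional[float],
) -> Handling:
    """Pick a handling strategy; None stands for missing data in this sketch."""
    if imports is None or exports is None:
        return Handling.DEFAULT_STATISTICS  # no trade data for the product
    if domestic_production is None:
        return Handling.IMPORTS_ONLY  # no domestic production data
    if exports >= domestic_production + imports:
        # Total supply would be zero or negative.
        return Handling.UNKNOWN_ORIGIN
    return Handling.TRADE_BALANCE


print(choose_handling(45000, 75000, 5000))
print(choose_handling(10, 5, 20))
```

The check `exports >= domestic_production + imports` also covers the negative total supply row, since total supply is domestic production plus imports minus exports.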

Implementation Details

FAO Data Configuration

The module uses two primary FAO datasets:

FAO_IMPORT_EXPORT_ZIP_URL = "https://bulks-faostat.fao.org/production/Trade_DetailedTradeMatrix_E_All_Data.zip"
FAO_DOMESTIC_ZIP_URL = "https://bulks-faostat.fao.org/production/Production_Crops_Livestock_E_All_Data.zip"

Data files extracted:

  • Trade_DetailedTradeMatrix_E_All_Data_NOFLAG.csv - Import/export quantities by country pair
  • Production_Crops_Livestock_E_All_Data_NOFLAG.csv - Domestic production by country
  • Trade_DetailedTradeMatrix_E_ItemCodes.csv - FAO product code definitions

Year Selection

The module supports primary and secondary years for data availability:

# If primary year data is missing, fall back to secondary year
if pd.isnull(domestic_production_value.values[0]):
    stats_data_year_column = self.gfm_factory.secondary_year
    domestic_production_value = domestic_production_row[stats_data_year_column]
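The fallback can be illustrated without pandas. In this minimal sketch the helper `value_with_year_fallback`, the dictionary-based row, and the column names `Y2021` and `Y2020` are assumptions made for the example; the module itself works on DataFrame rows.

```python
import math


def value_with_year_fallback(row, primary_year, secondary_year):
    """Return (value, year_column_used); (None, None) if both years are missing."""
    for year_column in (primary_year, secondary_year):
        value = row.get(year_column)
        if value is not None and not math.isnan(value):
            return value, year_column
    return None, None


# Illustrative row: the primary year is missing, the secondary year is present.
row = {"Y2021": float("nan"), "Y2020": 45000.0}
value, used_column = value_with_year_fallback(row, "Y2021", "Y2020")
print(value, used_column)
```

Returning the column that was actually used lets later lookups, such as the trade data query, stay on the same year.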

Special FAO Codes

Custom FAO codes handle edge cases:

FAO Code | Name | Description
100000 | Local Production | Products like water that are locally sourced
200000 | Default Production | Products without specific FAO mappings
300000 | Unknown Fish Production | Fish products without trade data
400000 | Failed Production | Fallback when origin estimation fails
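For readers who want to refer to these codes in their own tooling, they could be collected in an enumeration. The class name `SpecialFaoCode` and the helper `is_special_fao_code` are hypothetical; only the numeric values and their meanings come from the table above.

```python
from enum import IntEnum


class SpecialFaoCode(IntEnum):
    LOCAL_PRODUCTION = 100000         # locally sourced products such as water
    DEFAULT_PRODUCTION = 200000       # products without specific FAO mappings
    UNKNOWN_FISH_PRODUCTION = 300000  # fish products without trade data
    FAILED_PRODUCTION = 400000        # fallback when origin estimation fails


_SPECIAL_VALUES = frozenset(int(code) for code in SpecialFaoCode)


def is_special_fao_code(fao_code: int) -> bool:
    """True if the code is one of the custom codes rather than a regular FAO item code."""
    return fao_code in _SPECIAL_VALUES


print(is_special_fao_code(100000), is_special_fao_code(388))
```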

Processing Modes

Monoproduct Processing

For single-ingredient products:

async def monoproduct_origin_processing(self, calc_graph: "CalcGraph") -> None:
    # 1. Check if origin already specified
    origins_list = await self.parse_current_origins()

    if len(origins_list) == 1:
        return  # Already has origin

    # 2. Get kitchen location from inherited country code
    kitchen_country_code = get_inherited_country_code(self.node)
    kitchen_country_m49_code = iso_to_m49_mapping.get(kitchen_country_code)

    # 3. Find FAO code for product via FoodEx2 mapping
    fao_code_term = linked_fao_code_terms.get(foodex2_term_uids)

    # 4. Calculate domestic + import shares
    # 5. Apply origin split to graph

Combined Product Processing

For products with multiple ingredients:

async def combined_product_origin_processing(self, calc_graph: "CalcGraph") -> None:
    origins_list = await self.parse_current_origins(combined_product_origin_processing=True)

    # Equal distribution among specified origins
    country_total_percentage_share = {
        origin.country_code: 1 / len(origins_list)
        for origin in origins_list
    }

    await self.apply_origin_split_to_graph(calc_graph, country_total_percentage_share, is_combined_product=True)

Origin Split Graph Mutation

When multiple origins are determined, the graph is modified:

async def apply_origin_split_to_graph(
    self,
    calc_graph: "CalcGraph",
    country_total_percentage_share: dict,
    is_origin_from_fao: bool = False,
    ...
) -> None:
    # 1. Remove edge between product and sub-activity
    remove_edge_mutation = RemoveEdgeMutation(...)

    # 2. Add new Origin-Split-Activity node
    activity_node = FoodProcessingActivityNode.model_construct(...)
    calc_graph.apply_mutation(AddNodeMutation(...))

    # 3. Create origin-specific flow nodes
    for country_iso_code, percentage in country_total_percentage_share.items():
        # Duplicate original node with new amount
        calc_graph.apply_mutation(DuplicateNodeMutation(...))

        # Set proportional amount
        new_amount = percentage * activity_node.production_amount.value

        # Set location property
        origin_location = LocationProp.unvalidated_construct(
            address=gadm_term.name,
            country_code=country_iso_code,
            term_uid=gadm_term.uid,
            source=LocationSourceEnum.fao_stat if is_origin_from_fao else LocationSourceEnum.gadm,
        )
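The proportional amount step can be isolated from the graph machinery. The sketch below uses a hypothetical `OriginFlow` dataclass and `split_amount` helper in place of the real node and mutation classes; dividing by the sum of the shares is an extra safeguard assumed here, so that rounded shares still add up to the full amount.

```python
from dataclasses import dataclass


@dataclass
class OriginFlow:
    country_code: str
    amount: float


def split_amount(production_amount, country_shares):
    """Split one production amount into origin-specific flows by share."""
    total_share = sum(country_shares.values())
    if total_share <= 0:
        raise ValueError("shares must sum to a positive value")
    return [
        OriginFlow(country_code, production_amount * share / total_share)
        for country_code, share in country_shares.items()
    ]


# Illustrative input: 2 kg split between two origins.
flows = split_amount(2.0, {"CH": 0.25, "ES": 0.75})
for flow in flows:
    print(flow.country_code, flow.amount)
```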

Regional Origin Handling

When a regional origin (like "Europe") is specified, it expands to component countries:

# Handle regional terms
if term_xid in location_to_regional_term_xid_map.values():
    for region in regional_term_xid_to_region_gadm_codes_map.get(term_xid, []):
        if region not in already_listed_regions:
            country_code = iso_3166_map_3_to_2_letter.get(region.split(".")[0])
            locations.append(LocationProp.unvalidated_construct(
                country_code=country_code,
                term_uid=gadm_term.uid,
                location_qualifier=location_qualifier,
            ))
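The key step in this expansion is deriving a two-letter country code from a GADM code by taking the part before the first dot. A self-contained sketch follows; the mapping contents, the region identifier `hypothetical-region`, and the country-level de-duplication are assumptions made for illustration.

```python
# Illustrative subset of an alpha-3 to alpha-2 mapping.
ALPHA3_TO_ALPHA2 = {"CHE": "CH", "DEU": "DE", "FRA": "FR"}

# Hypothetical regional term mapped to the GADM codes it contains.
REGION_TO_GADM_CODES = {"hypothetical-region": ["CHE", "DEU.2", "FRA", "CHE.1"]}


def expand_region(region_xid, already_listed=()):
    """Return two-letter country codes for a region, skipping listed GADM codes."""
    seen = set(already_listed)
    country_codes = []
    for gadm_code in REGION_TO_GADM_CODES.get(region_xid, []):
        if gadm_code in seen:
            continue
        seen.add(gadm_code)
        # "DEU.2" -> "DEU" -> "DE"
        alpha2 = ALPHA3_TO_ALPHA2.get(gadm_code.split(".")[0])
        if alpha2 is not None and alpha2 not in country_codes:
            country_codes.append(alpha2)
    return country_codes


expanded = expand_region("hypothetical-region", already_listed=["FRA"])
print(expanded)
```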

Data Sources

FAO STAT Database

The primary data source is the FAO Statistics Division:

Trade Data: FAO Detailed Trade Matrix

  • Contains bilateral trade flows between countries
  • Quantities in tonnes
  • Elements: Import Quantity, Export Quantity, Import Value, Export Value

Production Data: FAO Production Statistics

  • Contains domestic production by country and product
  • Quantities in tonnes

Country Code Mappings

The module uses multiple country code systems:

System | Description | Example
M49 | UN numeric codes | 756 (Switzerland)
ISO 3166-1 alpha-2 | Two-letter codes | CH
ISO 3166-1 alpha-3 | Three-letter codes | CHE
GADM | Geographic database codes | CHE.1.2 (sub-regions)

FoodEx2 to FAO Mapping

Products are linked via glossary service:

fao_glossary_links = await glossary_link_service.get_glossary_links_by_gfm(
    gap_filling_module="FAO"
)

# Maps FoodEx2 term UIDs -> FAO code term UID
for link in fao_glossary_links:
    linked_fao_code_terms[frozenset(link.term_uids)] = link.linked_term_uid
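Keying the dictionary by `frozenset` makes the lookup independent of the order in which the matched terms are listed, and it requires the full set of terms to match. The short sketch below demonstrates this with made-up identifiers; the link data and the helper `find_fao_term` are hypothetical.

```python
# Hypothetical glossary links: (FoodEx2 term UIDs, linked FAO code term UID).
links = [
    (["foodex2-term-a", "foodex2-term-b"], "fao-term-1"),
    (["foodex2-term-c"], "fao-term-2"),
]

linked_fao_code_terms = {}
for term_uids, linked_term_uid in links:
    # frozenset is hashable and ignores order, so it can serve as a dict key.
    linked_fao_code_terms[frozenset(term_uids)] = linked_term_uid


def find_fao_term(matched_term_uids):
    """Return the linked FAO term UID for an exact set of matched terms, else None."""
    return linked_fao_code_terms.get(frozenset(matched_term_uids))


print(find_fao_term(["foodex2-term-b", "foodex2-term-a"]))
print(find_fao_term(["foodex2-term-a"]))
```

A partial match, such as only one of two linked terms, returns `None` in this sketch because the lookup is on the exact set.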

Calculation Example

Scenario: Determining origin for 1 kg of tomatoes consumed in Switzerland (CH)

Step 1: Identify Product

  • FoodEx2 term matched: A0DMX (Tomatoes)
  • FAO code via glossary link: 388 (Tomatoes)
  • Kitchen country: Switzerland (M49: 756)

Step 2: Query Trade Statistics

FAO data for Switzerland and tomatoes (FAO code 388):

Data Point | Value (tonnes)
Domestic Production | 45,000
Total Imports | 75,000
Total Exports | 5,000

Step 3: Calculate Shares

total_supply = 45,000 + 75,000 - 5,000 = 115,000

domestic_share = (45,000 - 5,000) / 115,000 = 0.348 (34.8%)
foreign_share = 1 - 0.348 = 0.652 (65.2%)

Step 4: Distribute Import Share

Top source countries for Swiss tomato imports:

Country | Import (tonnes) | % of Imports | Final Share
Spain | 35,000 | 46.7% | 30.4%
Italy | 20,000 | 26.7% | 17.4%
Netherlands | 10,000 | 13.3% | 8.7%
Morocco | 5,000 | 6.7% | 4.3%
Other | 5,000 | 6.7% | 4.3%
Switzerland (domestic) | - | - | 34.8%

Step 5: Graph Modification

The module creates origin split nodes:

Original:
Tomatoes (1 kg) -> Activity -> ...

After Origin GFM:
Tomatoes (1 kg) -> Origin-Split-Activity
  |-> Tomatoes (0.348 kg, CH) -> Activity (copy) -> ...
  |-> Tomatoes (0.304 kg, ES) -> Activity (copy) -> ...
  |-> Tomatoes (0.174 kg, IT) -> Activity (copy) -> ...
  |-> Tomatoes (0.087 kg, NL) -> Activity (copy) -> ...
  |-> Tomatoes (0.043 kg, MA) -> Activity (copy) -> ...
  |-> Tomatoes (0.043 kg, OTHER) -> Activity (original) -> ...

Each origin-specific flow then receives appropriate transport calculations downstream.


Caching System

Import/Export Cache

Trade data is cached in MessagePack format for performance:

class ImportExportCache:
    def import_export_value(self, m49_code: int, fao_code: int, year_column: str) -> tuple[int, int]:
        """Returns import and export sums from cache."""
        return self.import_export_data[(m49_code, fao_code)][year_column]

    def import_countries_top90percent(self, m49_code: int, fao_code: int, year_column: str) -> dict:
        """Returns import relative share between countries covering top 90 percent."""
        return self.import_export_data["top90"][(m49_code, fao_code)][year_column]

Cache File Structure

temp_data/origin_gfm/
├── domestic_df_{suffix}.hdf5 # Domestic production (HDF5)
├── import_export_cache_{suffix}.msgpack # Pre-aggregated trade data
├── item_codes_{suffix}.csv # FAO product codes
├── import_export_{version}.zip # Source archive
└── domestic_{version}.zip # Source archive

Google Drive Integration

Cache files are synchronized with Google Drive for shared access:

def download_file_from_google_drive(self, drive_service: Resource, filename: str, file_id: str):
    """Download cache file from Google Drive if not available locally."""

def upload_cache_file_to_google_drive(self, drive_service: Resource, filename: str):
    """Upload regenerated cache file to Google Drive."""

Known Limitations

Data Quality

  • Rotterdam Effect: Goods re-exported through trade hubs (Netherlands, Belgium) may appear as originating from those countries rather than true origins
  • FAO Data Gaps: Some products lack trade statistics, falling back to default origin estimation
  • Inconsistent Balances: Some country/product combinations have exports exceeding production + imports
  • Temporal Lag: FAO data publication typically lags 1-2 years behind current date

Coverage Gaps

  • Fish Products: Limited FAO trade data for fish; uses "unknown origin" with default transport
  • Processed Products: Complex processed foods may not have direct FAO mappings
  • Regional Products: Products like local water or specialty items use "local production" code

Model Assumptions

  • Equal Distribution: When multiple origins are user-specified without percentages, they receive equal shares
  • Kitchen Location Required: Origin estimation requires a known consumption country
  • Top 10 Country Limit: Default FAO statistics only use top 10 import countries to limit graph complexity
  • 90% Coverage Threshold: Only the largest import sources, which together cover 90% of imports, are kept; the remaining minor sources are excluded

Processing Constraints

  • Non-Food Exclusion: Products matched to non-food FoodEx2 terms (EAT-0002, EAT-0000) are skipped
  • Subdivision Handling: Subdivision nodes are processed differently (skipped by Origin GFM)
  • Combined Products: Regional origins on combined products are not expanded to avoid graph explosion

References

  1. FAO Statistics Division. FAOSTAT Database. Food and Agriculture Organization of the United Nations.

  2. FAO Detailed Trade Matrix. Trade Data. Bilateral trade flows for agricultural commodities.

  3. BACI Database (Alternative). CEPII BACI. International trade database with improved origin tracking.

  4. GADM Database. Global Administrative Areas. Geographic boundaries and administrative regions.

  5. FoodEx2 Classification. EFSA FoodEx2. European Food Safety Authority food classification system.