Neural Network Architecture

Eaternity Forecast uses a sophisticated transformer-based neural network with attention mechanisms to predict daily demand for restaurant menu items. This document explains the technical architecture and AI methodology.

Architecture Overview

Model Type: Transformer Architecture

Forecast employs a transformer architecture, the same foundational technology behind modern language models like GPT and BERT, adapted specifically for time-series demand forecasting.

Why Transformers for Demand Forecasting?

Traditional forecasting methods (ARIMA, exponential smoothing) struggle with:

  • Complex multi-factor relationships
  • Long-range temporal dependencies
  • Non-stationary patterns
  • Multiple simultaneous seasonal cycles

Transformers excel at:

  • Pattern recognition across different time scales (daily, weekly, seasonal)
  • Attention mechanisms that identify relevant historical periods
  • Multi-factor integration (weather, events, menu changes)
  • Handling irregular patterns (holidays, special events)

Core Components

Input Layer
    ↓
Temporal Embedding
    ↓
Multi-Head Attention (×4 layers)
    ↓
Feed-Forward Networks
    ↓
Output Projection
    ↓
Prediction + Confidence Intervals

Detailed Architecture

1. Input Layer

Purpose: Transform raw sales data and external factors into numerical representations

Input Features (per item, per day):

Historical Sales Features

  • Previous 7 days sales (daily quantities)
  • Previous 4 weeks same-day-of-week sales
  • Previous month average sales
  • Previous year same-date sales (if available)

Temporal Features

  • Day of week (one-hot encoded, e.g. Wednesday → [0, 0, 1, 0, 0, 0, 0])
  • Week of year (1-52)
  • Month (1-12)
  • Is weekend (binary: 0/1)
  • Is holiday (binary: 0/1)

External Features

  • Temperature (°C, normalized)
  • Precipitation (mm, normalized)
  • Day-ahead weather forecast
  • Local events (binary flags or categorical)
  • Item category (starter, main, dessert, etc.)
  • Price point (normalized)
  • Days since item launch (for new items)
  • Item availability (binary: 0/1)

Feature Engineering Example:

Input vector for "Pasta Carbonara" on Wednesday, Jan 20, 2024:

Historical Sales:
  [52, 48, 45, 51, 49, 0, 0]   # Last 7 days (0 = closed)
  [49, 51, 48, 52]             # Last 4 Wednesdays
  47.3                         # Last month average

Temporal:
  [0, 0, 1, 0, 0, 0, 0]        # Day of week (Wed = position 3)
  3                            # Week of year
  1                            # Month (January)
  0                            # Is weekend
  0                            # Is holiday

External:
  8.2                          # Temperature (°C)
  0.0                          # Precipitation (mm)
  7.5                          # Forecast temperature for tomorrow (°C)

Menu:
  [0, 1, 0, 0]                 # Category (Main Course)
  14.50                        # Price in menu currency (min-max normalized to 0-1 before input)
  450                          # Days since launch
  1                            # Available today

2. Temporal Embedding

Purpose: Encode time-based patterns and cyclical relationships

Positional Encoding:

Uses sinusoidal functions to capture periodic patterns:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

  • pos = position in sequence (day number)
  • i = dimension
  • d_model = embedding dimension (256)
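
As a concrete illustration, here is a minimal NumPy sketch of the encoding defined above (function and variable names are illustrative, not the production code):

import numpy as np

def positional_encoding(num_days: int, d_model: int = 256) -> np.ndarray:
    """Return a (num_days, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(num_days)[:, None]               # day index: 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]                # even dimensions 2i = 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)  # pos / 10000^(2i/d_model)

    pe = np.zeros((num_days, d_model))
    pe[:, 0::2] = np.sin(angles)                            # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                            # PE(pos, 2i+1)
    return pe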

Why Sinusoidal Encoding?

  • Captures multiple cycles (daily, weekly, monthly, yearly)
  • Model learns which cycles are relevant for each item
  • Enables extrapolation beyond training period
  • Handles irregular spacing (holidays, closed days)

Example:

import math

# Weekly cycle encoding for day of week
weekly_encoding = [
    math.sin(2 * math.pi * day_of_week / 7),
    math.cos(2 * math.pi * day_of_week / 7),
]

# Annual cycle encoding
annual_encoding = [
    math.sin(2 * math.pi * day_of_year / 365),
    math.cos(2 * math.pi * day_of_year / 365),
]

3. Multi-Head Attention Mechanism

Purpose: Identify which historical periods are most relevant for current prediction

How Attention Works:

The model asks: "When predicting Wednesday lunch, which previous days should I pay attention to?"

Attention Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:

  • Q (Query): What we're trying to predict (today's demand)
  • K (Keys): Historical days to consider
  • V (Values): Actual sales from those days
  • d_k: Dimension of key vectors
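
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention (shapes and names are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_days, d_k), V: (n_days, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # similarity between the query and each historical day
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights                    # weighted sum of historical values, plus the attention weights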

Multi-Head Approach:

Instead of one attention mechanism, we use 8 parallel attention heads:

  1. Head 1: Focuses on same-day-of-week patterns

    • "Wednesdays are similar to previous Wednesdays"
  2. Head 2: Focuses on recent trends

    • "Last week's pattern continues"
  3. Head 3: Focuses on seasonal patterns

    • "Similar to this time last year"
  4. Head 4: Focuses on weather correlations

    • "Cold days like today have similar demand"
  5. Head 5: Focuses on event patterns

    • "Days with similar events nearby"
  6. Head 6: Focuses on menu dynamics

    • "Similar menu composition days"
  7. Head 7: Focuses on price sensitivity

    • "Days with similar pricing strategies"
  8. Head 8: Focuses on long-range trends

    • "Multi-month trend direction"

Example Attention Weights:

Predicting Wednesday, Jan 20, 2024 for Pasta Carbonara:

Head 1 (day-of-week) attention to previous Wednesdays:
Jan 13 (last Wed): 0.35 (most recent, highest weight)
Jan 6: 0.28
Dec 30: 0.18
Dec 23: 0.12
Other Wednesdays: 0.07

Head 2 (recent trend) attention to last 7 days:
Jan 19 (yesterday): 0.42
Jan 18: 0.24
Jan 17: 0.15
Jan 16: 0.10
Older: 0.09

Head 3 (seasonal) attention to last year:
Jan 20, 2023: 0.55 (same date last year)
Jan 14-28, 2023: 0.45 (surrounding dates)

4. Feed-Forward Networks

Purpose: Non-linear transformation and feature combination

Architecture:

Input (256 dimensions)
    ↓
Linear Layer 1 (256 → 1024)
    ↓
ReLU Activation
    ↓
Dropout (0.1)
    ↓
Linear Layer 2 (1024 → 256)
    ↓
Dropout (0.1)
    ↓
Residual Connection + Layer Normalization
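
A minimal PyTorch-style sketch of this block, using the dimensions stated above (illustrative only, not the production implementation):

import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int = 256, d_hidden: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),   # expansion: 256 -> 1024
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_hidden, d_model),   # compression: 1024 -> 256
            nn.Dropout(dropout),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.ff(x))    # residual connection + layer normalization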

Why Two Layers?

  • Expansion (256→1024): Creates high-dimensional representation space
  • Compression (1024→256): Extracts most relevant features

Dropout Regularization:

Randomly deactivates 10% of neurons during training to prevent overfitting:

  • Model learns robust patterns, not memorization
  • Improves generalization to new data
  • Critical for small datasets (some restaurants, new items)

5. Layer Stacking

Four Transformer Blocks stacked sequentially:

Block 1: Initial pattern recognition

Block 2: Refined pattern extraction

Block 3: High-level feature learning

Block 4: Final representation

Each block contains:

  • Multi-head attention layer
  • Layer normalization
  • Feed-forward network
  • Residual connections

Why Four Layers?

Balance between:

  • Complexity: More layers = more patterns recognized
  • Efficiency: Fewer layers = faster training and inference
  • Overfitting Risk: Too many layers = memorization instead of learning

Empirical testing showed that four layers were optimal for restaurant demand forecasting.

6. Output Projection

Purpose: Convert learned representation to quantity prediction

Quantile Regression Approach:

Instead of predicting a single value, model outputs three quantiles:

Linear Layer (256 → 3)

Outputs:
- 10th percentile (lower bound)
- 50th percentile (median/point estimate)
- 90th percentile (upper bound)
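
A minimal sketch of such an output head (PyTorch-style; sorting the three outputs is one simple way to keep the quantiles ordered and is an assumption here, not necessarily how the production model enforces it):

import torch
import torch.nn as nn

class QuantileHead(nn.Module):
    def __init__(self, d_model: int = 256, n_quantiles: int = 3):
        super().__init__()
        self.proj = nn.Linear(d_model, n_quantiles)   # 256 -> 3 (10th, 50th, 90th percentile)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.proj(x)
        return torch.sort(q, dim=-1).values           # enforce lower <= median <= upper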

Example Output:

{
  "item": "Pasta Carbonara",
  "date": "2024-01-20",
  "predictions": {
    "lower_bound": 45,      // 10th percentile
    "point_estimate": 52,   // 50th percentile (median)
    "upper_bound": 59       // 90th percentile
  }
}

Why Quantile Regression?

  • Captures prediction uncertainty naturally
  • Provides actionable confidence intervals
  • More robust to outliers than variance-based approaches
  • Aligns with decision-making needs (prepare for range, not single value)

Training Process

Data Preparation

1. Data Collection

Minimum requirements:

  • 30 days historical sales (90+ days recommended)
  • Complete daily records (no gaps)
  • Item-level quantities

2. Data Preprocessing

Normalization:

# Z-score normalization for quantities
normalized_quantity = (quantity - mean) / std_dev

# Min-max normalization for external features
normalized_temp = (temp - min_temp) / (max_temp - min_temp)

Handling Missing Values:

  • Closed days: Explicitly marked (not imputed)
  • Missing sales: Forward-fill when the gap is shorter than 3 consecutive days
  • Weather data: Interpolate from nearby stations

Outlier Detection:

# Identify and flag (but don't remove) outliers
z_score = (quantity - rolling_mean) / rolling_std
if abs(z_score) > 3:
    flag_as_potential_outlier()

Outliers preserved but weighted lower during training.

3. Sequence Generation

Create input sequences of varying lengths:

Short context (last 7 days):

[day-7, day-6, day-5, day-4, day-3, day-2, day-1] → [predict: day-0]

Medium context (last 4 weeks):

[week-4-same-day, week-3-same-day, week-2-same-day, week-1-same-day] → [predict: today]

Long context (seasonal):

[month-12-same-date, month-6-same-date, month-3-same-date] → [predict: today]
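
A simplified sketch of how the short-context training pairs could be built from a list of daily quantities (windowing only; the real pipeline also attaches the temporal and external features described earlier):

def make_short_context_pairs(daily_sales, context_len=7):
    """Return (previous context_len days, next-day target) pairs from a chronological list."""
    pairs = []
    for t in range(context_len, len(daily_sales)):
        context = daily_sales[t - context_len:t]   # day-7 ... day-1
        target = daily_sales[t]                    # day-0 to predict
        pairs.append((context, target))
    return pairs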

Training Algorithm

Loss Function: Quantile Loss

For each quantile q (0.1, 0.5, 0.9), with error = actual − predicted:

L_q = q * error          if error ≥ 0  (under-prediction: actual > predicted)
L_q = (q − 1) * error    if error < 0  (over-prediction: actual < predicted)

Total Loss = L_0.1 + L_0.5 + L_0.9

Why This Loss?

  • Penalizes under-prediction more for upper quantile (90th)
  • Penalizes over-prediction more for lower quantile (10th)
  • Balanced for median (50th)
  • Ensures proper quantile ordering (lower < median < upper)
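
A minimal NumPy sketch of this pinball loss, assuming the model's output stacks the three quantile predictions along its last dimension:

import numpy as np

def quantile_loss(actual, predicted, quantiles=(0.1, 0.5, 0.9)):
    """Pinball loss averaged over samples and summed over the three quantiles."""
    actual = np.asarray(actual, dtype=float)
    total = 0.0
    for i, q in enumerate(quantiles):
        error = actual - predicted[..., i]                        # > 0 means under-prediction
        total += np.mean(np.maximum(q * error, (q - 1) * error))  # pinball loss for quantile q
    return total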

Optimization

Optimizer: AdamW (Adam with weight decay)

Hyperparameters:

learning_rate = 0.001  # Initial learning rate
weight_decay = 0.01    # L2 regularization
beta_1 = 0.9           # First-moment decay (momentum)
beta_2 = 0.999         # Second-moment decay (adaptive learning rate)
epsilon = 1e-8         # Numerical stability

Learning Rate Schedule: Cosine annealing with warmup

# Warmup phase (first 10% of training)
lr = initial_lr * (step / warmup_steps)

# Cosine annealing phase
lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(π * step / total_steps))

Benefits:

  • Gradual warmup prevents early divergence
  • Cosine decay enables fine-tuning near end
  • Multiple cycles allow escaping local minima
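
A runnable sketch of this schedule (the 10% warmup fraction comes from the text; min_lr and the single-cycle decay are simplifying assumptions):

import math

def learning_rate(step, total_steps, max_lr=0.001, min_lr=1e-5, warmup_frac=0.10):
    """Linear warmup over the first 10% of steps, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return max_lr * step / warmup_steps                      # warmup phase
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))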

Training Iterations

Epochs: 100-200 depending on data size

Batch Size: 32 sequences

Validation Split: 20% of data held out

Early Stopping:

if validation_loss_not_improved_for(patience=15):  # patience measured in epochs
    stop_training()
    restore_best_model()

Regularization Techniques

1. Dropout

  • Attention dropout: 0.1
  • Feed-forward dropout: 0.1
  • Embedding dropout: 0.05

2. Weight Decay

  • L2 penalty on weights: 0.01
  • Prevents weights from growing too large

3. Label Smoothing

  • Slightly blur target values
  • Improves calibration of confidence intervals

4. Data Augmentation

  • Random jittering of historical values (±5%)
  • Simulates measurement uncertainty
  • Improves robustness
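
The ±5% jitter could be as simple as the following sketch (applied to the historical sales inputs during training):

import numpy as np

def jitter(history, scale=0.05, rng=None):
    """Multiply each historical quantity by a random factor in [1 - scale, 1 + scale]."""
    rng = rng or np.random.default_rng()
    history = np.asarray(history, dtype=float)
    return history * rng.uniform(1 - scale, 1 + scale, size=history.shape)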

Validation and Testing

Training/Validation/Test Split

Historical Data (180 days total):
- Training: Days 1-126 (70%)
- Validation: Days 127-162 (20%)
- Test: Days 163-180 (10%)

Walk-Forward Validation:

Instead of random split, use temporal split:

  • Train on past data only
  • Validate on future data
  • Prevents data leakage (using future to predict past)
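
A sketch of such a temporal split, using the 70/20/10 proportions from the example above (no shuffling, so the model never trains on data from after its validation or test period):

def temporal_split(days, train_frac=0.70, val_frac=0.20):
    """Split a chronologically ordered sequence into train/validation/test sets."""
    n = len(days)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return days[:train_end], days[train_end:val_end], days[val_end:]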

Performance Metrics

Mean Absolute Percentage Error (MAPE):

MAPE = (1/n) * Σ (|actual - predicted| / actual) * 100%

Calibration Score:

Expected: 80% of actuals within confidence interval
Actual: Count how many actuals fall within [lower, upper]
Calibration = Actual Coverage / Expected Coverage

Quantile Loss:

QL = sum over all three quantiles of the quantile loss defined above
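
Minimal sketches of the first two metrics (days with zero sales would need special handling in MAPE, e.g. exclusion):

import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted) / actual) * 100

def interval_coverage(actual, lower, upper):
    """Fraction of actuals falling inside the predicted [lower, upper] interval."""
    actual = np.asarray(actual)
    return np.mean((actual >= np.asarray(lower)) & (actual <= np.asarray(upper)))

# Calibration relative to the expected 80% coverage:
# calibration = interval_coverage(actual, lower, upper) / 0.80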

Inference Process

Daily Prediction Generation

Timeline: Runs automatically at 3:00 AM local time

Process:

  1. Data Collection (3:00-3:05 AM)

    • Fetch yesterday's final sales data
    • Retrieve weather forecast for next 7 days
    • Check event calendar for upcoming dates
    • Load current menu configuration
  2. Feature Engineering (3:05-3:10 AM)

    • Calculate rolling averages and trends
    • Encode temporal features
    • Normalize external variables
    • Create input sequences
  3. Model Inference (3:10-3:15 AM)

    • Forward pass through neural network
    • Generate predictions for next 7 days
    • Calculate confidence intervals
    • Compute accuracy metrics
  4. Post-Processing (3:15-3:20 AM)

    • Round predictions to integers
    • Apply business rules (minimum=0, maximum=capacity)
    • Flag unusual predictions for review
    • Generate accuracy reports
  5. Delivery (3:20-3:25 AM)

    • Push to API endpoints
    • Update dashboard interface
    • Send email alerts (if configured)
    • Log predictions for tracking

Latency: less than 5 minutes for typical restaurant (100 menu items)
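
The rounding and business-rule step in post-processing could look like this sketch (capacity being a per-item limit supplied by the restaurant):

import numpy as np

def apply_business_rules(prediction, capacity):
    """Round to whole portions and clamp between 0 and the item's production capacity."""
    return int(np.clip(np.rint(prediction), 0, capacity))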

Continuous Learning

How the Model Improves Over Time:

Weekly Retraining

Every Monday at 4:00 AM:

  1. Incorporate previous week's actual sales
  2. Retrain model with updated data
  3. Evaluate performance improvements
  4. Deploy updated model if accuracy improves

Feedback Integration

User-reported variances fed back into training:

  • Special events manually flagged
  • Unusual circumstances documented
  • Override reasons analyzed
  • Feature engineering improved

Concept Drift Detection

Monitor for changes in demand patterns:

if recent_accuracy < historical_accuracy - threshold:
    trigger_model_refresh()
    investigate_potential_concept_drift()

Common causes:

  • Menu changes
  • New competition
  • Seasonal transitions
  • Operational changes

Advanced Features

Multi-Item Correlation

Cross-Item Attention:

Model learns relationships between items:

  • Substitution effects: "If salmon is popular, salad demand decreases"
  • Complementary effects: "Dessert sales correlate with main course volume"
  • Menu engineering: "Specials cannibalize regular items"

Implementation:

# Attend not just to the item's own history, but also to related items
attention_context = [
    item_own_history,
    substitute_items_history,
    complement_items_history,
    category_average_history,
]

Ensemble Predictions

Multiple Model Variants:

Train several models with different architectures:

  • Model A: Transformer (primary)
  • Model B: LSTM (recurrent baseline)
  • Model C: Gradient boosting (tree-based)

Weighted Combination:

final_prediction = (
    0.70 * transformer_prediction +
    0.20 * lstm_prediction +
    0.10 * gbm_prediction
)

Weights determined by historical accuracy.

Uncertainty Quantification

Sources of Uncertainty:

  1. Aleatoric (irreducible randomness)

    • Guest behavior inherently unpredictable
    • Random events (weather fluctuations)
  2. Epistemic (model uncertainty)

    • Insufficient training data
    • Model capacity limitations
    • New scenarios not in training set

Confidence Scoring:

confidence_score = f(
    data_quality,            # How clean is the historical data?
    training_data_volume,    # How much data is available?
    similarity_to_training,  # How similar is the prediction day to training days?
    model_agreement,         # Do the ensemble models agree?
)

Higher confidence → narrower intervals
Lower confidence → wider intervals

Transfer Learning

Cross-Location Learning:

For restaurant chains:

  • Pre-train on data from all locations
  • Fine-tune on individual location data
  • Transfer patterns (seasonal, weather, events)
  • Faster ramp-up for new locations

Benefits:

  • New location predictions available immediately
  • Higher initial accuracy
  • Faster model convergence
  • Shared learning across organization

Model Performance

Benchmarks

Accuracy vs Baselines:

Method                      MAPE    Notes
Eaternity Forecast          12.8%   Transformer architecture
Human expert forecaster     17.1%   Manual baseline; Forecast's MAPE is ~25% lower
Previous week same-day      22.4%   Naive baseline
4-week average              19.7%   Simple statistical baseline
ARIMA                       16.2%   Traditional time-series model
LSTM                        14.1%   Recurrent neural network

Confidence Interval Calibration:

Expected: 80% of actuals within [lower, upper]
Achieved: 78.5% (well-calibrated)

Computational Requirements

Training:

  • GPU: NVIDIA RTX 4090 or equivalent
  • RAM: 32 GB minimum
  • Training time: 2-6 hours (depends on data volume)
  • Storage: 5-20 GB per restaurant

Inference:

  • CPU: Sufficient for real-time predictions
  • RAM: 8 GB
  • Latency: less than 100ms per prediction
  • Storage: less than 1 GB for deployed model

Research Foundation

Academic References

Forecast builds on peer-reviewed research:

  1. Transformer Architecture

    • Vaswani et al. (2017) "Attention Is All You Need"
    • Original transformer paper for NLP
  2. Time Series Forecasting

    • Zhou et al. (2021) "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting"
    • Temporal transformers for forecasting
  3. Demand Forecasting

    • Taylor & Letham (2018) "Forecasting at Scale" (Facebook Prophet)
    • Industry-scale forecasting systems
  4. Quantile Regression

    • Koenker & Bassett (1978) "Regression Quantiles"
    • Foundational quantile regression theory
  5. Restaurant Operations

    • Miller et al. (2015) "Forecasting Restaurant Demand"
    • Domain-specific forecasting challenges

Future Improvements

Roadmap

Q2 2024: Attention visualization

  • Show which historical days model focuses on
  • Explain predictions to users

Q3 2024: Recipe-level forecasting

  • Predict ingredient requirements directly
  • Integration with inventory systems

Q4 2024: Causal inference

  • Understand impact of menu changes before implementation
  • Simulate promotional effects

2025: Multi-modal learning

  • Incorporate social media sentiment
  • Visual menu analysis
  • Guest review sentiment

See Also