Neural Network Architecture
Eaternity Forecast uses a transformer-based neural network with attention mechanisms to predict daily demand for restaurant menu items. This document explains the technical architecture and AI methodology.
Architecture Overview
Model Type: Transformer Architecture
Forecast employs a transformer architecture, the same foundational technology behind modern language models like GPT and BERT, adapted specifically for time-series demand forecasting.
Why Transformers for Demand Forecasting?
Traditional forecasting methods (ARIMA, exponential smoothing) struggle with:
- Complex multi-factor relationships
- Long-range temporal dependencies
- Non-stationary patterns
- Multiple simultaneous seasonal cycles
Transformers excel at:
- Pattern recognition across different time scales (daily, weekly, seasonal)
- Attention mechanisms that identify relevant historical periods
- Multi-factor integration (weather, events, menu changes)
- Handling irregular patterns (holidays, special events)
Core Components
Input Layer
↓
Temporal Embedding
↓
Multi-Head Attention (×4 layers)
↓
Feed-Forward Networks
↓
Output Projection
↓
Prediction + Confidence Intervals
Detailed Architecture
1. Input Layer
Purpose: Transform raw sales data and external factors into numerical representations
Input Features (per item, per day):
Historical Sales Features
- Previous 7 days sales (daily quantities)
- Previous 4 weeks same-day-of-week sales
- Previous month average sales
- Previous year same-date sales (if available)
Temporal Features
- Day of week (one-hot encoded: 7-element vector, Monday first)
- Week of year (1-53)
- Month (1-12)
- Is weekend (binary: 0/1)
- Is holiday (binary: 0/1)
External Features
- Temperature (°C, normalized)
- Precipitation (mm, normalized)
- Day-ahead weather forecast
- Local events (binary flags or categorical)
Menu Features
- Item category (starter, main, dessert, etc.)
- Price point (normalized)
- Days since item launch (for new items)
- Item availability (binary: 0/1)
Feature Engineering Example:
Input vector for "Pasta Carbonara" on Wednesday, Jan 17, 2024:
Historical Sales:
[52, 48, 45, 51, 49, 0, 0] # Last 7 days (0 = closed)
[49, 51, 48, 52] # Last 4 Wednesdays
47.3 # Last month average
Temporal:
[0, 0, 1, 0, 0, 0, 0] # Day of week (Wed = position 3)
3 # Week of year
1 # Month (January)
0 # Is weekend
0 # Is holiday
External:
8.2 # Temperature (°C, raw value; normalized before input)
0.0 # Precipitation
7.5 # Forecast temp tomorrow
Menu:
[0, 1, 0, 0] # Category (Main Course)
14.50 # Price (raw value; scaled to 0-1 before input)
450 # Days since launch
1 # Available today
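The temporal and menu parts of such a vector can be assembled with a few helper functions. The following is a minimal sketch, not the production feature pipeline: the helper names, the category order in `CATEGORIES`, and the price range passed to `min_max` are all assumptions made for illustration.

```python
from datetime import date

CATEGORIES = ["starter", "main", "dessert", "other"]  # assumed category order


def one_hot(index, size):
    """Return a list of zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def min_max(value, lo, hi):
    """Scale `value` into [0, 1] given an assumed observed range."""
    return (value - lo) / (hi - lo)


def temporal_features(day, is_holiday=False):
    """Encode day of week (one-hot, Monday first), ISO week, month, and flags."""
    weekday = day.weekday()  # Monday == 0 ... Sunday == 6
    return [
        *one_hot(weekday, 7),
        day.isocalendar()[1],  # ISO week of year (1-53)
        day.month,             # 1-12
        int(weekday >= 5),     # is weekend
        int(is_holiday),       # is holiday
    ]


features = temporal_features(date(2024, 1, 17))  # a Wednesday
price = min_max(14.50, lo=5.0, hi=30.0)          # assumed menu price range
category = one_hot(CATEGORIES.index("main"), len(CATEGORIES))
```

Keeping the one-hot and scaling helpers separate makes it easy to apply the same transformations at training and inference time.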
2. Temporal Embedding
Purpose: Encode time-based patterns and cyclical relationships
Positional Encoding:
Uses sinusoidal functions to capture periodic patterns:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
- pos = position in sequence (day number)
- i = dimension index
- d_model = embedding dimension (256)
Why Sinusoidal Encoding?
- Captures multiple cycles (daily, weekly, monthly, yearly)
- Model learns which cycles are relevant for each item
- Enables extrapolation beyond training period
- Handles irregular spacing (holidays, closed days)
Example:
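import math

day_of_week, day_of_year = 2, 17  # example values: Wednesday (Monday = 0), 17th day of the year
# Weekly cycle encoding for day of week
weekly_encoding = [
    math.sin(2 * math.pi * day_of_week / 7),
    math.cos(2 * math.pi * day_of_week / 7),
]
# Annual cycle encoding
annual_encoding = [
    math.sin(2 * math.pi * day_of_year / 365),
    math.cos(2 * math.pi * day_of_year / 365),
]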
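The positional encoding formula itself can be written directly from its definition. This is a small sketch using only the standard library; the function name `positional_encoding` is an assumption, and `d_model=256` follows the embedding dimension stated above.

```python
import math


def positional_encoding(pos, d_model=256):
    """Sinusoidal encoding: even indices use sin, odd indices use cos."""
    encoding = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle))  # PE(pos, 2i)
        encoding.append(math.cos(angle))  # PE(pos, 2i+1)
    return encoding


pe_first_day = positional_encoding(pos=0)
pe_day_17 = positional_encoding(pos=17)
```

Low dimension indices oscillate quickly with `pos` while high indices change slowly, which is what lets one vector describe several time scales at once.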
3. Multi-Head Attention Mechanism
Purpose: Identify which historical periods are most relevant for current prediction
How Attention Works:
The model asks: "When predicting Wednesday lunch, which previous days should I pay attention to?"
Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where:
- Q (Query): What we're trying to predict (today's demand)
- K (Keys): Historical days to consider
- V (Values): Actual sales from those days
- d_k: Dimension of key vectors
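To make the formula concrete, here is a minimal single-query sketch of scaled dot-product attention in plain Python. The toy keys and values are invented for illustration; a real model would use learned projections and batched tensor operations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights and the weighted sum of the value vectors.
    """
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context


# Toy data: three historical days, each with a 2-d key and a 1-d value (sales).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[50.0], [30.0], [54.0]]
weights, context = attention(query, keys, values)
```

Days whose keys align with the query (the first and third here) receive equal, larger weights, so the output is pulled toward their sales values.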
Multi-Head Approach:
Instead of one attention mechanism, we use 8 parallel attention heads:
- Head 1: Focuses on same-day-of-week patterns
  - "Wednesdays are similar to previous Wednesdays"
- Head 2: Focuses on recent trends
  - "Last week's pattern continues"
- Head 3: Focuses on seasonal patterns
  - "Similar to this time last year"
- Head 4: Focuses on weather correlations
  - "Cold days like today have similar demand"
- Head 5: Focuses on event patterns
  - "Days with similar events nearby"
- Head 6: Focuses on menu dynamics
  - "Similar menu composition days"
- Head 7: Focuses on price sensitivity
  - "Days with similar pricing strategies"
- Head 8: Focuses on long-range trends
  - "Multi-month trend direction"
Example Attention Weights:
Predicting Wednesday, Jan 17, 2024 for Pasta Carbonara:
Head 1 (day-of-week) attention to previous Wednesdays:
Jan 10 (last Wed): 0.35 (most recent, highest weight)
Jan 3: 0.28
Dec 27: 0.18
Dec 20: 0.12
Other Wednesdays: 0.07
Head 2 (recent trend) attention to last 7 days:
Jan 16 (yesterday): 0.42
Jan 15: 0.24
Jan 14: 0.15
Jan 13: 0.10
Older: 0.09
Head 3 (seasonal) attention to last year:
Jan 17, 2023: 0.55 (same date last year)
Jan 10-24, 2023: 0.45 (surrounding dates)
4. Feed-Forward Networks
Purpose: Non-linear transformation and feature combination
Architecture:
Input (256 dimensions)
↓
Linear Layer 1 (256 → 1024)
↓
ReLU Activation
↓
Dropout (0.1)
↓
Linear Layer 2 (1024 → 256)
↓
Dropout (0.1)
↓
Residual Connection + Layer Normalization
Why Two Layers?
- Expansion (256→1024): Creates high-dimensional representation space
- Compression (1024→256): Extracts most relevant features
Dropout Regularization:
Randomly deactivates 10% of neurons during training to prevent overfitting:
- Model learns robust patterns, not memorization
- Improves generalization to new data
- Critical for small datasets (some restaurants, new items)
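The block structure above can be sketched in plain Python. This is an illustrative toy, not the deployed implementation: it uses 4 and 16 dimensions in place of 256 and 1024, random weights instead of trained ones, and a layer normalization without learned scale and shift. All function names are assumptions.

```python
import math
import random


def linear(x, weights, bias):
    """Compute weights @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def dropout(x, rate, rng, training):
    """Inverted dropout: zero each element with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def feed_forward_block(x, w1, b1, w2, b2, rate=0.1, rng=None, training=False):
    """Expand, apply ReLU, compress, then add the residual and normalize."""
    rng = rng or random.Random(0)
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]            # expansion + ReLU
    hidden = dropout(hidden, rate, rng, training)
    out = dropout(linear(hidden, w2, b2), rate, rng, training)   # compression
    return layer_norm([xi + oi for xi, oi in zip(x, out)])       # residual + norm


# Toy sizes (4 -> 16 -> 4) stand in for the 256 -> 1024 -> 256 layers.
rng = random.Random(42)
d_model, d_ff = 4, 16
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
y = feed_forward_block([1.0, 2.0, 3.0, 4.0], w1, [0.0] * d_ff, w2, [0.0] * d_model)
```

Dropout is only active when `training=True`; at inference the block is deterministic.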
5. Layer Stacking
Four Transformer Blocks stacked sequentially:
Block 1: Initial pattern recognition
↓
Block 2: Refined pattern extraction
↓
Block 3: High-level feature learning
↓
Block 4: Final representation
Each block contains:
- Multi-head attention layer
- Layer normalization
- Feed-forward network
- Residual connections
Why Four Layers?
Balance between:
- Complexity: More layers = more patterns recognized
- Efficiency: Fewer layers = faster training and inference
- Overfitting Risk: Too many layers = memorization instead of learning
Empirical testing showed four layers to be optimal for restaurant demand forecasting.
6. Output Projection
Purpose: Convert learned representation to quantity prediction
Quantile Regression Approach:
Instead of predicting a single value, the model outputs three quantiles:
Linear Layer (256 → 3)
↓
Outputs:
- 10th percentile (lower bound)
- 50th percentile (median/point estimate)
- 90th percentile (upper bound)
Example Output:
{
"item": "Pasta Carbonara",
"date": "2024-01-17",
"predictions": {
"lower_bound": 45, // 10th percentile
"point_estimate": 52, // 50th percentile (median)
"upper_bound": 59 // 90th percentile
}
}
Why Quantile Regression?
- Captures prediction uncertainty naturally
- Provides actionable confidence intervals
- More robust to outliers than variance-based approaches
- Aligns with decision-making needs (prepare for range, not single value)
Training Process
Data Preparation
1. Data Collection
Minimum requirements:
- 30 days of historical sales (90+ days recommended)
- Complete daily records (no gaps)
- Item-level quantities
2. Data Preprocessing
Normalization:
# Z-score normalization for quantities
normalized_quantity = (quantity - mean) / std_dev
# Min-max normalization for external features
normalized_temp = (temp - min_temp) / (max_temp - min_temp)
Handling Missing Values:
- Closed days: Explicitly marked (not imputed)
- Missing sales: Forward-fill if less than 3 consecutive days
- Weather data: Interpolate from nearby stations
Outlier Detection:
# Identify and flag (but don't remove) outliers
z_score = (quantity - rolling_mean) / rolling_std
if abs(z_score) > 3:
    flag_as_potential_outlier()
Outliers are preserved but given a lower weight during training.
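A runnable version of this idea might look like the sketch below. The 7-day window, the use of only preceding days for the rolling statistics, and the `outlier_weight` of 0.3 are assumptions chosen for illustration; the document does not specify them.

```python
from statistics import mean, stdev


def flag_outliers(quantities, window=7, threshold=3.0):
    """Flag values whose z-score against the preceding `window` days exceeds `threshold`."""
    flags = []
    for i, quantity in enumerate(quantities):
        history = quantities[max(0, i - window):i]
        if len(history) < 2:
            flags.append(False)  # not enough history to estimate spread
            continue
        rolling_mean = mean(history)
        rolling_std = stdev(history)
        if rolling_std == 0:
            flags.append(quantity != rolling_mean)
            continue
        flags.append(abs((quantity - rolling_mean) / rolling_std) > threshold)
    return flags


def sample_weights(flags, outlier_weight=0.3):
    """Keep every observation, but give flagged ones a smaller training weight."""
    return [outlier_weight if flagged else 1.0 for flagged in flags]


sales = [50, 52, 48, 51, 49, 50, 53, 120, 51]
flags = flag_outliers(sales)
weights = sample_weights(flags)
```

Only the spike of 120 is flagged here; the following day is judged against a window that already contains the spike, so it is not flagged.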
3. Sequence Generation
Create input sequences of varying lengths:
Short context (last 7 days):
[day-7, day-6, day-5, day-4, day-3, day-2, day-1] → [predict: day-0]
Medium context (last 4 weeks):
[week-4-same-day, week-3-same-day, week-2-same-day, week-1-same-day] → [predict: today]
Long context (seasonal):
[month-12-same-date, month-6-same-date, month-3-same-date] → [predict: today]
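The short and medium contexts can be generated with simple index arithmetic over a daily series. The sketch below assumes one value per calendar day with no gaps; the function names and the 28-day minimum history are illustrative choices.

```python
def short_context(series, t, length=7):
    """Return the `length` days immediately before index `t`."""
    return series[t - length:t]


def same_weekday_context(series, t, weeks=4):
    """Return sales from the same weekday in each of the previous `weeks` weeks, oldest first."""
    return [series[t - 7 * k] for k in range(weeks, 0, -1)]


def make_training_pairs(series, min_history=28):
    """Build (short context, weekly context, target) tuples for every day with enough history."""
    pairs = []
    for t in range(min_history, len(series)):
        pairs.append(
            (short_context(series, t), same_weekday_context(series, t), series[t])
        )
    return pairs


daily_sales = list(range(100, 135))  # 35 days of placeholder data
pairs = make_training_pairs(daily_sales)
```

The long (seasonal) context follows the same pattern with larger offsets, but needs at least a year of history before it can be built.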
Training Algorithm
Loss Function: Quantile Loss
For each quantile q (0.1, 0.5, 0.9), with error = actual - predicted:
L_q = q * error if error ≥ 0 (under-prediction)
(q - 1) * error if error < 0 (over-prediction)
Total Loss = L_0.1 + L_0.5 + L_0.9
Why This Loss?
- Penalizes under-prediction more for upper quantile (90th)
- Penalizes over-prediction more for lower quantile (10th)
- Balanced for median (50th)
- Encourages proper quantile ordering (lower < median < upper), although the loss alone does not strictly guarantee it
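As a worked example, the quantile (pinball) loss can be computed in a few lines. This sketch defines the error as actual minus predicted, so a positive error means the model under-predicted; the function names are assumptions.

```python
def pinball_loss(actual, predicted, q):
    """Quantile (pinball) loss for one observation, with error = actual - predicted."""
    error = actual - predicted
    return q * error if error >= 0 else (q - 1) * error


def total_quantile_loss(actual, predictions, quantiles=(0.1, 0.5, 0.9)):
    """Sum the pinball loss over the three predicted quantiles."""
    return sum(pinball_loss(actual, p, q) for p, q in zip(predictions, quantiles))


# Actual sales of 55 against a forecast of (45, 52, 59) for the 10th/50th/90th percentiles.
loss = total_quantile_loss(55, (45, 52, 59))

# Asymmetry at the 90th percentile: missing low costs more than missing high.
under = pinball_loss(60, 50, 0.9)  # under-prediction by 10
over = pinball_loss(40, 50, 0.9)   # over-prediction by 10
```

For the forecast above the three terms are 1.0, 1.5, and 0.4, giving a total loss of about 2.9.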
Optimization
Optimizer: AdamW (Adam with weight decay)
Hyperparameters:
learning_rate = 0.001 # Initial learning rate
weight_decay = 0.01 # Decoupled weight decay (AdamW)
beta_1 = 0.9 # Decay rate for first-moment (momentum) estimate
beta_2 = 0.999 # Decay rate for second-moment estimate
epsilon = 1e-8 # Numerical stability
Learning Rate Schedule: Cosine annealing with warmup
# Warmup phase (first 10% of training)
lr = max_lr * (step / warmup_steps)
# Cosine annealing phase (remaining steps)
progress = (step - warmup_steps) / (total_steps - warmup_steps)
lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(π * progress))
Benefits:
- Gradual warmup prevents early divergence
- Cosine decay enables fine-tuning near end
- If the schedule is restarted, multiple cycles can help escape local minima
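The schedule can be expressed as a single function of the training step. This is a sketch under stated assumptions: `min_lr=1e-5` is a hypothetical floor (the document gives no value), the cosine phase is measured from the end of warmup, and warmup uses `step + 1` so the very first step does not run with a learning rate of zero.

```python
import math


def learning_rate(step, total_steps, max_lr=0.001, min_lr=1e-5, warmup_fraction=0.1):
    """Linear warmup to `max_lr`, then a single cosine decay down to `min_lr`."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [learning_rate(s, total_steps=1000) for s in range(1000)]
```

Measuring cosine progress from the end of warmup keeps the curve continuous: the rate reaches `max_lr` on the last warmup step and starts decaying from exactly that value.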
Training Iterations
Epochs: 100-200 depending on data size
Batch Size: 32 sequences
Validation Split: 20% of data held out
Early Stopping:
if validation_loss_not_improved_for(patience=15_epochs):
    stop_training()
    restore_best_model()
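A patience-based early stopping rule needs only a small amount of state. The class below is a minimal sketch with hypothetical names; restoring the best model would additionally require saving a checkpoint whenever `best_epoch` changes, which is omitted here.

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def update(self, epoch, validation_loss):
        """Record one epoch; return True when training should stop."""
        if validation_loss < self.best_loss:
            self.best_loss = validation_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Simulated validation losses: improvement stalls after epoch 2.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
stopper = EarlyStopping(patience=15)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
```

In this simulation training stops at epoch 17, fifteen epochs after the best result at epoch 2.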
Regularization Techniques
1. Dropout
- Attention dropout: 0.1
- Feed-forward dropout: 0.1
- Embedding dropout: 0.05
2. Weight Decay
- Weight decay coefficient: 0.01 (applied by AdamW as decoupled decay rather than a classic L2 penalty)
- Prevents weights from growing too large
3. Label Smoothing
- Slightly blur target values
- Improves calibration of confidence intervals
4. Data Augmentation
- Random jittering of historical values (±5%)
- Simulates measurement uncertainty
- Improves robustness
Validation and Testing
Training/Validation/Test Split
Historical Data (180 days total):
- Training: Days 1-126 (70%)
- Validation: Days 127-162 (20%)
- Test: Days 163-180 (10%)
Walk-Forward Validation:
Instead of random split, use temporal split:
- Train on past data only
- Validate on future data
- Prevents data leakage (using future to predict past)
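Both the fixed temporal split and an expanding-window walk-forward scheme are short to write down. In this sketch the 90-day initial training window and the 7-day horizon of `walk_forward_folds` are assumptions for illustration, not documented settings.

```python
def temporal_split(days, train_frac=0.7, val_frac=0.2):
    """Split chronologically ordered days into train, validation, and test."""
    n = len(days)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return days[:train_end], days[train_end:val_end], days[val_end:]


def walk_forward_folds(n_days, initial_train=90, horizon=7):
    """Yield (train_indices, test_indices) with an expanding training window."""
    start = initial_train
    while start + horizon <= n_days:
        yield range(0, start), range(start, start + horizon)
        start += horizon


days = list(range(1, 181))  # day numbers 1..180
train, val, test = temporal_split(days)
folds = list(walk_forward_folds(len(days)))
```

In every fold the training indices end before the test indices begin, so no future observation can leak into training.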
Performance Metrics
Mean Absolute Percentage Error (MAPE):
MAPE = (1/n) * Σ |actual - predicted| / actual * 100%
Calibration Score:
Expected: 80% of actuals within confidence interval
Actual: Count how many actuals fall within [lower, upper]
Calibration = Actual Coverage / Expected Coverage
Quantile Loss:
QL = Σ all quantiles (quantile loss as defined above)
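The first two metrics can be computed as follows. One assumption is made explicit here: days with zero actual sales (for example closed days) are skipped in the MAPE, because the percentage error is undefined when the actual value is zero. The sample numbers are invented.

```python
def mape(actuals, predictions):
    """Mean absolute percentage error, skipping days with zero actual sales."""
    pairs = [(a, p) for a, p in zip(actuals, predictions) if a != 0]
    return 100 * sum(abs(a - p) / a for a, p in pairs) / len(pairs)


def coverage(actuals, lower, upper):
    """Fraction of actuals that fall inside [lower, upper]."""
    hits = sum(lo <= a <= hi for a, lo, hi in zip(actuals, lower, upper))
    return hits / len(actuals)


actuals = [50, 40, 0, 60, 55]
predicted = [45, 44, 0, 60, 66]
lower = [40, 38, 0, 52, 58]
upper = [52, 50, 0, 68, 74]

error = mape(actuals, predicted)
calibration = coverage(actuals, lower, upper) / 0.80  # expected coverage is 80%
```

A calibration value near 1.0 means the 10th to 90th percentile band contains about as many actuals as intended; values well below 1.0 indicate intervals that are too narrow.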
Inference Process
Daily Prediction Generation
Timeline: Runs automatically at 3:00 AM local time
Process:
- Data Collection (3:00-3:05 AM)
  - Fetch yesterday's final sales data
  - Retrieve weather forecast for next 7 days
  - Check event calendar for upcoming dates
  - Load current menu configuration
- Feature Engineering (3:05-3:10 AM)
  - Calculate rolling averages and trends
  - Encode temporal features
  - Normalize external variables
  - Create input sequences
- Model Inference (3:10-3:15 AM)
  - Forward pass through neural network
  - Generate predictions for next 7 days
  - Calculate confidence intervals
  - Compute accuracy metrics
- Post-Processing (3:15-3:20 AM)
  - Round predictions to integers
  - Apply business rules (minimum=0, maximum=capacity)
  - Flag unusual predictions for review
  - Generate accuracy reports
- Delivery (3:20-3:25 AM)
  - Push to API endpoints
  - Update dashboard interface
  - Send email alerts (if configured)
  - Log predictions for tracking
Latency: less than 5 minutes per stage for a typical restaurant (100 menu items)
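The post-processing rules are simple enough to sketch directly. The rounding and clamping follow the business rules listed above; the review rule (a deviation of more than 50% from the recent average) is a hypothetical threshold, since the document does not say how unusual predictions are identified.

```python
def post_process(prediction, capacity):
    """Round to whole portions, then clamp to the range [0, capacity]."""
    return min(max(round(prediction), 0), capacity)


def needs_review(prediction, recent_average, tolerance=0.5):
    """Flag a prediction that deviates from the recent average by more than `tolerance`."""
    if recent_average == 0:
        return prediction > 0
    return abs(prediction - recent_average) / recent_average > tolerance


cleaned = [post_process(p, capacity=80) for p in (51.6, -2.4, 93.2)]
flagged = needs_review(90, recent_average=50)
```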
Continuous Learning
How the Model Improves Over Time:
Weekly Retraining
Every Monday at 4:00 AM:
- Incorporate previous week's actual sales
- Retrain model with updated data
- Evaluate performance improvements
- Deploy updated model if accuracy improves
Feedback Integration
User-reported variances fed back into training:
- Special events manually flagged
- Unusual circumstances documented
- Override reasons analyzed
- Feature engineering improved
Concept Drift Detection
Monitor for changes in demand patterns:
if recent_accuracy < historical_accuracy - threshold:
    trigger_model_refresh()
    investigate_potential_concept_drift()
Common causes:
- Menu changes
- New competition
- Seasonal transitions
- Operational changes
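The drift check can be phrased in terms of error instead of accuracy: recent error exceeding the historical error by more than a threshold. The sketch below assumes a list of daily MAPE values, a 14-day recent window, and a threshold of 3 percentage points; all three are illustrative choices.

```python
from statistics import mean


def drift_detected(daily_errors, recent_window=14, threshold=3.0):
    """Return True if recent error exceeds the earlier baseline by more than `threshold`."""
    if len(daily_errors) <= recent_window:
        return False  # not enough history to form a baseline
    recent = mean(daily_errors[-recent_window:])
    baseline = mean(daily_errors[:-recent_window])
    return recent - baseline > threshold


stable = [12.0] * 60
drifting = [12.0] * 46 + [18.0] * 14  # error jumps in the last two weeks
```

Calling `drift_detected(drifting)` would trigger a model refresh, while `drift_detected(stable)` would not.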
Advanced Features
Multi-Item Correlation
Cross-Item Attention:
Model learns relationships between items:
- Substitution effects: "If salmon is popular, salad demand decreases"
- Complementary effects: "Dessert sales correlate with main course volume"
- Menu engineering: "Specials cannibalize regular items"
Implementation:
# Attend not just to item's own history, but related items
attention_context = [
    item_own_history,
    substitute_items_history,
    complement_items_history,
    category_average_history,
]
Ensemble Predictions
Multiple Model Variants:
Train several models with different architectures:
- Model A: Transformer (primary)
- Model B: LSTM (recurrent baseline)
- Model C: Gradient boosting (tree-based)
Weighted Combination:
final_prediction = (
    0.70 * transformer_prediction +
    0.20 * lstm_prediction +
    0.10 * gbm_prediction
)
Weights are determined by each model's historical accuracy.
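One common way to derive such weights is to make them proportional to the inverse of each model's historical error. The document does not state which rule is used, so the inverse-MAPE scheme below is an assumption, and the gradient boosting MAPE of 15.0 is a hypothetical value (the transformer and LSTM figures come from the benchmark table).

```python
def inverse_error_weights(mapes):
    """Weight each model in proportion to 1 / MAPE, normalized to sum to 1."""
    inverse = {name: 1.0 / err for name, err in mapes.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine(predictions, weights):
    """Weighted average of the per-model predictions."""
    return sum(weights[name] * predictions[name] for name in predictions)


derived = inverse_error_weights({"transformer": 12.8, "lstm": 14.1, "gbm": 15.0})

# The fixed weights from the formula above, applied to example predictions.
fixed = {"transformer": 0.70, "lstm": 0.20, "gbm": 0.10}
final_prediction = combine({"transformer": 52, "lstm": 50, "gbm": 55}, fixed)
```

Inverse-error weights are much flatter than 0.70/0.20/0.10 when the models have similar error, so a production system that strongly favors one model is likely using a different rule.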
Uncertainty Quantification
Sources of Uncertainty:
- Aleatoric (irreducible randomness)
  - Guest behavior inherently unpredictable
  - Random events (weather fluctuations)
- Epistemic (model uncertainty)
  - Insufficient training data
  - Model capacity limitations
  - New scenarios not in training set
Confidence Scoring:
confidence_score = f(
    data_quality,            # How clean is historical data?
    training_data_volume,    # How much data available?
    similarity_to_training,  # How similar is prediction day to training days?
    model_agreement          # Do ensemble models agree?
)
Higher confidence → narrower intervals
Lower confidence → wider intervals
Transfer Learning
Cross-Location Learning:
For restaurant chains:
- Pre-train on data from all locations
- Fine-tune on individual location data
- Transfer patterns (seasonal, weather, events)
- Faster ramp-up for new locations
Benefits:
- New location predictions available immediately
- Higher initial accuracy
- Faster model convergence
- Shared learning across organization
Model Performance
Benchmarks
Accuracy vs Baselines:
| Method | MAPE | Notes |
|---|---|---|
| Eaternity Forecast | 12.8% | Transformer architecture |
| Human expert forecaster | 17.1% | Forecast's MAPE is about 25% lower |
| Previous week same-day | 22.4% | Naive baseline |
| 4-week average | 19.7% | Simple statistical baseline |
| ARIMA | 16.2% | Traditional time-series |
| LSTM | 14.1% | Recurrent neural network |
Confidence Interval Calibration:
Expected: 80% of actuals within [lower, upper]
Achieved: 78.5% (well-calibrated)
Computational Requirements
Training:
- GPU: NVIDIA RTX 4090 or equivalent
- RAM: 32 GB minimum
- Training time: 2-6 hours (depends on data volume)
- Storage: 5-20 GB per restaurant
Inference:
- CPU: Sufficient for real-time predictions
- RAM: 8 GB
- Latency: less than 100ms per prediction
- Storage: less than 1 GB for deployed model
Research Foundation
Academic References
Forecast builds on peer-reviewed research:
- Transformer Architecture
  - Vaswani et al. (2017) "Attention Is All You Need"
  - Original transformer paper for NLP
- Time Series Forecasting
  - Zhou et al. (2021) "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting"
  - Temporal transformers for forecasting
- Demand Forecasting
  - Taylor & Letham (2018) "Forecasting at Scale" (Facebook Prophet)
  - Industry-scale forecasting systems
- Quantile Regression
  - Koenker & Bassett (1978) "Regression Quantiles"
  - Foundational quantile regression theory
- Restaurant Operations
  - Miller et al. (2015) "Forecasting Restaurant Demand"
  - Domain-specific forecasting challenges
Future Improvements
Roadmap
Q2 2024: Attention visualization
- Show which historical days model focuses on
- Explain predictions to users
Q3 2024: Recipe-level forecasting
- Predict ingredient requirements directly
- Integration with inventory systems
Q4 2024: Causal inference
- Understand impact of menu changes before implementation
- Simulate promotional effects
2025: Multi-modal learning
- Incorporate social media sentiment
- Visual menu analysis
- Guest review sentiment
See Also
- Performance Study — Validation results and benchmarks
- Prediction Confidence — Understanding uncertainty
- Implementation Guide — Best practices for use
- API Reference — Technical integration details