Checkpointing and Results
OmniRec automatically saves experiment progress and results to enable fault tolerance and result persistence.
Checkpoint Directory Structure
The checkpoint directory organizes experiments hierarchically:
checkpoints/
├── progress.json # Global progress tracker
├── out.log # Runner stdout logs
├── err.log # Runner stderr logs
└── {dataset-name}-{hash}/ # Per dataset
└── {algorithm-name}-{hash}/ # Per algorithm configuration
├── predictions.json # Model predictions
├── fold_0/ # For cross-validation
│ └── predictions.json
├── fold_1/
│ └── predictions.json
└── ...
Key files:
progress.json: Tracks experiment phases (Fit, Predict, Eval, Done) for each configuration. Enables resuming interrupted experiments.predictions.json: Contains model predictions with columns:user,item,score,rank.out.log/err.log: Runner process output for debugging.
Hash-Based Organization
The Coordinator generates unique hashes for datasets and configurations:
- Dataset hash: Based on the number of interactions, ensuring identical datasets share the same checkpoint directory
- Configuration hash: Based on algorithm name and hyperparameters, ensuring identical configurations are deduplicated
This hash-based system enables efficient caching and prevents redundant computation.
Resuming Experiments
If an experiment is interrupted, simply run it again with the same configuration:
from omnirec.util.run import run_omnirec
# First run - interrupted during training
run_omnirec(datasets=dataset, plan=plan, evaluator=evaluator)
# Second run - automatically resumes from last checkpoint
run_omnirec(datasets=dataset, plan=plan, evaluator=evaluator)
The framework automatically:
- Loads the progress tracker from
progress.json - Skips completed phases (Fit, Predict, Eval)
- Continues from the last incomplete phase
- Reuses existing predictions instead of retraining
Progress Phases
Each experiment goes through four phases:
- Fit: Train the model on training data
- Predict: Generate predictions on test data
- Eval: Compute metrics on predictions
- Done: Experiment complete
The progress.json file tracks the current phase for each experiment configuration. If interrupted, the next run resumes from the last incomplete phase.
Cross-Validation Support
For cross-validation experiments (FoldedData), progress is tracked per fold:
- Each fold goes through all phases independently
- The progress tracker maintains the current fold number
- Interrupted experiments resume from the incomplete fold
Result Format
Results are displayed in formatted tables showing:
- Algorithm: The algorithm name and configuration hash
- Fold: Cross-validation fold number (if applicable, hidden if single split)
- Metrics: All computed metrics at specified k values (e.g., NDCG@10, HR@20)
Example Output
MovieLens100K-a3f8e2c1: Evaluation Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Algorithm ┃ NDCG@10 ┃ NDCG@20 ┃ HR@10 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ LensKit.ItemKNN-b7d4a9 │ 0.3245 │ 0.3891 │ 0.6142 │
│ RecBole.BPR-c8e5f1a3 │ 0.3156 │ 0.3802 │ 0.5987 │
└────────────────────────────┴─────────┴─────────┴─────────┘
For cross-validation results with multiple folds:
MovieLens100K-a3f8e2c1: Evaluation Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
┃ Algorithm ┃ Fold ┃ NDCG@10 ┃ HR@10 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
│ LensKit.ItemKNN-b7d4a9 │ 0 │ 0.3201 │ 0.6089 │
│ │ 1 │ 0.3289 │ 0.6195 │
│ │ 2 │ 0.3245 │ 0.6142 │
└──────────────────────────┴──────┴───────────┴─────────┘
Caching and Deduplication
Experiments are cached based on dataset and configuration hashes. If you run the same experiment multiple times:
- The coordinator detects identical configurations via hash comparison
- Skips redundant computation (all phases marked as Done)
- Reuses cached predictions and results
This ensures efficient experimentation and prevents accidental duplication of expensive training runs.
When Caching Triggers
Caching activates when: - Same dataset (same data and number of interactions) - Same algorithm and hyperparameters - Same preprocessing pipeline
Even with different variable names or execution contexts, identical experiments are recognized and deduplicated.
Log Files
stdout (out.log)
Contains standard output from runner processes, including:
- Algorithm initialization messages
- Training progress
- Model information
- Framework-specific logs
stderr (err.log)
Contains error output from runner processes, including:
- Warning messages
- Error traces
- Framework warnings
- Debugging information
Both log files are appended to across multiple experiment runs, providing a complete history of runner activity.
Debugging
If experiments fail:
- Check
err.logfor error messages - Review
out.logfor algorithm output - Examine
progress.jsonto see which phase failed - Verify dataset preprocessing and splitting