Checkpointing and Results

OmniRec automatically saves experiment progress and results, so interrupted experiments can be resumed and completed results persist across runs.

Checkpoint Directory Structure

The checkpoint directory organizes experiments hierarchically:

checkpoints/
├── progress.json                       # Global progress tracker
├── out.log                             # Runner stdout logs
├── err.log                             # Runner stderr logs
└── {dataset-name}-{hash}/              # Per dataset
    └── {algorithm-name}-{hash}/        # Per algorithm configuration
        ├── predictions.json            # Model predictions
        ├── fold_0/                     # For cross-validation
        │   └── predictions.json
        ├── fold_1/
        │   └── predictions.json
        └── ...

Key files:

  • progress.json: Tracks experiment phases (Fit, Predict, Eval, Done) for each configuration. Enables resuming interrupted experiments.
  • predictions.json: Contains model predictions with the columns user, item, score, and rank (see the loading sketch after this list).
  • out.log / err.log: Runner process output for debugging.
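
Since predictions.json is a plain columnar file, it can be inspected directly. The following is a minimal sketch for loading one predictions file with pandas, assuming the file is stored in a pandas-readable JSON orientation; the hash components in the path are illustrative placeholders, not real values.

import pandas as pd

# Hypothetical checkpoint path; the dataset and configuration hashes
# are placeholders for illustration only.
path = "checkpoints/MovieLens100K-a3f8e2c1/LensKit.ItemKNN-b7d4a9/predictions.json"

# predictions.json holds the columns: user, item, score, rank.
preds = pd.read_json(path)

# Top-10 recommendations for one (hypothetical) user, ordered by rank.
top10 = preds[preds["user"] == 42].sort_values("rank").head(10)
print(top10[["item", "score", "rank"]])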

Hash-Based Organization

The Coordinator generates unique hashes for datasets and configurations:

  • Dataset hash: Based on the number of interactions, ensuring identical datasets share the same checkpoint directory
  • Configuration hash: Based on algorithm name and hyperparameters, ensuring identical configurations are deduplicated

This hash-based system enables efficient caching and prevents redundant computation.
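
The exact hashing scheme is internal to the Coordinator, but the idea can be illustrated with a short sketch. Everything below (the hash_config helper, the use of MD5, the eight-character truncation) is an assumption for illustration, not OmniRec's actual implementation:

import hashlib
import json

def hash_config(algorithm: str, params: dict) -> str:
    # Serialize the hyperparameters deterministically (sorted keys) so
    # that identical configurations always produce the same hash.
    payload = json.dumps({"algorithm": algorithm, "params": params}, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()[:8]

# Two logically identical configurations map to the same directory name,
# regardless of the order in which hyperparameters were specified.
a = hash_config("ItemKNN", {"k": 20, "min_sim": 0.0})
b = hash_config("ItemKNN", {"min_sim": 0.0, "k": 20})
assert a == b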

Resuming Experiments

If an experiment is interrupted, simply run it again with the same configuration:

from omnirec.util.run import run_omnirec

# First run - interrupted during training
run_omnirec(datasets=dataset, plan=plan, evaluator=evaluator)

# Second run - automatically resumes from last checkpoint
run_omnirec(datasets=dataset, plan=plan, evaluator=evaluator)

The framework automatically:

  1. Loads the progress tracker from progress.json
  2. Skips completed phases (Fit, Predict, Eval)
  3. Continues from the last incomplete phase
  4. Reuses existing predictions instead of retraining

Progress Phases

Each experiment goes through four phases:

  1. Fit: Train the model on training data
  2. Predict: Generate predictions on test data
  3. Eval: Compute metrics on predictions
  4. Done: Experiment complete

The progress.json file tracks the current phase for each experiment configuration. If interrupted, the next run resumes from the last incomplete phase.
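
To see where an interrupted experiment stopped, the tracker file can be read directly. This is a minimal sketch; the key/value layout shown in the comment is an assumption about progress.json, not a documented format:

import json

with open("checkpoints/progress.json") as f:
    progress = json.load(f)

# Assumed shape, e.g.:
# {"MovieLens100K-a3f8e2c1/LensKit.ItemKNN-b7d4a9": "Predict", ...}
for experiment, phase in progress.items():
    print(f"{experiment}: {phase}")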

Cross-Validation Support

For cross-validation experiments (FoldedData), progress is tracked per fold:

  • Each fold goes through all phases independently
  • The progress tracker maintains the current fold number
  • Interrupted experiments resume from the incomplete fold
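
The sketch below illustrates the resulting restart behavior (it is not OmniRec's internal code): folds already marked Done are skipped outright, and the interrupted fold restarts at its last incomplete phase.

# Assumed per-fold tracker state for a 3-fold experiment.
fold_progress = {0: "Done", 1: "Eval", 2: "Fit"}

for fold, phase in sorted(fold_progress.items()):
    if phase == "Done":
        print(f"fold_{fold}: cached, skipping")   # nothing to recompute
    else:
        print(f"fold_{fold}: resuming from {phase}")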

Result Format

Results are displayed in formatted tables showing:

  • Algorithm: The algorithm name and configuration hash
  • Fold: Cross-validation fold number (hidden when there is only a single split)
  • Metrics: All computed metrics at specified k values (e.g., NDCG@10, HR@20)

Example Output

MovieLens100K-a3f8e2c1: Evaluation Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Algorithm               ┃ NDCG@10 ┃ NDCG@20 ┃ HR@10  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ LensKit.ItemKNN-b7d4a9  │ 0.3245  │ 0.3891  │ 0.6142 │
│ RecBole.BPR-c8e5f1a3    │ 0.3156  │ 0.3802  │ 0.5987 │
└─────────────────────────┴─────────┴─────────┴────────┘

For cross-validation results with multiple folds:

MovieLens100K-a3f8e2c1: Evaluation Results
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Algorithm              ┃ Fold ┃ NDCG@10 ┃ HR@10  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ LensKit.ItemKNN-b7d4a9 │ 0    │ 0.3201  │ 0.6089 │
│                        │ 1    │ 0.3289  │ 0.6195 │
│                        │ 2    │ 0.3245  │ 0.6142 │
└────────────────────────┴──────┴─────────┴────────┘

Caching and Deduplication

Experiments are cached based on dataset and configuration hashes. If you run the same experiment multiple times:

  • The coordinator detects identical configurations via hash comparison
  • It skips redundant computation (all phases already marked as Done)
  • It reuses cached predictions and results

This ensures efficient experimentation and prevents accidental duplication of expensive training runs.

When Caching Triggers

Caching activates when:

  • Same dataset (same data and number of interactions)
  • Same algorithm and hyperparameters
  • Same preprocessing pipeline

Even with different variable names or execution contexts, identical experiments are recognized and deduplicated.
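
For example, rebinding the same objects to new names does not defeat the cache, since the hash is computed from the dataset contents and configuration values rather than from Python identifiers (a sketch using the run_omnirec call shown earlier):

from omnirec.util.run import run_omnirec

# Same underlying data and plan, different variable name.
my_dataset = dataset

run_omnirec(datasets=dataset, plan=plan, evaluator=evaluator)

# Recognized as the same experiment: identical hashes, all phases Done,
# so cached predictions and results are reused.
run_omnirec(datasets=my_dataset, plan=plan, evaluator=evaluator)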

Log Files

stdout (out.log)

Contains standard output from runner processes, including:

  • Algorithm initialization messages
  • Training progress
  • Model information
  • Framework-specific logs

stderr (err.log)

Contains error output from runner processes, including:

  • Warning messages
  • Error traces
  • Framework warnings
  • Debugging information

Both log files are appended to across multiple experiment runs, providing a complete history of runner activity.

Debugging

If experiments fail:

  1. Check err.log for error messages (see the snippet after this list)
  2. Review out.log for algorithm output
  3. Examine progress.json to see which phase failed
  4. Verify dataset preprocessing and splitting
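
A quick way to start with step 1 is to read the tail of the stderr log, for example:

from pathlib import Path

# Print the last 20 lines of the runner's stderr log.
err_log = Path("checkpoints/err.log")
print("\n".join(err_log.read_text().splitlines()[-20:]))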