Loading Datasets
This section explains how to load and save datasets, what datasets are available and how to register and implement your own data loader.
RecSysDataSet Class
The core of the framework's data model is the RecSysDataSet class. This generic class provides a unified interface for handling different types of recommendation system datasets.
The RecSysDataSet can contain one of three different variants of data:
- RawData: Contains a single pandas DataFrame with all interactions
- SplitData: Contains train, validation, and test DataFrames
- FoldedData: Contains multiple folds, each with their own train/validation/test splits
Data Structure
All datasets follow a standardized column structure:
user: User identifier (normalized to integers starting from 0)item: Item identifier (normalized to integers starting from 0)rating: Rating value (explicit feedback) or 1 (implicit feedback)timestamp: Unix timestamp of the interaction
Using Built-in Data Loaders
The recommended way to load datasets is using the use_dataloader method with registered data loaders:
from omnirec import RecSysDataSet
from omnirec.data_loaders.datasets import DataSet
# Load MovieLens 100K dataset
dataset = RecSysDataSet.use_dataloader(DataSet.MovieLens100K)
# Force re-download and re-canonicalization
dataset = RecSysDataSet.use_dataloader(
DataSet.MovieLens100K,
force_download=True,
force_canonicalize=True
)
# Specify custom paths
dataset = RecSysDataSet.use_dataloader(
DataSet.MovieLens100K,
raw_dir="/path/to/raw/data",
canon_path="/path/to/canonicalized/data.csv"
)
The data loading process includes: 1. Download: Raw data is downloaded if not already present 2. Canonicalization: Data is cleaned and standardized: - Duplicate interactions are removed (keeping the latest) - User and item identifiers are normalized to consecutive integers - Data is saved in a standardized CSV format
Dataset Statistics
You can get basic statistics about loaded datasets:
# Get number of interactions
num_interactions = dataset.num_interactions()
# Get rating range (for explicit feedback datasets)
min_rating = dataset.min_rating()
max_rating = dataset.max_rating()
Saving and Loading Datasets
Save any RecSysDataSet object to a compressed .rsds file with the save() function:
# Save to file (extension .rsds will be added automatically)
dataset.save("my_dataset")
# Or specify full path
dataset.save("/path/to/my_dataset.rsds")
The save format preserves: - All data variants (Raw, Split, or Folded) - Metadata about the dataset - Version information for compatibility
You can load previously saved datasets with the load() function:
Data Variants
RawData
Contains all interactions in a single DataFrame:
SplitData
Contains separate train, validation, and test sets:
# Access individual splits
train_df = dataset._data.get("train")
val_df = dataset._data.get("val")
test_df = dataset._data.get("test")
FoldedData
Contains multiple folds for cross-validation:
# Access folds
for fold_idx, split_data in dataset._data.folds.items():
print(f"Fold {fold_idx}:")
print(f" Train: {len(split_data.train)} interactions")
print(f" Val: {len(split_data.val)} interactions")
print(f" Test: {len(split_data.test)} interactions")
Loading Custom Datasets
Creating Custom Data Loaders
To implement a custom data loader, create a class that inherits from Loader and implement the info() and load() function:
from pathlib import Path
import pandas as pd
from omnirec.data_loaders.base import Loader, DatasetInfo
from omnirec.data_loaders.registry import register_dataloader
class MyCustomLoader(Loader):
@staticmethod
def info(name: str) -> DatasetInfo:
return DatasetInfo(
download_urls="https://example.com/dataset.zip",
checksum="sha256_checksum_here" # Optional but recommended
)
@staticmethod
def load(source_dir: Path, name: str) -> pd.DataFrame:
# Implement your loading logic here
# Must return DataFrame with columns: user, item, rating, timestamp
df = pd.read_csv(source_dir / "data.csv")
# Process and return standardized DataFrame
return df
# Register the loader
register_dataloader("MyDataset", MyCustomLoader)
# Now you can use it
dataset = RecSysDataSet.use_dataloader("MyDataset")
Loader Registration
You can register loaders under multiple names using register_dataloader():
# Register under multiple names
register_dataloader(["Dataset1", "Dataset2", "AliasName"], MyCustomLoader)
DatasetInfo
The DatasetInfo class provides metadata about your dataset:
download_urls: URL(s) to download the dataset (string or list of strings)checksum: Optional SHA256 checksum for integrity verification
If multiple URLs are provided, they are tried in order until one succeeds.