Dataset Management
omnirec.recsys_data_set.RecSysDataSet(data: Optional[T] = None, meta: Optional[DatasetMeta] = None)
Bases: Generic[T]
Source code in src\omnirec\recsys_data_set.py
meta: DatasetMeta
property
Return a shallow copy of the dataset metadata.
lineage: tuple[Trace, ...]
property
Return the recorded preprocessing lineage as a read-only snapshot.
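The difference between a shallow copy and a read-only snapshot can be easy to miss, so here is a minimal sketch. `ToyMeta` and `ToyDataSet` are hypothetical stand-ins written for this illustration. They are not omnirec classes, and the real `DatasetMeta` and `Trace` fields are not shown on this page.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyMeta:
    # Hypothetical stand-in for DatasetMeta; the real fields are assumptions.
    name: str
    tags: list[str] = field(default_factory=list)


class ToyDataSet:
    # Hypothetical stand-in for RecSysDataSet, for illustration only.
    def __init__(self, meta: ToyMeta) -> None:
        self._meta = meta
        self._lineage: list[str] = []

    @property
    def meta(self) -> ToyMeta:
        # Shallow copy: a new top-level object whose nested values are shared.
        return copy.copy(self._meta)

    @property
    def lineage(self) -> tuple[str, ...]:
        # A tuple has no append, and later recordings do not reach old snapshots.
        return tuple(self._lineage)

    def record(self, step: str) -> None:
        self._lineage.append(step)


ds = ToyDataSet(ToyMeta(name="toy", tags=["implicit"]))

m = ds.meta
m.name = "renamed"       # rebinding a field changes only the copy
m.tags.append("shared")  # mutating a nested list also changes the original

before = ds.lineage
ds.record("filter")
after = ds.lineage
```

In this sketch `ds.meta.name` is still `"toy"`, while the appended tag is visible through `ds.meta.tags`. That is the usual caveat of a shallow copy: nested mutable values are shared.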
format_lineage(details: bool = False) -> str
Render the dataset lineage in either compact or detailed form.
format_details(include_lineage: bool = True, lineage_details: bool = False) -> str
Render a human-readable summary of the dataset and its provenance.
use_dataloader(data_set: DataSet | str, raw_dir: Optional[PathLike | str] = None, canon_path: Optional[PathLike | str] = None, force_download=False, force_canonicalize=False) -> RecSysDataSet[RawData]
staticmethod
Loads a dataset using a registered DataLoader. If the dataset has not already been downloaded and canonicalized, both steps are performed first. Canonicalization drops duplicates, normalizes identifiers, and saves the data in a standardized format.
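To make the canonicalization step concrete, here is a minimal sketch and not omnirec's implementation. It assumes that a duplicate is a repeated (user, item) pair, that the first occurrence is kept, and that normalizing identifiers means mapping raw IDs to contiguous integers. The function name and row layout are hypothetical.

```python
def canonicalize(rows):
    """Drop repeated (user, item) pairs and map raw IDs to contiguous integers.

    Illustrative only: the duplicate rule and ID scheme are assumptions.
    """
    user_ids, item_ids = {}, {}
    seen = set()
    out = []
    for user, item, rating in rows:
        key = (user, item)
        if key in seen:
            continue  # keep only the first occurrence of each pair
        seen.add(key)
        # setdefault assigns the next free integer the first time an ID appears
        u = user_ids.setdefault(user, len(user_ids))
        i = item_ids.setdefault(item, len(item_ids))
        out.append((u, i, rating))
    return out


raw = [
    ("u7", "a", 5.0),
    ("u7", "a", 5.0),  # exact duplicate, dropped
    ("u3", "b", 3.0),
    ("u7", "b", 4.0),
]
canon = canonicalize(raw)
# canon == [(0, 0, 5.0), (1, 1, 3.0), (0, 1, 4.0)]
```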
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_set | DataSet \| str | The name of the dataset from the DataSet enum. Must be a registered DataLoader name. | required |
| raw_dir | Optional[PathLike \| str] | Target directory where the raw data is stored. If not provided, the data is downloaded to the default raw data directory (_DATA_DIR). | None |
| canon_path | Optional[PathLike \| str] | Path where the canonicalized data should be saved. If not provided, the data is saved to the default canonicalized data directory (_DATA_DIR / "canon"). | None |
| force_download | bool | If True, forces re-downloading of the raw data even if it already exists. | False |
| force_canonicalize | bool | If True, forces re-canonicalization of the data even if a canonicalized file exists. | False |
Returns:
| Type | Description |
|---|---|
| RecSysDataSet[RawData] | The loaded dataset in canonicalized RawData format. |
Example
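The sketch below illustrates the caching behavior that `force_download` and `force_canonicalize` describe. It does not call omnirec. `fetch_raw`, `canonicalize`, `load_cached`, and the file names are hypothetical helpers that write to a temporary directory instead of downloading anything. Whether a forced download also triggers re-canonicalization in the library is not stated here, and in this sketch it does not.

```python
import tempfile
from pathlib import Path

calls = []  # records which steps actually ran


def fetch_raw(raw_file: Path) -> None:
    # Stand-in for a download: writes a tiny raw file with one duplicate row.
    calls.append("download")
    raw_file.write_text("u1,i1\nu1,i1\nu2,i1\n")


def canonicalize(raw_file: Path, canon_file: Path) -> None:
    # Stand-in for canonicalization: drops duplicate lines, keeps order.
    calls.append("canonicalize")
    lines = list(dict.fromkeys(raw_file.read_text().splitlines()))
    canon_file.write_text("\n".join(lines) + "\n")


def load_cached(raw_dir: Path, canon_path: Path,
                force_download: bool = False,
                force_canonicalize: bool = False) -> list[str]:
    raw_file = raw_dir / "raw.csv"
    # Each step runs only if its output is missing or the caller forces it.
    if force_download or not raw_file.exists():
        fetch_raw(raw_file)
    if force_canonicalize or not canon_path.exists():
        canonicalize(raw_file, canon_path)
    return canon_path.read_text().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    canon = raw_dir / "canon.csv"
    first = load_cached(raw_dir, canon)    # runs both steps
    second = load_cached(raw_dir, canon)   # served from the cached files
    third = load_cached(raw_dir, canon, force_canonicalize=True)
```

After these three calls, `calls` holds one download and two canonicalization entries: the second call reuses both cached files, and the third repeats only the canonicalization step.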
save(file: str | PathLike)
Saves the RecSysDataSet object to a file with the default suffix .rsds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | str \| PathLike | The path where the file is saved. | required |
load(file: str | PathLike) -> RecSysDataSet[T]
staticmethod
Loads a RecSysDataSet object from a file with the .rsds suffix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | str \| PathLike | The path to the .rsds file. | required |
Returns:
| Type | Description |
|---|---|
| RecSysDataSet[T] | The loaded RecSysDataSet object. |
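A save and load round trip with a default suffix could look roughly like the sketch below. This is not omnirec's implementation. The use of `pickle`, the rule of appending `.rsds` when the suffix is missing, and the check on load are all assumptions made for illustration, and the module-level `save` and `load` functions here are hypothetical stand-ins for the methods documented above.

```python
import pickle
import tempfile
from pathlib import Path

SUFFIX = ".rsds"


def save(obj, file) -> Path:
    # Assumption: append the default suffix when the caller left it off.
    path = Path(file)
    if path.suffix != SUFFIX:
        path = path.with_name(path.name + SUFFIX)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return path


def load(file):
    # Assumption: refuse files that do not carry the expected suffix.
    path = Path(file)
    if path.suffix != SUFFIX:
        raise ValueError(f"expected a {SUFFIX} file, got {path.name!r}")
    with path.open("rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    payload = {"name": "toy", "rows": [(0, 0, 5.0)]}
    written = save(payload, Path(tmp) / "toy")  # stored as toy.rsds
    restored = load(written)
```

As with any pickle-based format, a file like this should only be loaded from a trusted source.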