Security Issue: Unsafe deserialization in QlibDataset #216

@Doria77486

Description

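The QlibDataset class loads training/validation data from pickle files (train_data.pkl or val_data.pkl) using Python's pickle.load. Because unpickling can invoke callables chosen by whoever created the file, this poses an arbitrary code execution risk (and potentially remote code execution, RCE, when datasets come from untrusted sources) if an attacker can supply a malicious .pkl file.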

class QlibDataset(Dataset):
    def __init__(self, data_type: str = 'train'):
        self.config = Config()
        # ... (other initialization code omitted)
        if data_type == 'train':
            # self.config.dataset_path = "./data/processed_datasets"
            self.data_path = f"{self.config.dataset_path}/train_data.pkl"
            self.n_samples = self.config.n_train_iter
        else:
            self.data_path = f"{self.config.dataset_path}/val_data.pkl"
            self.n_samples = self.config.n_val_iter

        with open(self.data_path, 'rb') as f:
            self.data = pickle.load(f)

PoC

train_dataset = QlibDataset(data_type='train')

By placing a malicious train_data.pkl inside ./data/processed_datasets/, simply instantiating QlibDataset will execute arbitrary code.
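The mechanism behind this can be illustrated with a deliberately harmless sketch. The Demo class and its message are hypothetical and not part of the project; the point is only that pickle.loads calls whatever callable the file's author stored via __reduce__, here the builtin print.

```python
import pickle


class Demo:
    """Hypothetical stand-in for an attacker-controlled object (harmless payload)."""

    def __reduce__(self):
        # pickle records "call print(<message>)" and replays that call on load.
        # A real attacker would substitute a dangerous callable here.
        return (print, ("code ran during unpickling",))


# Serialize the object; the bytes now contain a reference to builtins.print.
payload = pickle.dumps(Demo())

# Loading the bytes invokes the stored callable instead of rebuilding a Demo.
result = pickle.loads(payload)
```

Note that result is the return value of the stored call, not a Demo instance: the loader has no say in what gets executed.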

Security Impact

If users download or reuse untrusted or third-party datasets, a malicious pickle file can lead to arbitrary code execution at dataset loading time.
This is especially dangerous in scenarios where datasets are shared, downloaded automatically, or reused across environments.

Recommendation

  • If pickle must be used, consider adding a clear security warning indicating that only trusted datasets should be loaded.
  • Consider introducing a mechanism similar to trust_code (e.g., a trust_dataset or trust_pickle flag) that requires users to explicitly acknowledge the risk before loading pickle-based datasets.
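One possible shape for these recommendations is sketched below. The helper name load_pickle_dataset and the trust_pickle parameter are assumptions made for illustration, not an existing API of the project; the sketch refuses to unpickle unless the caller opts in, and emits a warning when it does load.

```python
import pickle
import warnings
from pathlib import Path


def load_pickle_dataset(path, trust_pickle=False):
    """Load a pickled dataset only after the caller explicitly opts in.

    Hypothetical helper: the name and the trust_pickle flag are illustrative.
    """
    if not trust_pickle:
        # Fail closed: no bytes are read from the file without the opt-in.
        raise ValueError(
            f"Refusing to unpickle {path}: pickle files can execute arbitrary "
            "code when loaded. Pass trust_pickle=True only for datasets you "
            "created yourself or fully trust."
        )
    # Remind the user of the risk even after they have opted in.
    warnings.warn(
        f"Loading {path} with pickle; only load datasets from trusted sources.",
        stacklevel=2,
    )
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

In QlibDataset, the direct pickle.load call could then be replaced by something like load_pickle_dataset(self.data_path, trust_pickle=...), with the flag supplied by the user (for example through a constructor argument). The flag does not make unpickling safe; it only makes the trust decision explicit.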
