feat: clean up examples and make finetune workflows config-aware #1

randyy179 wants to merge 4 commits into codex/fix-webui-python312-deps from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors the repository's example scripts and finetuning workflows to improve usability, maintainability, and clarity. By centralizing common utilities, introducing a robust configuration-management system, and enhancing output structures, it streamlines running predictions, finetuning models, and performing Qlib backtests. These changes aim to make the codebase more accessible for new users and more efficient for developers.

Highlights
Activity
Code Review
This pull request introduces a significant and valuable refactoring across the repository, moving to a flexible, file-based configuration system for finetuning workflows, cleaning up example scripts, and enhancing Qlib backtesting. While these changes improve structure and usability, they also introduce several critical security concerns. Specifically, insecure data handling and path traversal vulnerabilities are present, with Web UI changes exposing absolute file paths that are used without validation, leading to potential path traversal and SSRF. Furthermore, example and finetuning scripts exhibit path traversal in file writing and insecure deserialization via the pickle module. These security issues must be addressed by implementing strict input validation and using safer data serialization formats. Additionally, a suggestion for improving the robustness of the new configuration class has been noted.
```python
file_size = os.path.getsize(file_path)
data_files.append({
    'name': file,
    'path': file_path,
```
The load_data_files function returns absolute file paths to the client in the path field. These paths are subsequently used by the /api/load-data and /api/predict endpoints (which consume the output of this function) to read files using pandas.read_csv() and pandas.read_feather() without further validation. This design enables Path Traversal and SSRF, as a malicious client can provide arbitrary paths or URLs to the loading endpoints.
Recommendation: Modify the application to return only filenames to the client and reconstruct the full path on the server side using a trusted base directory and proper validation (e.g., using os.path.basename()).
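A minimal sketch of that recommendation (the `DATA_ROOT` location and the `resolve_data_file` helper name are assumptions, not part of this PR): keep only the basename of whatever the client sends, rebuild the path under a trusted directory, and reject anything that does not resolve inside it.

```python
import os
from pathlib import Path

# Hypothetical trusted base directory; the real app would point this at
# its data/ (or examples/data/) root.
DATA_ROOT = Path("data").resolve()

def resolve_data_file(filename: str) -> Path:
    """Map a client-supplied filename to a file under DATA_ROOT."""
    # Drop any directory components the client supplied ("../../x" -> "x").
    safe_name = os.path.basename(filename)
    candidate = (DATA_ROOT / safe_name).resolve()
    # Reject files that do not exist or that escape the trusted root.
    if not candidate.is_file() or DATA_ROOT not in candidate.parents:
        raise ValueError(f"Invalid data file: {filename!r}")
    return candidate
```

The loading endpoints would then call `pandas.read_csv(resolve_data_file(name))` instead of trusting the client-supplied path directly.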
```python
prefix = f"pred_{symbol.replace('.', '_')}_{model_key}_{mode}"
csv_path = OUTPUT_ROOT / f"{prefix}_data.csv"
```
The symbol parameter is used to construct file paths for output artifacts (CSV, PNG, and JSON files) without sufficient sanitization. While dots are replaced with underscores, path separators such as / or \ are not handled. An attacker could provide a malicious symbol value (e.g., ../../etc/passwd) to perform a path traversal attack and write files to arbitrary locations on the system.
Recommendation: Sanitize the symbol input by removing or replacing all path-sensitive characters. A common approach is to allow only alphanumeric characters, hyphens, and underscores.
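One way to implement that allowlist (the `sanitize_symbol` name is illustrative, not from the PR): replace every character outside `[A-Za-z0-9_-]` before the symbol is embedded in a filename, and reject inputs that contain nothing usable.

```python
import re

def sanitize_symbol(symbol: str) -> str:
    """Keep only characters that are safe to embed in a filename."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", symbol)
    # Refuse inputs that were nothing but separators/dots, e.g. "..".
    if not cleaned.strip("_-"):
        raise ValueError(f"Invalid symbol: {symbol!r}")
    return cleaned
```

With this in place, `"600519.SH"` still maps to the `600519_SH` prefix the example already produces, while a payload like `../../etc/passwd` can no longer introduce path separators.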
```python
test_data_path = Path(run_config["data_path"]) / "test_data.pkl"
print(f"Loading test data from {test_data_path}...")
with test_data_path.open("rb") as handle:
    test_data = pickle.load(handle)
```
The script uses pickle.load() to deserialize data from a file path (test_data.pkl) located in a directory defined in the configuration. Since the configuration can be overridden by a user-provided JSON file via the --config-file argument, an attacker could potentially point the script to a malicious pickle file, leading to Remote Code Execution (RCE).
Recommendation: Avoid using pickle for loading data from potentially untrusted sources. Consider using safer alternatives like json or parquet. If pickle must be used, ensure that the files are loaded from a secure, restricted directory and that their integrity is verified.
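As a sketch of the safer-format alternative (assuming the test split is a plain `pandas.DataFrame`; the helper names are illustrative): pandas' JSON round-trip deserializes data without ever executing code, unlike `pickle.load`.

```python
import pandas as pd

def save_test_data(df: pd.DataFrame, path: str) -> None:
    # orient="split" preserves index, columns, and values separately.
    df.to_json(path, orient="split")

def load_test_data(path: str) -> pd.DataFrame:
    # Parsing JSON cannot trigger code execution, unlike unpickling.
    return pd.read_json(path, orient="split")
```

Parquet (`to_parquet`/`read_parquet`) would be the better choice for large numeric test sets, at the cost of a `pyarrow` dependency; pickle should only remain for objects that genuinely cannot be represented in such formats.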
```python
"n_train_iter": 2000 * 50,
"n_val_iter": 400 * 50,
```
These iteration counts are calculated using the hardcoded default batch_size of 50. If a user provides a config override for batch_size, n_train_iter and n_val_iter will not be updated automatically, which could be unexpected. The previous implementation where these were derived from self.batch_size was more robust.
To restore this dynamic behavior while keeping the new flexible config structure, you could calculate these values in _refresh_derived_fields if they haven't been explicitly overridden by the user. A simple way to do this is to set their default values to None and then compute them if they are None during refresh.
For example, in `DEFAULTS`:

```python
"n_train_iter": None,
"n_val_iter": None,
```

And in `_refresh_derived_fields()`:

```python
if self.n_train_iter is None:
    self.n_train_iter = 2000 * self.batch_size
if self.n_val_iter is None:
    self.n_val_iter = 400 * self.batch_size
```

This would make the configuration more intuitive and less error-prone for users who only want to adjust the batch size.
Pull request overview
This PR reorganizes repository demos and finetuning/backtesting workflows so example runs produce structured artifacts and finetune scripts can be driven by JSON config overrides while preserving the repository’s default settings.
Changes:
- Refactors `examples/` around shared helpers, consistent output folders, and a new docs/experiment overview.
- Makes finetune + Qlib scripts accept `--config-file` JSON overrides and improves Qlib backtest output persistence (CSV/JSON/PNG).
- Updates the Web UI to surface example datasets in addition to the root `data/` directory.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| webui/app.py | Load available datasets from both data/ and examples/data/ and include source_dir metadata. |
| webui/README.md | Adds clearer dependency installation instructions for Web UI usage. |
| finetune/train_tokenizer.py | Makes comet_ml optional and adds --config-file override support; passes config into dataset construction. |
| finetune/train_predictor.py | Makes comet_ml optional and adds --config-file override support; passes config into dataset construction. |
| finetune/qlib_test.py | Saves structured backtest artifacts (curves, metrics, summary, plot) and adds a more configurable CLI. |
| finetune/qlib_data_preprocess.py | Adds --config-file override support for dataset preprocessing runs. |
| finetune/local_configs/qlib_smoke_backtest_small.example.json | Provides a small-model smoke backtest config example. |
| finetune/local_configs/qlib_smoke_backtest_base.example.json | Provides a base-model smoke backtest config example. |
| finetune/dataset.py | Allows injecting a Config object into QlibDataset instead of always using defaults. |
| finetune/config.py | Reworks config into defaults + JSON override + explicit overrides with derived field refresh and serialization. |
| examples/common.py | Adds shared helpers for paths, device detection, model specs, and JSON-safe artifact writing. |
| examples/prediction_example.py | Refactors into a CLI-driven example that saves CSV/PNG/JSON artifacts via shared helpers. |
| examples/prediction_wo_vol_example.py | Refactors OHLC-only example to save artifacts and use shared helpers. |
| examples/prediction_batch_example.py | Refactors batch example with CLI flags and per-window output artifacts. |
| examples/prediction_cn_markets_day.py | Adds explicit forward/eval modes, better error handling, and structured outputs. |
| examples/README.md | Documents example run order, model selection, outputs, and optional akshare dependency. |
| README.md | Links to the new examples/README.md and EXPERIMENTS.md. |
| EXPERIMENTS.md | Adds a high-level map of demos/tests/finetuning/backtesting workflows and validation sequence. |
```diff
  print(f"[Rank {rank}] Creating distributed dataloaders...")
- train_dataset = QlibDataset('train')
- valid_dataset = QlibDataset('val')
+ dataset_config = Config(overrides=config)
+ train_dataset = QlibDataset('train', config=dataset_config)
+ valid_dataset = QlibDataset('val', config=dataset_config)
```
create_dataloaders rebuilds a Config via Config(overrides=config), but main() passes config_instance.to_dict() as config. Config.to_dict() includes derived fields like backtest_benchmark, which are not present until _refresh_derived_fields() runs, so apply_overrides() will raise AttributeError("Unknown config key: backtest_benchmark") and the training script will crash before building datasets. Consider passing a Config object through instead of a dict, filtering override keys to Config.DEFAULTS, or initializing derived keys before apply_overrides / allowing derived keys in apply_overrides.
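A sketch of the filtering option (assuming, as in this PR, that `Config` exposes a `DEFAULTS` mapping of its known non-derived keys; the `rebuild_config` helper name is an assumption): drop any derived keys from the serialized dict before handing it back to `Config(overrides=...)`.

```python
def rebuild_config(config_dict, config_cls):
    """Rebuild a Config from a to_dict() dump, ignoring derived keys.

    Derived fields (e.g. backtest_benchmark) appear in to_dict() but are
    not valid override keys, so they are filtered out here.
    """
    known = set(config_cls.DEFAULTS)
    filtered = {k: v for k, v in config_dict.items() if k in known}
    return config_cls(overrides=filtered)
```

Passing the `Config` instance itself through `create_dataloaders` would be the cleaner fix, since it avoids the dict round-trip entirely; the filter above is the smaller change if the dict interface must stay.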
```diff
  print(f"[Rank {rank}] Creating distributed dataloaders...")
- train_dataset = QlibDataset('train')
- valid_dataset = QlibDataset('val')
+ dataset_config = Config(overrides=config)
+ train_dataset = QlibDataset('train', config=dataset_config)
+ valid_dataset = QlibDataset('val', config=dataset_config)
```
Same issue as train_predictor.py: Config(overrides=config) is fed a dict from config_instance.to_dict(), which includes derived keys (e.g. backtest_benchmark) that are not defined until _refresh_derived_fields() runs. This will trigger AttributeError("Unknown config key: backtest_benchmark") in Config.apply_overrides() and prevent the dataloaders from being created. Prefer passing the Config instance through, or filter override keys to known defaults / support derived keys in apply_overrides.
```python
"epochs": 30,
"log_interval": 100,
"batch_size": 50,
"n_train_iter": 2000 * 50,
"n_val_iter": 400 * 50,
```
n_train_iter and n_val_iter are now hard-coded as 2000 * 50 and 400 * 50. Previously they scaled with batch_size (e.g. 2000 * self.batch_size), and QlibDataset uses these values to define dataset length. If a user overrides batch_size, the iteration counts will no longer scale accordingly, changing training behavior unexpectedly. Consider recomputing these in _refresh_derived_fields() based on the current batch_size unless they were explicitly overridden.
Superseded by upstream PR shiyu-coder#225, which is now based directly on upstream master.
Summary
Testing
Notes
master, rebase this branch onto `upstream/master` and open the upstream PR against `master`.