
feat: clean up examples and make finetune workflows config-aware #1

Closed
randyy179 wants to merge 4 commits into codex/fix-webui-python312-deps from codex/experiment-cleanup-commits

Conversation

@randyy179
Owner

Summary

  • reorganize the example scripts around shared helpers, saved artifacts, and clearer docs
  • make finetune scripts accept JSON config overrides and restore repository defaults for full runs
  • improve the Qlib backtest flow so runs save structured outputs instead of relying only on console logs
  • expose example datasets in the web UI for easier local demos

Testing

  • /bin/zsh -lc 'PYTHONPATH=. uv run pytest tests/test_kronos_regression.py'

Notes

Copilot AI review requested due to automatic review settings — March 10, 2026 21:04
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the repository's example scripts and finetuning workflows to improve usability, maintainability, and clarity. By centralizing common utilities, introducing a robust configuration management system, and enhancing output structures, it streamlines the process of running predictions, finetuning models, and performing Qlib backtests. These changes aim to make the codebase more accessible for new users and more efficient for developers.

Highlights

  • Example Script Reorganization: Reorganized example scripts around shared helpers, saved artifacts, and clearer documentation to improve usability and maintainability.
  • Config-Aware Workflows: Implemented a robust configuration management system for finetuning and Qlib scripts, allowing parameters to be overridden via JSON files or command-line arguments.
  • Enhanced Qlib Backtesting: Improved the Qlib backtest flow to save structured outputs, including detailed performance metrics and cumulative return curves, rather than relying solely on console logs.
  • Web UI Example Data Exposure: Exposed example datasets in the Web UI for easier local demonstrations and interactive inference.
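The config-override mechanism described in these highlights can be sketched as a small argparse-plus-JSON merge. This is a hypothetical illustration only — the DEFAULTS dict, the key validation, and the --config-file flag name are assumptions, not the repository's actual code:

```python
import argparse
import json

# Stand-in defaults; the real project keeps these in finetune/config.py.
DEFAULTS = {"model_size": "small", "batch_size": 50, "epochs": 30}

def load_config(argv=None):
    """Merge optional JSON-file overrides on top of the built-in defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", help="path to a JSON file with overrides")
    args = parser.parse_args(argv)

    config = dict(DEFAULTS)
    if args.config_file:
        with open(args.config_file) as fh:
            overrides = json.load(fh)
        unknown = set(overrides) - set(DEFAULTS)
        if unknown:  # reject typos early instead of silently ignoring them
            raise KeyError(f"Unknown config keys: {sorted(unknown)}")
        config.update(overrides)
    return config
```

With no --config-file the defaults are returned unchanged; a JSON file containing only the keys you want to change (e.g. `{"batch_size": 8}`) overrides just those keys.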


Changelog
  • EXPERIMENTS.md
    • Documented the four main experiment categories: Prediction Demos, Regression Tests, Fine-Tuning Pipelines, and Qlib Backtesting.
    • Provided a recommended validation sequence for the repository's workflows.
  • README.md
    • Added links to the new EXPERIMENTS.md and examples/README.md for better navigation.
  • examples/README.md
    • Provided a comprehensive guide for running example scripts, including model selection and output details.
    • Documented device auto-selection logic for Apple Silicon.
  • examples/common.py
    • Introduced examples/common.py to centralize utility functions for example scripts.
    • Implemented detect_device for automatic CUDA/MPS/CPU selection.
    • Added save_json and to_jsonable for structured output saving.
  • examples/prediction_batch_example.py
    • Converted script to use argparse for configurable model size and batch parameters.
    • Integrated examples/common.py for device detection, model specs, and output handling.
    • Enhanced output to save individual CSVs and a comprehensive JSON summary for each batch run.
  • examples/prediction_cn_markets_day.py
    • Introduced forward and eval modes for future prediction and historical replay with ground truth comparison.
    • Added command-line arguments for model selection, mode, and historical cutoff date.
    • Implemented saving of detailed JSON summaries with evaluation metrics (MAE, MSE) for 'eval' mode.
    • Improved data loading and cleaning logic for akshare data.
  • examples/prediction_example.py
    • Converted script to use argparse for configurable model size.
    • Integrated examples/common.py for device detection, model specs, and output handling.
    • Enhanced output to save a comprehensive JSON summary.
  • examples/prediction_wo_vol_example.py
    • Converted script to use argparse for configurable model size.
    • Integrated examples/common.py for device detection, model specs, and output handling.
    • Enhanced output to save a comprehensive JSON summary.
  • finetune/config.py
    • Refactored Config to use a DEFAULTS dictionary, allowing for easy configuration overrides via JSON files or direct arguments.
    • Added methods for applying overrides and refreshing derived fields.
    • Improved handling of finetuned_tokenizer_path and finetuned_predictor_path to be auto-derived unless explicitly set.
  • finetune/dataset.py
    • Modified QlibDataset constructor to take an optional Config instance, allowing dataset parameters to be dynamically set.
  • finetune/local_configs/qlib_smoke_backtest_base.example.json
    • Provided a sample configuration for Qlib backtesting with the base model, including specific time ranges and model identifiers.
  • finetune/local_configs/qlib_smoke_backtest_small.example.json
    • Provided a sample configuration for Qlib backtesting with the small model, including specific time ranges and model identifiers.
  • finetune/qlib_data_preprocess.py
    • Integrated the new Config class for managing data preprocessing parameters.
    • Added command-line argument to specify an optional JSON config override file.
  • finetune/qlib_test.py
    • Converted script to use argparse for configurable device, config file, model paths, and plot display.
    • Integrated the new Config class for managing all backtesting parameters.
    • Enhanced backtesting to save detailed performance metrics and cumulative return curves in CSV and JSON formats.
    • Added a save_json utility and analysis_frame_to_dict for structured output of Qlib analysis.
    • Improved QlibTestDataset to be config-aware.
  • finetune/train_predictor.py
    • Integrated the new Config class for managing training parameters.
    • Added command-line argument to specify an optional JSON config override file.
    • Made comet_ml import optional, printing a warning if not installed instead of failing.
  • finetune/train_tokenizer.py
    • Integrated the new Config class for managing training parameters.
    • Added command-line argument to specify an optional JSON config override file.
    • Made comet_ml import optional, printing a warning if not installed instead of failing.
  • webui/README.md
    • Added detailed installation instructions for Web UI dependencies.
    • Clarified dependency installation for shared environments.
  • webui/app.py
    • Expanded data file scanning in Web UI to include examples/data directory.
    • Added source_dir to data file metadata for better context.
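As a rough illustration of the save_json and to_jsonable helpers the changelog describes for examples/common.py — behavior assumed from the names, not the actual implementation:

```python
import json
from datetime import date, datetime
from pathlib import Path

def to_jsonable(value):
    """Recursively convert common non-JSON types into JSON-safe ones."""
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if hasattr(value, "item"):  # numpy scalars expose .item()
        return value.item()
    return value

def save_json(payload, path):
    """Write a JSON artifact, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(to_jsonable(payload), indent=2))
    return path
```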
Activity
  • No specific activity has been recorded for this pull request yet.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant and valuable refactoring across the repository: it moves the finetuning workflows to a flexible, file-based configuration system, cleans up the example scripts, and enhances Qlib backtesting. While these changes improve structure and usability, they also introduce several security concerns. The Web UI changes expose absolute file paths that are later consumed without validation, enabling path traversal and SSRF. The example and finetuning scripts allow path traversal when writing output files and perform insecure deserialization via the pickle module. These issues should be addressed with strict input validation and safer serialization formats. A suggestion for improving the robustness of the new configuration class is also included.

file_size = os.path.getsize(file_path)
data_files.append({
    'name': file,
    'path': file_path,


Severity: high (security)

The load_data_files function returns absolute file paths to the client in the path field. These paths are subsequently used by the /api/load-data and /api/predict endpoints (which consume the output of this function) to read files using pandas.read_csv() and pandas.read_feather() without further validation. This design enables Path Traversal and SSRF, as a malicious client can provide arbitrary paths or URLs to the loading endpoints.

Recommendation: Modify the application to return only filenames to the client and reconstruct the full path on the server side using a trusted base directory and proper validation (e.g., using os.path.basename()).
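A sketch of that recommendation — the TRUSTED_DATA_DIRS list and the function name are hypothetical; the real app would substitute its own base directories:

```python
import os

# Hypothetical allow-list of server-side data directories.
TRUSTED_DATA_DIRS = ["data", os.path.join("examples", "data")]

def resolve_data_file(name: str) -> str:
    """Map a client-supplied filename back to a vetted path on the server."""
    safe_name = os.path.basename(name)  # strips any directory components
    for base in TRUSTED_DATA_DIRS:
        candidate = os.path.join(base, safe_name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"Unknown data file: {safe_name!r}")
```

The client only ever sees and sends bare filenames; the server rebuilds the path and refuses anything outside its allow-listed directories.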

Comment on lines +185 to +186
prefix = f"pred_{symbol.replace('.', '_')}_{model_key}_{mode}"
csv_path = OUTPUT_ROOT / f"{prefix}_data.csv"


Severity: medium (security)

The symbol parameter is used to construct file paths for output artifacts (CSV, PNG, and JSON files) without sufficient sanitization. While dots are replaced with underscores, path separators such as / or \ are not handled. An attacker could provide a malicious symbol value (e.g., ../../etc/passwd) to perform a path traversal attack and write files to arbitrary locations on the system.

Recommendation: Sanitize the symbol input by removing or replacing all path-sensitive characters. A common approach is to allow only alphanumeric characters, hyphens, and underscores.
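One possible sanitizer along those lines (illustrative only; the function name and exact character policy are assumptions):

```python
import re

def sanitize_symbol(symbol: str) -> str:
    """Keep only characters that are safe to embed in a filename."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", symbol)
    if not cleaned.strip("_"):  # reject inputs that were nothing but junk
        raise ValueError(f"Symbol {symbol!r} has no usable characters")
    return cleaned
```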

test_data_path = Path(run_config["data_path"]) / "test_data.pkl"
print(f"Loading test data from {test_data_path}...")
with test_data_path.open("rb") as handle:
test_data = pickle.load(handle)


Severity: medium (security)

The script uses pickle.load() to deserialize data from a file path (test_data.pkl) located in a directory defined in the configuration. Since the configuration can be overridden by a user-provided JSON file via the --config-file argument, an attacker could potentially point the script to a malicious pickle file, leading to Remote Code Execution (RCE).

Recommendation: Avoid using pickle for loading data from potentially untrusted sources. Consider using safer alternatives like json or parquet. If pickle must be used, ensure that the files are loaded from a secure, restricted directory and that their integrity is verified.
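If pickle must be kept, an integrity-checked loader in the spirit of that recommendation might look like this sketch — the expected digest would have to come from a trusted manifest, not from the user-supplied config:

```python
import hashlib
import pickle
from pathlib import Path

def load_trusted_pickle(path, expected_sha256):
    """Load a pickle only after verifying its hash against a known-good digest."""
    data = Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"Checksum mismatch for {path}: got {digest}")
    # Safe only because the bytes were verified against a trusted digest above.
    return pickle.loads(data)
```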

Comment on lines +34 to +35
"n_train_iter": 2000 * 50,
"n_val_iter": 400 * 50,


Severity: medium

These iteration counts are calculated using the hardcoded default batch_size of 50. If a user provides a config override for batch_size, n_train_iter and n_val_iter will not be updated automatically, which could be unexpected. The previous implementation where these were derived from self.batch_size was more robust.

To restore this dynamic behavior while keeping the new flexible config structure, you could calculate these values in _refresh_derived_fields if they haven't been explicitly overridden by the user. A simple way to do this is to set their default values to None and then compute them if they are None during refresh.

For example:

  1. In DEFAULTS:

     "n_train_iter": None,
     "n_val_iter": None,

  2. In _refresh_derived_fields():

     if self.n_train_iter is None:
         self.n_train_iter = 2000 * self.batch_size
     if self.n_val_iter is None:
         self.n_val_iter = 400 * self.batch_size

This would make the configuration more intuitive and less error-prone for users who only want to adjust the batch size.
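A minimal self-contained sketch of the pattern the reviewer describes — names are modeled on the comment and are not the repository's actual Config class:

```python
class Config:
    DEFAULTS = {
        "batch_size": 50,
        "n_train_iter": None,  # derived from batch_size unless overridden
        "n_val_iter": None,
    }

    def __init__(self, overrides=None):
        for key, value in self.DEFAULTS.items():
            setattr(self, key, value)
        for key, value in (overrides or {}).items():
            if key not in self.DEFAULTS:
                raise AttributeError(f"Unknown config key: {key}")
            setattr(self, key, value)
        self._refresh_derived_fields()

    def _refresh_derived_fields(self):
        # Recompute iteration counts only when the user did not pin them.
        if self.n_train_iter is None:
            self.n_train_iter = 2000 * self.batch_size
        if self.n_val_iter is None:
            self.n_val_iter = 400 * self.batch_size
```

A user who overrides only batch_size gets scaled iteration counts; a user who explicitly sets n_train_iter keeps that exact value.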


Copilot AI left a comment


Pull request overview

This PR reorganizes repository demos and finetuning/backtesting workflows so example runs produce structured artifacts and finetune scripts can be driven by JSON config overrides while preserving the repository’s default settings.

Changes:

  • Refactors examples/ around shared helpers, consistent output folders, and new docs/experiment overview.
  • Makes finetune + Qlib scripts accept --config-file JSON overrides and improves Qlib backtest output persistence (CSV/JSON/PNG).
  • Updates the Web UI to surface example datasets in addition to the root data/ directory.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
webui/app.py Load available datasets from both data/ and examples/data/ and include source_dir metadata.
webui/README.md Adds clearer dependency installation instructions for Web UI usage.
finetune/train_tokenizer.py Makes comet_ml optional and adds --config-file override support; passes config into dataset construction.
finetune/train_predictor.py Makes comet_ml optional and adds --config-file override support; passes config into dataset construction.
finetune/qlib_test.py Saves structured backtest artifacts (curves, metrics, summary, plot) and adds a more configurable CLI.
finetune/qlib_data_preprocess.py Adds --config-file override support for dataset preprocessing runs.
finetune/local_configs/qlib_smoke_backtest_small.example.json Provides a small-model smoke backtest config example.
finetune/local_configs/qlib_smoke_backtest_base.example.json Provides a base-model smoke backtest config example.
finetune/dataset.py Allows injecting a Config object into QlibDataset instead of always using defaults.
finetune/config.py Reworks config into defaults + JSON override + explicit overrides with derived field refresh and serialization.
examples/common.py Adds shared helpers for paths, device detection, model specs, and JSON-safe artifact writing.
examples/prediction_example.py Refactors into a CLI-driven example that saves CSV/PNG/JSON artifacts via shared helpers.
examples/prediction_wo_vol_example.py Refactors OHLC-only example to save artifacts and use shared helpers.
examples/prediction_batch_example.py Refactors batch example with CLI flags and per-window output artifacts.
examples/prediction_cn_markets_day.py Adds explicit forward/eval modes, better error handling, and structured outputs.
examples/README.md Documents example run order, model selection, outputs, and optional akshare dependency.
README.md Links to the new examples/README.md and EXPERIMENTS.md.
EXPERIMENTS.md Adds a high-level map of demos/tests/finetuning/backtesting workflows and validation sequence.


Comment on lines 45 to +48
  print(f"[Rank {rank}] Creating distributed dataloaders...")
- train_dataset = QlibDataset('train')
- valid_dataset = QlibDataset('val')
+ dataset_config = Config(overrides=config)
+ train_dataset = QlibDataset('train', config=dataset_config)
+ valid_dataset = QlibDataset('val', config=dataset_config)

Copilot AI Mar 10, 2026


create_dataloaders rebuilds a Config via Config(overrides=config), but main() passes config_instance.to_dict() as config. Config.to_dict() includes derived fields like backtest_benchmark, which are not present until _refresh_derived_fields() runs, so apply_overrides() will raise AttributeError("Unknown config key: backtest_benchmark") and the training script will crash before building datasets. Consider passing a Config object through instead of a dict, filtering override keys to Config.DEFAULTS, or initializing derived keys before apply_overrides / allowing derived keys in apply_overrides.
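One of the suggested mitigations — filtering the override dict down to known defaults before rebuilding the Config — could be as simple as the following sketch (DEFAULTS here is a stand-in for Config.DEFAULTS):

```python
# Stand-in for Config.DEFAULTS; derived fields like backtest_benchmark
# are deliberately absent, which is what causes the reported crash.
DEFAULTS = {"batch_size": 50, "epochs": 30}

def filter_known_overrides(raw: dict) -> dict:
    """Drop derived/unknown keys so apply_overrides only sees known defaults."""
    return {k: v for k, v in raw.items() if k in DEFAULTS}
```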

Comment on lines 47 to +50
  print(f"[Rank {rank}] Creating distributed dataloaders...")
- train_dataset = QlibDataset('train')
- valid_dataset = QlibDataset('val')
+ dataset_config = Config(overrides=config)
+ train_dataset = QlibDataset('train', config=dataset_config)
+ valid_dataset = QlibDataset('val', config=dataset_config)

Copilot AI Mar 10, 2026


Same issue as train_predictor.py: Config(overrides=config) is fed a dict from config_instance.to_dict(), which includes derived keys (e.g. backtest_benchmark) that are not defined until _refresh_derived_fields() runs. This will trigger AttributeError("Unknown config key: backtest_benchmark") in Config.apply_overrides() and prevent the dataloaders from being created. Prefer passing the Config instance through, or filter override keys to known defaults / support derived keys in apply_overrides.

Comment on lines +31 to +35
"epochs": 30,
"log_interval": 100,
"batch_size": 50,
"n_train_iter": 2000 * 50,
"n_val_iter": 400 * 50,

Copilot AI Mar 10, 2026


n_train_iter and n_val_iter are now hard-coded as 2000 * 50 and 400 * 50. Previously they scaled with batch_size (e.g. 2000 * self.batch_size), and QlibDataset uses these values to define dataset length. If a user overrides batch_size, the iteration counts will no longer scale accordingly, changing training behavior unexpectedly. Consider recomputing these in _refresh_derived_fields() based on the current batch_size unless they were explicitly overridden.

@randyy179
Owner Author

Superseded by upstream PR shiyu-coder#225, which is now based directly on upstream master.

@randyy179 closed this Mar 10, 2026