
@cmclean commented Sep 13, 2025

Describe your changes

This PR adds the top-performing "BBKNN (TS)" method from our recent preprint [1]. I have verified that tests pass, and running the `run_full_local.sh` script on my own machine successfully recapitulates the numbers published in the preprint.

[1] Aygun et al., "An AI system to help scientists write expert-level empirical software," arXiv:2509.06503 (2025). https://arxiv.org/abs/2509.06503

Checklist before requesting a review

  • I have performed a self-review of my code

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI Tests succeed and look good!

@cmclean (Author) commented Sep 15, 2025

I have a couple of questions about the submission and testing:

  1. Should my new 'bbknn_ts' method also be registered in the following two places?
  • methods = [...] block of src/workflows/run_benchmark/main.nf (lines 10-46)
  • dependencies: block of src/workflows/run_benchmark/config.vsh.yaml (lines 79-125)
  2. (For my own edification) When I ran `run_full_local.sh`, the method completed successfully on all datasets and metrics except computing `kbet` on the `hypomap` dataset, where it OOMed. That was expected, since it also OOMed there on our internal evaluation harness, and many methods fail on that particular (dataset, metric) combination. To get the final `score_uns.yaml` file, though, I had to rerun while dropping the `kbet` metric for all datasets. Is there a way to skip just a specified set of (dataset, metric) jobs, or to ignore errors and finish the workflow with whatever was computed properly? As is, I was not able to validate identical `kbet` performance numbers, though given the identical results for all other metrics on all datasets, I am confident in its reproducibility.

Thanks!

@lazappi (Member) commented Sep 17, 2025

> 1. Should my new 'bbknn_ts' method also be registered in the following two places?
>    * `methods = [...]` block of src/workflows/run_benchmark/main.nf (lines 10-46)
>    * `dependencies:` block of src/workflows/run_benchmark/config.vsh.yaml (lines 79-125)

Yes, it needs to be added in both places to be included in the workflow. It sounds like you already ran the workflow, but I'm not sure it would include the new method without doing this.
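Something along these lines, mirroring the existing entries (a sketch only; copy the exact pattern already used by the surrounding entries in those files):

```groovy
// src/workflows/run_benchmark/main.nf -- sketch only; follow the pattern
// of the surrounding entries in the methods block
methods = [
    // ... existing methods ...
    bbknn_ts
]
```

```yaml
# src/workflows/run_benchmark/config.vsh.yaml -- sketch only
dependencies:
  # ... existing method components ...
  - name: methods/bbknn_ts
```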

> 2. (For my own edification) When I ran `run_full_local.sh`, the method completed successfully on all datasets and metrics except computing `kbet` on the `hypomap` dataset, where it OOMed. That was expected, since it also OOMed there on our internal evaluation harness, and many methods fail on that particular (dataset, metric) combination. To get the final `score_uns.yaml` file, though, I had to rerun while dropping the `kbet` metric for all datasets. Is there a way to skip just a specified set of (dataset, metric) jobs, or to ignore errors and finish the workflow with whatever was computed properly? As is, I was not able to validate identical `kbet` performance numbers, though given the identical results for all other metrics on all datasets, I am confident in its reproducibility.

When we do the full benchmark run on the cloud, any failed metrics are ignored (or, more accurately, given a score of zero). We don't usually do the full runs locally, so there might be some differences in the settings that cause it to not produce an output. Generally we wouldn't want to disable a metric just for a specific dataset/method.

@lazappi lazappi requested a review from mumichae September 17, 2025 12:55
@cmclean (Author) commented Sep 19, 2025

Thank you for the responses. I noticed that the drvi method addition (#61) did not modify the src/workflows/run_benchmark/ files, so I didn't register bbknn_ts there either, but I will in a follow-up commit in this PR.

Regarding my other question, this was ultimately just a vanilla Nextflow question; I've now figured out how to get `errorStrategy 'ignore'` into the right place. Thank you!
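For anyone hitting the same issue, this is the general shape of what I mean (a sketch only; the `withName` selector pattern below is illustrative and would need to match the actual `kbet` process name in the workflow):

```groovy
// nextflow.config -- illustrative sketch; the selector is a guess and
// must match the real process name for the kbet metric
process {
    withName: '.*kbet.*' {
        // ignore failures in matching tasks so the rest of the workflow
        // completes with whatever was computed successfully
        errorStrategy = 'ignore'
    }
}
```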

@cmclean (Author) commented Sep 22, 2025

Okay, this PR is ready for review. @mumichae please note that `src/methods/bbknn_ts/script.py` lines 18-226 do not require detailed review comments; this is an entirely LLM-generated function that we do not want to modify.

Collaborator commented:

Based on a quick glance, this code will likely work as the developer intended; however, it does not make use of the pre-computed processing step and recomputes everything by itself instead. This in itself won't cause the code to fail or give wrong results, but it won't follow the benchmark setup as intended.

@cmclean (Author) replied:

Can you elaborate a little on the comment that "it won't follow the benchmark setup as intended"? Is that because we indicate this is an embedding method with a preferred normalization of log_cp10k but use `layers/counts` as input rather than `layers/normalized`?

Would it be considered to follow the benchmark setup as intended if we indicated it is a [feature] method and set `adata_integrated.layers['corrected_counts'] = adata_integrated.X`? IIUC that could only improve its overall score on the v2.0.0 benchmark, since the HVG metric would then also be computed.
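Concretely, I mean something like this minimal sketch (purely illustrative; the `corrected_counts` slot name follows my reading of the benchmark's output spec, and the AnnData here is a stand-in for what the method already produces):

```python
import numpy as np
import anndata as ad

# Illustrative stand-in for the integrated AnnData the method produces;
# in the real script this comes from the LLM-generated integration code.
adata_integrated = ad.AnnData(X=np.random.rand(100, 50).astype(np.float32))

# Expose the integrated matrix as a 'feature'-type output in addition to
# the embedding, so feature-level metrics such as HVG can also be computed.
adata_integrated.layers["corrected_counts"] = adata_integrated.X
```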

I'd rather not edit the code to do that since right now it's purely LLM implemented, just want to make sure we're not missing something more fundamental. Thanks!

@cmclean cmclean requested a review from mumichae September 29, 2025 15:13