Add documented CSV data formats and examples #17

Merged. choutkaj merged 9 commits into main from feature/data-format-imports on May 11, 2026.

@choutkaj (Owner)

Summary

This PR adds the two official user-facing CSV input formats for DoseResponseData:

  • long: the canonical long-form format used internally by BindCurve.
  • replicate_wide: a spreadsheet-friendly format where each row is one compound, independent experiment, and concentration, with technical replicates stored in response_* columns.
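
The relationship between the two layouts can be sketched with pandas. This is an illustration only, not BindCurve's implementation: the column names experiment_id, concentration, replicate, and response are assumptions (only compound_id and the response_* prefix appear in this PR), and the values are made up.

```python
import io

import pandas as pd

# Made-up replicate_wide data: one row per compound, experiment, and
# concentration, with technical replicates in response_* columns.
# All column names except compound_id and response_* are assumptions.
wide_csv = io.StringIO(
    "compound_id,experiment_id,concentration,response_1,response_2\n"
    "cmpd_A,exp1,0.1,0.12,0.15\n"
    "cmpd_A,exp1,1.0,0.48,0.52\n"
)
wide = pd.read_csv(wide_csv)

id_cols = ["compound_id", "experiment_id", "concentration"]
replicate_cols = [c for c in wide.columns if c.startswith("response_")]

# Unpivot the replicate columns so each measurement gets its own row.
long_df = wide.melt(
    id_vars=id_cols,
    value_vars=replicate_cols,
    var_name="replicate",
    value_name="response",
).dropna(subset=["response"])  # empty replicate cells carry no measurement

# Keep only the replicate label ("1", "2", ...) from the column name.
long_df["replicate"] = long_df["replicate"].str.replace("response_", "", regex=False)
long_df = long_df.sort_values(id_cols + ["replicate"]).reset_index(drop=True)

print(long_df)
```

Each wide row with two filled response_* cells becomes two long rows, so the two example rows yield four measurements.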

Changes

  • Adds data_formats.md in the repository root with documentation for both formats.
  • Extends DoseResponseData.from_csv() with an explicit format argument supporting long and replicate_wide.
  • Keeps format="long" as the default so existing from_csv(path) behavior remains compatible.
  • Adds four synthetic example CSV files in the repository root:
    • synthetic_direct_binding_long.csv
    • synthetic_direct_binding_replicate_wide.csv
    • synthetic_competitive_binding_long.csv
    • synthetic_competitive_binding_replicate_wide.csv
  • Adds tests for loading both formats, checking normalization, and validating error handling.
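
As a rough sketch of what an explicit format argument with a backward-compatible default might look like, the helper below validates a CSV header per format. The function check_header, the REQUIRED table, and the required column names are hypothetical and are not taken from BindCurve; the actual schema is the one documented in data_formats.md.

```python
import csv
import io

# Hypothetical required columns per format (assumption, not BindCurve's schema).
REQUIRED = {
    "long": {"compound_id", "concentration", "response"},
    "replicate_wide": {"compound_id", "concentration"},
}


def check_header(handle, format="long"):
    """Validate the header row of an open CSV file for the given format.

    The default stays "long" so callers that never pass a format keep
    their existing behavior.
    """
    if format not in REQUIRED:
        raise ValueError(
            f"unknown format {format!r}; expected one of {sorted(REQUIRED)}"
        )
    header = next(csv.reader(handle), [])
    missing = REQUIRED[format] - set(header)
    if missing:
        raise ValueError(f"{format} CSV is missing columns: {sorted(missing)}")
    if format == "replicate_wide" and not any(
        name.startswith("response_") for name in header
    ):
        raise ValueError("replicate_wide CSV needs at least one response_* column")
    return header


# Usage with in-memory data; no format argument means "long".
print(check_header(io.StringIO("compound_id,concentration,response\n")))
print(
    check_header(
        io.StringIO("compound_id,concentration,response_1,response_2\n"),
        format="replicate_wide",
    )
)
```

Failing early on an unknown format or a missing column gives tests a single, predictable exception type to assert on.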

Notes

The feature branch is currently behind main because main moved after the branch point, but the diff is limited to the data-format implementation, docs, examples, and tests.

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2a88b3614

Comment on lines +260 to +264:

    columns={
        source: target
        for source, target in rename_map.items()
        if source != target
    }

P2: Prevent duplicate canonical columns when normalizing long CSVs

When a caller provides custom long-format column names that map onto canonical names already present in the CSV (for example compound_col="cmpd" while compound_id also exists), this rename step can create duplicate labels like two compound_id columns. Downstream normalization/validation then operates on ambiguous columns and can raise runtime errors (e.g., grouping on compound_id becomes non-1D) or use unintended data. This is a regression in the new from_csv(..., format="long") path and should be handled by detecting collisions before rename and failing with a clear error (or resolving precedence explicitly).
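
One way to detect such collisions before the rename is sketched below. The helper name check_rename_collisions is hypothetical, and this is not the fix applied in the PR; it works on plain column labels, so it can run before any DataFrame rename.

```python
def check_rename_collisions(columns, rename_map):
    """Return the renames that would actually apply, or raise on a collision.

    A collision means a rename target is already present as a column that
    is not itself being renamed, or two sources map to the same target.
    """
    columns = list(columns)
    # Only renames that change a label and whose source exists matter.
    effective = {
        source: target
        for source, target in rename_map.items()
        if source != target and source in columns
    }
    # Labels that keep their name after the rename.
    untouched = [name for name in columns if name not in effective]
    targets = list(effective.values())
    collisions = sorted(
        target
        for target in set(targets)
        if target in untouched or targets.count(target) > 1
    )
    if collisions:
        raise ValueError(
            f"renaming would create duplicate columns: {collisions}"
        )
    return effective


# The review's example: "cmpd" maps to "compound_id", which already exists.
try:
    check_rename_collisions(["cmpd", "compound_id"], {"cmpd": "compound_id"})
except ValueError as exc:
    print(exc)
```

Failing with a clear error is the simpler of the two options named in the review; resolving precedence explicitly would instead require deciding which of the two columns wins.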

@choutkaj merged commit 8a585be into main on May 11, 2026
0 of 5 checks passed