#944 longitudinal normalization #958
for more information, see https://pre-commit.ci
… maxabs_norm and robust_scale_norm
…ehrapy into 944-longitudinal-normalization
…oved old 3d tests that only raised valueErrors
Thank you! Already looks pretty good.
- Many of my comments are repetitive so I stopped repeating them after some time 😄
- Many of your tests have tons of useless comments. Let the code speak for itself and clean up any LLM leftovers, please.
- Please also follow the comments that I make in Öyku's PRs. One of them is to improve the PR description and add some usage examples.
Just a first quick pass. I'll let @eroell have a go and then I might have a look again.
Thanks!
… properly handle NaN values
…ehrapy into 944-longitudinal-normalization
… of .R and to use decorator for 3D arrays
Dropped a first intermediate review already, to be considered together with @sueoglu's :)
…e for more complicated functions that expect certain outcomes. removed unnecessary docstrings
… though. maxabs_norm and power_norm now advise the user against using dask arrays and correctly raise a NotImplementedError if one is still used. log_norm now also uses the new decorator
…FAULT_TEM_LAYER_NAME for examples
…nt about necessary raising of NotImplementedError, moved basic tests down to precise tests, removed docstrings
eroell
left a comment
The test for 3D is very complex and tests things that are not 3D-specific. Kick out anything that is not 3D-specific so that it is closer in size to the simple_impute test.
I had to dig quite a while to check some fundamental behaviors. And the group_by argument for 3D seems to not work at all:
This looks fine
edata = ed.dt.ehrdata_blobs(layer="tem_data")
print(f"{edata.layers["tem_data"].mean():.2f}")
print(f"{edata.layers["tem_data"].std():.2f}")
ep.pp.scale_norm(edata, layer="tem_data")
print(f"{edata.layers["tem_data"].mean():.2f}")
print(f"{edata.layers["tem_data"].std():.2f}")
0.61
5.88
! Feature was detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
! Feature types were inferred and stored in edata.var[feature_type]. Please verify using `ehrdata.feature_type_overview` and adjust if necessary using `ehrdata.replace_feature_types`.
-0.00
1.00
With groupby, the overall mean and std might not be exactly 0 or 1 as above. But currently, the input is not modified at all:
edata = ed.dt.ehrdata_blobs(layer="tem_data")
print(f"{edata.layers["tem_data"].mean():.2f}")
print(f"{edata.layers["tem_data"].std():.2f}")
ep.pp.scale_norm(edata, layer="tem_data", group_key="cluster")
print(f"{edata.layers["tem_data"].mean():.2f}")
print(f"{edata.layers["tem_data"].std():.2f}")
0.61
5.88
! Feature was detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
! Feature types were inferred and stored in edata.var[feature_type]. Please verify using `ehrdata.feature_type_overview` and adjust if necessary using `ehrdata.replace_feature_types`.
0.61
5.88
Simplified tests focusing on the most important parts are really required here
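A minimal sketch of the kind of focused check asked for here, assuming a plain NumPy stand-in rather than ehrapy itself. `group_scale_norm_3d_sketch` is a hypothetical helper, not the PR's implementation; it z-scores each variable within each group of a `(n_obs, n_var, n_timestamps)` array, and the check first asserts that the data actually changed.

```python
import numpy as np


def group_scale_norm_3d_sketch(arr, groups):
    """Hypothetical stand-in: z-score every variable separately within each group."""
    arr = np.asarray(arr, dtype=np.float64)
    groups = np.asarray(groups)
    out = np.empty_like(arr)
    for g in np.unique(groups):
        mask = groups == g
        block = arr[mask]  # shape: (n_obs_in_group, n_var, n_timestamps)
        # Statistics per variable, pooled over observations and time, ignoring NaN.
        mean = np.nanmean(block, axis=(0, 2), keepdims=True)
        std = np.nanstd(block, axis=(0, 2), keepdims=True)
        std[std == 0] = 1.0  # constant variables are only centered
        out[mask] = (block - mean) / std
    return out


rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(6, 2, 4))
clusters = np.array([0, 0, 0, 1, 1, 1])
normed = group_scale_norm_3d_sketch(data, clusters)

# The regression described above: the input must not come back unchanged.
assert not np.allclose(normed, data)
# Within each group, every variable is centered after normalization.
for g in np.unique(clusters):
    assert np.allclose(normed[clusters == g].mean(axis=(0, 2)), 0.0)
```

A test of this size only pins down the 3D group behavior; type checks and error handling can stay in the shared 2D tests.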
…d minor things in examples
…ementedError for dask arrays in group wise functions. added test_norm_group_3D that also actually verifies that the data has been changed by normalization
eroell
left a comment
This has improved now: The function calls seem to do their job, and from what I see internally, dask never computes the full data.
I'll try to stop being picky :) But there's a few things I spotted that should be improved before we can merge this.
…dError. simplified the test logic accordingly
Zethson
left a comment
Just a few more nitpicks (sorry). They'll become less and less as you make more PRs
X = edata.X if layer is None else edata.layers[layer]

if np.issubdtype(X.dtype, np.integer):
    X = X.astype(np.float64)
Why are we casting X to a specific type? This is not obvious behavior, is it?
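One possible motivation, which this thread does not confirm, is that NumPy silently truncates float results when they are written back into an integer array. A small sketch of that behavior, using made-up values:

```python
import numpy as np

X_int = np.array([1, 2, 4])          # integer dtype
X_float = X_int.astype(np.float64)   # explicit cast before normalizing

X_int[:] = X_int / 4       # float results are silently truncated on assignment
X_float[:] = X_float / 4   # float results are kept

print(X_int.tolist())    # [0, 0, 1]
print(X_float.tolist())  # [0.25, 0.5, 1.0]
```

If that is the reason for the cast, a short comment next to it would make the behavior obvious to readers of the code.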
>>> edata = ed.dt.mimic_2()
>>> edata_norm = ep.pp.scale_norm(edata, copy=True)
>>> import numpy as np
>>> edata = ed.dt.physionet2012(layer="tem_data")
@eroell why do we need the layer parameter here?
_raise_array_type_not_implemented(_log_norm_function, type(arr))


@_log_norm_function.register
Nit: Could we please always add the type that we're registering here as well? See e.g. Öyku's recent PR.
… of np.abs in several cases
PR Checklist
- docs is updated
Description of changes
#944
This PR implements normalization support for 3D EHRData objects. The implementation enables all existing normalization functions to work with longitudinal data with shape (n_obs, n_var, n_timestamps) but maintains backward compatibility with 2D data.
Technical details
- Treats .R as a named layer with 3D structure. Uses helper functions (_get_target_layer, _set_target_layer, normalize_3d_data, and _normalize_2d_data) to avoid code duplication.
- Each variable is processed independently by flattening the time dimension (n_obs x n_timestamps), applying the sklearn normalization function, then reshaping to 3D.
- Added tests for the new functions, including group functionality and NaN cases
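The flatten, normalize, reshape steps described under the technical details can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: `scale_norm_3d_sketch` is a hypothetical name, and a NaN-aware NumPy z-score stands in for the sklearn normalization function.

```python
import numpy as np


def scale_norm_3d_sketch(arr: np.ndarray) -> np.ndarray:
    """Z-score each variable of a (n_obs, n_var, n_timestamps) array."""
    n_obs, n_var, n_t = arr.shape
    # Move the variable axis last, then flatten observations and time:
    # (n_obs, n_var, n_t) -> (n_obs, n_t, n_var) -> (n_obs * n_t, n_var)
    flat = np.transpose(arr, (0, 2, 1)).reshape(n_obs * n_t, n_var).astype(np.float64)
    # Per-variable statistics over all observations and time points, ignoring NaN.
    mean = np.nanmean(flat, axis=0)
    std = np.nanstd(flat, axis=0)
    std[std == 0] = 1.0  # constant variables are only centered
    scaled = (flat - mean) / std
    # Undo the flattening: (n_obs * n_t, n_var) -> (n_obs, n_var, n_t)
    return scaled.reshape(n_obs, n_t, n_var).transpose(0, 2, 1)
```

Because the statistics are computed on the flattened 2D view, the same 2D code path can serve both 2D and 3D inputs, and NaN entries stay NaN in the output.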
Examples: