-
Notifications
You must be signed in to change notification settings - Fork 69
fix: Accept X and y as positional argument with as_dict=True in train_test_split #1570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: Accept X and y as positional argument with as_dict=True in train_test_split #1570
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some tests? It would also help to demonstrate what the new behaviour looks like.
@auguste-probabl |
skore/tests/unit/sklearn/train_test_split/test_train_test_split.py
Outdated
Show resolved
Hide resolved
if y is not None: | ||
new_arrays.append(y) | ||
keys += ["y"] | ||
|
||
if as_dict and arrays: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you remove this? If both are true, it will cause a conflict.
|
||
new_arrays = list(keyword_arrays.values()) | ||
|
||
if X is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this repeated inside and outside the if? It is redundant.
if X is not None: | ||
new_arrays.append(X) | ||
keys.append("X") | ||
if y is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this repeated inside and outside the if? It is redundant.
@@ -167,21 +177,20 @@ class labels. | |||
stratify=stratify, | |||
) | |||
|
|||
if X is None: | |||
X = arrays[0] if len(arrays) == 1 else arrays[-2] | |||
if X is None and len(arrays) >= 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect value for case when len(arrays)>1
?
…thub.com/amltarek/skore into fix-Passing-all-datasets-by-keyword-1544
@auguste-probabl I have implemented the changes as you requested and also tested it. Additionally, I modified another function, test_train_test_split_dict_kwargs(), because it was throwing an error when data was passed without keyword arguments while return_dict=True. This issue has now been fixed. Furthermore, I addressed all the changes requested by @nkapila6. |
skore/tests/unit/sklearn/train_test_split/test_train_test_split.py
Outdated
Show resolved
Hide resolved
skore/tests/unit/sklearn/train_test_split/test_train_test_split.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test with
arr1 = [[1]] * 20
arr2 = [0] * 10 + [1] * 10
train_test_split(arr2, z=arr1, as_dict=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is tested through two functions named( test_train_test_split_check_dict()) and test_train_test_split_dict_kwargs().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit different: right now we test either all arguments passed by keyword, or all arguments passed by position. I'd like to also test the combination of both (one array passed by position, one array passed by keyword).
skore/tests/unit/sklearn/train_test_split/test_train_test_split.py
Outdated
Show resolved
Hide resolved
skore/tests/unit/sklearn/train_test_split/test_train_test_split.py
Outdated
Show resolved
Hide resolved
skore/tests/unit/sklearn/train_test_split/test_train_test_split.py
Outdated
Show resolved
Hide resolved
I have removed all the duplicate functions. Can you check the code? @auguste-probabl |
result = train_test_split( | ||
X=X, | ||
y=y, | ||
sample_weights=weights, | ||
test_size=0.2, | ||
as_dict=True, | ||
random_state=0, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test feels a bit redundant, but
train_test_split(
X,
y,
sample_weights=weights,
...
)
(i.e. a mix of positional and keyword arguments) would be interesting. See also my other comment
@amltarek Can you resolve comments when it's clear that you have addressed them? It helps review your code. |
Also please sign your commits |
Coverage Report for backend
|
@auguste-probabl Can you check the updates? |
def test_empty_input(): | ||
"""Tests that passing empty lists for X and y raises a ValueError.""" | ||
X = [] | ||
y = [] | ||
with pytest.raises(ValueError): | ||
train_test_split(X, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this test is needed; this behaviour is not specific to our function, but rather to sklearn's.
assert "X_train" in result | ||
assert "X_test" in result | ||
assert "y_train" in result | ||
assert "y_test" in result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this
>>> # When using positional arguments and as_dict=True | ||
>>> # the first argument is assumed to be X, the second y | ||
>>> train_test_split( | ||
... [[1], [2], [3], [4]], [0, 1, 0, 1], as_dict=True, random_state=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... [[1], [2], [3], [4]], [0, 1, 0, 1], as_dict=True, random_state=0 | |
... [[1], [2], [3], [4]], [0, 1, 0, 1], as_dict=True |
No need for random_state
since we don't check the output arrays. You can also remove random_state
in the previous doctest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also test what happens with
train_test_split(X, X=X)
I think there should be an error like
X cannot be passed both by position and by keyword.
Same for y.
closes #1544
I fixed the issue by updating the handling of keyword arguments when as_dict=True is used. Now, if all datasets are passed as keywords, the function directly returns them as a dictionary without extra processing. This makes the behavior more intuitive and avoids redundancy. I also tested the fix through test_train_split_test.py to ensure it works correctly.
