Maint/to pytest test dataset sparse dataset #1418
base: develop
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@ Coverage Diff @@
##           develop    #1418      +/-   ##
===========================================
+ Coverage    53.71%   53.73%   +0.01%
===========================================
  Files           38       38
  Lines         5229     5229
===========================================
+ Hits          2809     2810       +1
+ Misses        2420     2419       -1

☔ View full report in Codecov by Sentry.
Minor comments
assert isinstance(X, pd.DataFrame)
assert isinstance(X.dtypes[0], pd.SparseDtype)
assert X.shape == (600, 20000)

@pytest.mark.production
If we mock the test, do we still need the mark here? I think we can remove it as long as we no longer connect to any server.
Otherwise, it tries by default to connect to the test server https://test.openml.org/.
As it is just a mock, I could point the mocked files at the test server instead. But that might not be a good thing, as these datasets don't exist on the test server.
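For context, a minimal sketch of the configuration involved (this assumes openml-python's openml.config.server setting, which controls the base URL the client uses to build API requests; which value is the suite default is taken from the discussion above):

```python
import openml

# Which server a test "talks to" is decided by configuration, even when all
# HTTP traffic is mocked: the client builds its request URLs from this base.
openml.config.server = "https://test.openml.org/api/v1/xml"  # test server (the default referred to above)
openml.config.server = "https://www.openml.org/api/v1/xml"   # production server
```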
We technically also have the @pytest.mark.server() marker for things that actually connect to the server. So it makes sense to update this for consistency: @pytest.mark.production just means a production configuration, and @pytest.mark.server means an actual network operation is performed (not everything is mocked).
So what I am saying is that @pytest.mark.production() is needed for any production server configuration, even if the test does not actually access production (it's about how URLs are formed internally). Otherwise we would still end up with that race condition again. Either that, or modify the URLs and the files to use the test server constants -- but that's not particularly clear either, because the mocks are based on production data.
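To make the distinction explicit, a sketch of how the two markers could be registered in a conftest.py (standard pytest marker registration; the descriptions paraphrase this thread and are not the repository's actual wording):

```python
# conftest.py -- sketch only, not the repository's actual configuration
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "production: test relies on a production server configuration "
        "(URLs are formed against https://www.openml.org), even if all HTTP traffic is mocked",
    )
    config.addinivalue_line(
        "markers",
        "server: test performs a real network operation against a live server",
    )
```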
description_file = base_path / "description.xml"
requests_mock.get(
    "https://www.openml.org/api/v1/xml/data/395",
Maybe make the API base path a fixture as well.
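A sketch of the suggested fixture (the name api_base is hypothetical, not code from this PR):

```python
import pytest


@pytest.fixture
def api_base() -> str:
    # Base URL against which the mocked endpoints are registered.
    return "https://www.openml.org/api/v1/xml"
```

The mock above would then become requests_mock.get(f"{api_base}/data/395", ...), so a server change only needs to be made in one place.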
OK, but it doesn't make a lot of difference; the generic base is just test_files_directory / "mock_responses".
def test_get_sparse_categorical_data_id_395(mock_sparse_categorical_395):
    dataset = openml.datasets.get_dataset(395, download_data=True)
    feature = dataset.features[3758]
    assert isinstance(dataset, OpenMLDataset)
    assert isinstance(feature, OpenMLDataFeature)
    assert dataset.name == "re1.wc"
    assert feature.name == "CLASS_LABEL"
    assert feature.data_type == "nominal"
    assert len(feature.nominal_values) == 25
It looks like this is the only test which uses mock_sparse_categorical_395, is that correct? If so, we can remove the data file from the repository, and remove download_data=True since it looks like we are only interested in accessing features. And on that note, we could also remove most of the features XML file and keep only the feature we are interested in analysing. Let me know if you have any questions about it.
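A sketch of the slimmed-down test this would give (an assumption about the suggested change, not code from this PR):

```python
def test_get_sparse_categorical_data_id_395(mock_sparse_categorical_395):
    # Without download_data=True the data file is never fetched, so only the
    # description and features XML mocks are needed.
    dataset = openml.datasets.get_dataset(395, download_data=False)
    feature = dataset.features[3758]
    assert dataset.name == "re1.wc"
    assert feature.name == "CLASS_LABEL"
    assert feature.data_type == "nominal"
    assert len(feature.nominal_values) == 25
```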
The downloaded sparse data file is trimmed by removing some rows and columns to decrease the file size for testing.