
Conversation

@Taniya-Das (Member) commented Jun 18, 2025

Metadata

  • Reference Issue:
  • New Tests Added:
  • Documentation Updated:
  • Change Log Entry:

Details

The downloaded sparse data file is trimmed by removing some rows and columns to decrease the file size for testing.
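For illustration, a minimal sketch of what such a trimming step could look like, assuming the data is handled as a scipy sparse matrix; the file names and the helper are hypothetical, and the target shape matches the assertion quoted later in this thread:

```python
# Hypothetical sketch of the trimming step; file names are placeholders.
import scipy.sparse as sp

def trim_sparse(matrix: sp.spmatrix, n_rows: int, n_cols: int) -> sp.csr_matrix:
    """Keep only the first n_rows rows and n_cols columns to shrink the fixture."""
    return matrix.tocsr()[:n_rows, :n_cols]

full = sp.load_npz("full_dataset.npz")       # placeholder for the downloaded file
trimmed = trim_sparse(full, 600, 20000)      # shape asserted in the test below
sp.save_npz("trimmed_dataset.npz", trimmed)  # smaller file checked into the repo
```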

@Taniya-Das Taniya-Das changed the title Maint/to pytest test dataset openmldatasettestsparse Maint/to pytest test dataset sparse dataset Jun 18, 2025
@codecov-commenter commented Jun 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.73%. Comparing base (6103874) to head (8a75cb4).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1418      +/-   ##
===========================================
+ Coverage    53.71%   53.73%   +0.01%     
===========================================
  Files           38       38              
  Lines         5229     5229              
===========================================
+ Hits          2809     2810       +1     
+ Misses        2420     2419       -1     

☔ View full report in Codecov by Sentry.
@Taniya-Das Taniya-Das marked this pull request as draft June 19, 2025 08:05
@Taniya-Das Taniya-Das marked this pull request as ready for review June 19, 2025 10:17
@LennartPurucker (Contributor) left a comment


Minor comments

assert isinstance(X, pd.DataFrame)
assert isinstance(X.dtypes[0], pd.SparseDtype)
assert X.shape == (600, 20000)
@pytest.mark.production
Contributor

If we mock the test, do we still need the mark here? I think we can remove it as long as we no longer connect to any server.

Member Author

By default it tries to connect to the test server https://test.openml.org/ otherwise.
Since it is just a mock, I could point the mocked files at the test server instead. But that might not be a good thing, as these datasets don't exist on the test server.

Collaborator

We technically also have the @pytest.mark.server() marker for things that actually connect to the server. So it makes sense to update this for consistency: @pytest.mark.production just means a production configuration, and @pytest.mark.server means an actual network operation is performed (not everything is mocked).

Collaborator

So what I am saying is that @pytest.mark.production() is needed for any production server configuration, even if the test does not actually access the production server (it's about how URLs are formed internally). Otherwise we would end up with that race condition again. Either that, or modify the URLs and the files to use the test server constants -- but that's not particularly clear either, because the mocks are based on production data.
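To make the convention above concrete, a hedged sketch of how the two markers could be registered and applied; the test names and bodies are hypothetical, only the marker semantics come from this discussion:

```python
import pytest

# conftest.py-style registration so pytest does not warn about unknown markers.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "production: uses the production server configuration (URLs), even if fully mocked"
    )
    config.addinivalue_line(
        "markers", "server: performs an actual network operation (not everything is mocked)"
    )

@pytest.mark.production          # production URL formation, all responses mocked
def test_sparse_dataset_mocked(mock_sparse_categorical_395):
    ...

@pytest.mark.production
@pytest.mark.server              # really talks to a server
def test_sparse_dataset_live():
    ...
```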


description_file = base_path / "description.xml"
requests_mock.get(
"https://www.openml.org/api/v1/xml/data/395",
Contributor

maybe make the API base path a fixture as well

Member Author

OK, but it doesn't make much of a difference; the generic base is just test_files_directory / "mock_responses".
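For what it's worth, one possible shape for the suggested fixtures, as a sketch only; the fixture names and directory layout below are assumptions, while the URL and test_files_directory / "mock_responses" come from this thread:

```python
import pytest

@pytest.fixture
def api_base() -> str:
    # Base of the production XML API, as used in the mocked URL above.
    return "https://www.openml.org/api/v1/xml"

@pytest.fixture
def mock_responses_dir(test_files_directory):
    # Generic base for stored mock responses.
    return test_files_directory / "mock_responses"

@pytest.fixture
def mock_sparse_categorical_395(requests_mock, api_base, mock_responses_dir):
    base_path = mock_responses_dir / "datasets" / "395"  # hypothetical layout
    description_file = base_path / "description.xml"
    requests_mock.get(
        f"{api_base}/data/395",
        text=description_file.read_text(),
    )
```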

@Taniya-Das Taniya-Das marked this pull request as draft June 19, 2025 14:16
@Taniya-Das Taniya-Das marked this pull request as ready for review June 19, 2025 14:54
Comment on lines +346 to +355
def test_get_sparse_categorical_data_id_395(mock_sparse_categorical_395):

dataset = openml.datasets.get_dataset(395, download_data=True)
feature = dataset.features[3758]
assert isinstance(dataset, OpenMLDataset)
assert isinstance(feature, OpenMLDataFeature)
assert dataset.name == "re1.wc"
assert feature.name == "CLASS_LABEL"
assert feature.data_type == "nominal"
assert len(feature.nominal_values) == 25
Collaborator

It looks like this is the only test that uses mock_sparse_categorical_395, is that correct?
If so, we can remove the data file from the repository, and remove download_data=True, since it looks like we are only interested in accessing features.
On that note, we could also remove most of the features XML file and keep only the feature we are interested in analysing. Let me know if you have any questions about it.
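A hedged sketch of trimming the features XML down to the one feature of interest; the element and namespace names here are assumptions about the OpenML features format and should be checked against the real file:

```python
import xml.etree.ElementTree as ET

NS = {"oml": "http://openml.org/openml"}  # assumed namespace URI

def keep_single_feature(path: str, index: str = "3758") -> None:
    """Drop every <oml:feature> except the one with the given index."""
    tree = ET.parse(path)
    root = tree.getroot()
    for feature in list(root.findall("oml:feature", NS)):
        if feature.findtext("oml:index", namespaces=NS) != index:
            root.remove(feature)
    tree.write(path, xml_declaration=True, encoding="utf-8")
```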
