Project Catalogs should allow check_valid and drop_duplicates = False #535

Closed
1 task done
SarahG-579462 opened this issue Feb 26, 2025 · 2 comments · Fixed by #536
Labels
enhancement New feature or request

Comments

@SarahG-579462
Contributor

Addressing a Problem?

When opening a large catalog (e.g. regional climate models at hourly frequency), using a project catalog can be prohibitively slow, due to check_valid needing to touch every file in the catalog.

Potential Solution

If check_valid (and, likewise, drop_duplicates) could be set to False when opening a Project Catalog, this would be fixed.
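
To make the request concrete, here is a minimal sketch of the behaviour being asked for. ToyCatalog, its rows, and the files_checked counter are all hypothetical and use only the standard library; this is not xscen's ProjectCatalog code. It only illustrates that a validity check has to touch every file listed in the catalog, and that two opt-out flags would skip that work.

```python
import os
import tempfile


class ToyCatalog:
    """Hypothetical stand-in for a file catalog (not xscen's implementation)."""

    def __init__(self, rows, check_valid=True, drop_duplicates=True):
        self.rows = list(rows)
        self.files_checked = 0  # counts how many paths were touched on disk
        if check_valid:
            self._check_valid()
        if drop_duplicates:
            self._drop_duplicates()

    def _check_valid(self):
        # One filesystem call per row: this is the part that scales badly
        # when the catalog lists thousands of files.
        valid = []
        for row in self.rows:
            self.files_checked += 1
            if os.path.exists(row["path"]):
                valid.append(row)
        self.rows = valid

    def _drop_duplicates(self):
        # Keep the first row seen for each path.
        seen = set()
        unique = []
        for row in self.rows:
            if row["path"] not in seen:
                seen.add(row["path"])
                unique.append(row)
        self.rows = unique


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        existing = os.path.join(tmp, "a.nc")
        open(existing, "w").close()  # empty placeholder file
        missing = os.path.join(tmp, "b.nc")  # never created
        rows = [
            {"id": "a", "path": existing},
            {"id": "b", "path": missing},
            {"id": "a", "path": existing},  # duplicate entry
        ]
        strict = ToyCatalog(rows)
        fast = ToyCatalog(rows, check_valid=False, drop_duplicates=False)
        return (len(strict.rows), strict.files_checked,
                len(fast.rows), fast.files_checked)
```

With both flags left on, every path is touched and the missing and duplicate rows are dropped. With both set to False, the rows are kept as given and no file is touched, which is the trade-off the issue is asking to make available.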

Additional context

Trying to open a catalog with 7128 rows, each pointing to a ~500MB netCDF file (~2 GB uncompressed), took longer than 5 minutes, which is unnecessarily slow. The catalog file itself is only 2.8MB.

Contribution

  - [x] I would be willing/able to open a Pull Request to contribute this feature.
@aulemahal
Collaborator

I agree with the PR and the idea, but I'm not sure I understand why you are putting the raw MRCC data into a ProjectCatalog?

I think our design idea was that you first search in a DataCatalog and then only put datasets you have created into the ProjectCatalog. Is there another issue that made you generate 7128 netCDFs within your project, or made it necessary to use a ProjectCatalog including raw data?

@SarahG-579462
Contributor Author

> I agree with the PR and the idea, but I'm not sure I understand why you are putting the raw MRCC data into a ProjectCatalog?
>
> I think our design idea was that you first search in a DataCatalog and then only put datasets you have created into the ProjectCatalog. Is there another issue that made you generate 7128 netCDFs within your project, or made it necessary to use a ProjectCatalog including raw data?

I was subsetting the DataCatalog, and saving as a new catalog using ProjectCatalog, since opening/searching the MRCC5 catalog can take a while. Maybe there's a better way to do that?


2 participants