Project Catalogs should allow check_valid and drop_duplicates = False #535

SarahG-579462 · 2025-02-26T16:30:26Z

Addressing a Problem?

When opening a large catalog (e.g. regional climate models at hourly frequency), using a ProjectCatalog can be prohibitively slow, because check_valid has to touch every file in the catalog.

Potential Solution

If check_valid and drop_duplicates were options for ProjectCatalog (so they could be set to False), this would be fixed.
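To illustrate the request, here is a minimal sketch of what making both checks optional could look like. It uses plain pandas and is not xscen's actual implementation: the helper name `load_catalog_rows` is hypothetical, and the column names `id` and `path` are assumptions made for this example.

```python
from pathlib import Path

import pandas as pd


def load_catalog_rows(
    df: pd.DataFrame,
    check_valid: bool = False,
    drop_duplicates: bool = False,
) -> pd.DataFrame:
    """Hypothetical helper (not part of xscen): optionally validate catalog rows."""
    if check_valid:
        # One filesystem call per row: this is the step that becomes slow
        # when the catalog lists thousands of files on a network drive.
        exists = df["path"].map(lambda p: Path(p).exists())
        df = df[exists]
    if drop_duplicates:
        # Assumed columns for this sketch: 'id' and 'path'.
        df = df.drop_duplicates(subset=["id", "path"])
    return df.reset_index(drop=True)
```

With both flags left at False, the table is returned without touching the filesystem, which is the fast path requested here for large catalogs.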

Additional context

Trying to open a catalog with 7128 rows of ~500 MB netCDF files (~2 GB uncompressed) took longer than 5 minutes, which is unnecessarily slow. The catalog file itself is only 2.8 MB.

Contribution

I would be willing/able to open a Pull Request to contribute this feature.

aulemahal · 2025-02-26T17:49:21Z

I agree with the PR and the idea, but I'm not sure I understand why you are putting the raw MRCC data into a ProjectCatalog?

I think our design idea was that you first search in a DataCatalog and then only put datasets you have created into the ProjectCatalog. Is there another issue that made you generate 7128 netCDFs within your project, or that made it necessary to use a ProjectCatalog including raw data?

SarahG-579462 · 2025-02-26T18:06:12Z

> I agree with the PR and the idea, but I'm not sure I understand why you are putting the raw MRCC data into a ProjectCatalog?
>
> I think our design idea was that you first search in a DataCatalog and then only put datasets you have created into the ProjectCatalog. Is there another issue that made you generate 7128 netCDFs within your project, or that made it necessary to use a ProjectCatalog including raw data?

I was subsetting the DataCatalog, and saving as a new catalog using ProjectCatalog, since opening/searching the MRCC5 catalog can take a while. Maybe there's a better way to do that?

SarahG-579462 added the enhancement New feature or request label Feb 26, 2025

SarahG-579462 self-assigned this Feb 26, 2025

SarahG-579462 mentioned this issue Feb 26, 2025

Project Catalog optional check_valid, drop_duplicates #536

Merged

6 tasks

SarahG-579462 closed this as completed in #536 Feb 26, 2025

SarahG-579462 closed this as completed in 7d47d9e Feb 26, 2025
