Conversation
Pull request overview
This PR refactors pygama.flow away from the legacy FileDB/DataLoader-based workflow toward a new “structured query” API that builds LH5Iterator instances and exposes higher-level helpers for querying tables and producing histograms.
Changes:
- Removed the legacy flow stack (`DataLoader`, `FileDB`, and related utils).
- Added new query-oriented APIs: `build_iterator`, `query_data`, `query_hist`, and `query_evt`.
- Updated `pygama.flow` public exports and module documentation to reflect the new query interface.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| src/pygama/flow/__init__.py | Replaces the public surface/docs from DataLoader/FileDB to query functions. |
| src/pygama/flow/build_iterator.py | New helper to construct an LH5Iterator across tiers based on metadata queries. |
| src/pygama/flow/query_data.py | New table query helper built on build_iterator + LH5Iterator.query. |
| src/pygama/flow/query_hist.py | New histogram query helper built on build_iterator + LH5Iterator.hist. |
| src/pygama/flow/query_evt.py | New evt-tier-only query helper using query_runs + direct iterator construction. |
| src/pygama/flow/utils.py | Removed legacy utility functions used by the old flow stack. |
| src/pygama/flow/file_db.py | Removed legacy FileDB implementation. |
| src/pygama/flow/data_loader.py | Removed legacy DataLoader implementation. |
```python
runs: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
channels: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
```

The type annotation `Mapping[np.ndarray]` is invalid on Python 3.10+ and will raise at import time. Use `Mapping[str, np.ndarray]` (or similar) or remove the subscripting.

Suggested change:

```diff
-runs: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
-channels: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
+runs: str | ak.Array | Mapping[str, np.ndarray] | pd.DataFrame,
+channels: str | ak.Array | Mapping[str, np.ndarray] | pd.DataFrame,
```
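To illustrate why this fails at import time (not PR code): `typing.Mapping` takes exactly two type parameters, so a one-parameter subscript is rejected as soon as the annotation is evaluated, which happens when the `def` statement runs. `int` stands in for `np.ndarray` here to keep the sketch dependency-free.

```python
# Illustration only: typing.Mapping is parameterized by key AND value
# types, so subscripting it with a single type raises TypeError.
from typing import Mapping

try:
    Mapping[int]  # wrong arity, like Mapping[np.ndarray] in the PR
except TypeError as exc:
    print(f"rejected: {exc}")

valid = Mapping[str, int]  # two parameters: accepted
print(valid)
```

(Note that `collections.abc.Mapping` does not arity-check its subscripts, so the failure mode depends on which `Mapping` is imported; the fix of writing both parameters is correct either way.)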
```python
import pandas as pd
from dbetto import Props
from legendmeta.query import _format_vars, parse_query_paths, query_runs
from lh5 import LH5Iterator
```

This imports `LH5Iterator` from a top-level `lh5` module, but the project dependencies (pyproject.toml) don't currently include a package that provides `lh5` (the rest of the codebase uses `lgdo.lh5`). This is likely to fail at runtime unless the dependency change is included here.

Suggested change:

```diff
-from lh5 import LH5Iterator
+from lgdo.lh5 import LH5Iterator
```
```python
import pandas as pd
from dbetto import Props
from legendmeta.query import _format_vars, parse_query_paths, query_meta
from lh5 import LH5Iterator
```

This imports `LH5Iterator` from a top-level `lh5` module, but pyproject.toml does not declare a dependency that provides `lh5` (the codebase elsewhere uses `lgdo.lh5`). As-is, this is likely to fail at runtime in CI/user installs. Either switch to the existing import path (`lgdo.lh5`) or add/lock the new dependency providing `lh5` as part of this PR.

Suggested change:

```diff
-from lh5 import LH5Iterator
+from lgdo.lh5 import LH5Iterator
```
```python
if executor is None and isinstance(processes, int):
    executor = ProcessPoolExecutor(processes)

lh5_it, alias_map = build_iterator(
    {f for f, _, _ in field_info + entries_fields},
    runs,
    channels,
    dataflow_config=dataflow_config,
    return_alias_map=True,
    processes=processes,
    executor=executor,
    **kwargs,
)

fields = {}
for _, alias, path in field_info:
    if path in alias_map:
        fields[alias_map[path]] = None
    else:
        fields[path] = alias

ret = lh5_it.query(
    entries,
    fields=fields if not return_query_vals else None,
    processes=processes,
    executor=executor,
    library=library,
)

if return_alias_map:
    return ret, alias_map
return ret
```

When `executor` is not provided, this function creates a `ProcessPoolExecutor` but never shuts it down. That can leave worker processes running and hang interpreters/tests. If you create the executor internally, ensure it's shut down (e.g., use a context manager or call `shutdown()` in a `finally` block after `lh5_it.query(...)`).

Suggested change:

```diff
-if executor is None and isinstance(processes, int):
-    executor = ProcessPoolExecutor(processes)
-
-lh5_it, alias_map = build_iterator(
-    {f for f, _, _ in field_info + entries_fields},
-    runs,
-    channels,
-    dataflow_config=dataflow_config,
-    return_alias_map=True,
-    processes=processes,
-    executor=executor,
-    **kwargs,
-)
-
-fields = {}
-for _, alias, path in field_info:
-    if path in alias_map:
-        fields[alias_map[path]] = None
-    else:
-        fields[path] = alias
-
-ret = lh5_it.query(
-    entries,
-    fields=fields if not return_query_vals else None,
-    processes=processes,
-    executor=executor,
-    library=library,
-)
-
-if return_alias_map:
-    return ret, alias_map
-return ret
+created_executor = False
+if executor is None and isinstance(processes, int):
+    executor = ProcessPoolExecutor(processes)
+    created_executor = True
+try:
+    lh5_it, alias_map = build_iterator(
+        {f for f, _, _ in field_info + entries_fields},
+        runs,
+        channels,
+        dataflow_config=dataflow_config,
+        return_alias_map=True,
+        processes=processes,
+        executor=executor,
+        **kwargs,
+    )
+    fields = {}
+    for _, alias, path in field_info:
+        if path in alias_map:
+            fields[alias_map[path]] = None
+        else:
+            fields[path] = alias
+    ret = lh5_it.query(
+        entries,
+        fields=fields if not return_query_vals else None,
+        processes=processes,
+        executor=executor,
+        library=library,
+    )
+    if return_alias_map:
+        return ret, alias_map
+    return ret
+finally:
+    if created_executor and executor is not None:
+        executor.shutdown(wait=True)
```
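The ownership pattern behind this suggestion can be sketched generically: an executor created inside the function must be shut down there, while a caller-supplied one is left alone. `ThreadPoolExecutor` is used below so the sketch runs anywhere; the lifecycle is identical for `ProcessPoolExecutor`. The function name and workload are illustrative, not pygama APIs.

```python
# Generic sketch of "create-if-absent, shut down in finally":
# only an executor we created ourselves gets shut down.
from concurrent.futures import ThreadPoolExecutor


def run_query(values, workers=None, executor=None):
    created = False
    if executor is None and isinstance(workers, int):
        executor = ThreadPoolExecutor(workers)
        created = True
    try:
        if executor is None:
            return [v * v for v in values]  # serial fallback
        return list(executor.map(lambda v: v * v, values))
    finally:
        if created:
            executor.shutdown(wait=True)  # no leaked workers


print(run_query([1, 2, 3], workers=2))  # [1, 4, 9]
```

Returning from inside the `try` still runs the `finally` block, so the shutdown is guaranteed even on the early-return paths in the PR's function.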
```python
{
    "cuspEmax_ctc_cal": axis.Regular(300, 0, 3000, label="Energy (keV)"),
    "@det.name": axis.StrCategory(label="Detector", growth=True)"
}
```

The docstring example has a stray trailing quote, and several examples in this file end with mismatched backticks (e.g., ``...````). These will render incorrectly in the generated docs and can produce Sphinx warnings. Please fix the quoting/backtick markup in the examples.
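For reference, a corrected version of the example might read as follows (a sketch assuming the `hist` library's axis API; note that `StrCategory` also expects an initial categories iterable, supplied here as `[]`, which the original example omitted):

```python
{
    "cuspEmax_ctc_cal": axis.Regular(300, 0, 3000, label="Energy (keV)"),
    "@det.name": axis.StrCategory([], label="Detector", growth=True),
}
```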
```python
    lh5_it = new_it
else:
    lh5_it.add_friend(new_it)
```

`lh5_it` can remain `None` if no suitable tiers are found/kept (e.g., `tiers` filters everything, or all tier table formats include channel variables and are skipped). The unconditional `lh5_it.reset_field_mask(...)` will then raise `AttributeError`. Add a check after the loop and raise a clear error when no iterator could be built.

Suggested change:

```diff
+if lh5_it is None:
+    msg = (
+        "no suitable LH5 tiers were found for the requested fields; "
+        "check 'tiers', 'tables', and requested field paths"
+    )
+    raise ValueError(msg)
```
```diff
 from legendmeta.query import query_meta, query_runs

 from .build_iterator import build_iterator
 from .query_data import query_data
 from .query_evt import query_evt
 from .query_hist import query_hist

-__all__ = ["DataLoader", "FileDB"]
+__all__ = [
+    "query_runs",
+    "query_meta",
+    "query_data",
+    "query_hist",
+    "query_evt",
+    "build_iterator",
+]
```

This module no longer exports `DataLoader` and `FileDB`, and the corresponding implementation files were removed. The repository's test suite still imports these symbols (tests/flow/test_data_loader.py, tests/flow/test_filedb.py), so CI will fail unless the tests (and any downstream API expectations) are updated or a deprecation/compatibility layer is provided.
```python
runs: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
channels: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
```

The type annotation `Mapping[np.ndarray]` is invalid on Python 3.10+ (`Mapping` expects two type parameters). This will raise at import time and break the module. Use `Mapping[str, np.ndarray]` (or an appropriate key/value pairing) or drop the subscripting if the key type isn't important.

Suggested change:

```diff
-runs: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
-channels: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
+runs: str | ak.Array | Mapping[str, np.ndarray] | pd.DataFrame,
+channels: str | ak.Array | Mapping[str, np.ndarray] | pd.DataFrame,
```
```python
*,
dataflow_config: Path | str | Mapping = "$REFPROD/dataflow-config.yaml",
tiers: Collection[str] = None,
tables: Collection[str] = None,
```

`tables` is annotated as `Collection[str]` but is used as a mapping (`tables.items()` and `tables[tier]`). This mismatch can hide real runtime errors. Update the type to something like `Mapping[str, str]`, and consider validating that required tier keys exist before indexing, to avoid a `KeyError` when `paths` contains tiers not present in `tables`.

Suggested change:

```diff
-tables: Collection[str] = None,
+tables: Mapping[str, str] | None = None,
```
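The suggested validation could look something like the following sketch (a hypothetical guard, not PR code; the helper name and tier/table strings are invented for illustration): check that every tier referenced by the requested paths has an entry in the `tables` mapping before any `tables[tier]` lookup, turning a late `KeyError` deep in the iterator build into one clear message up front.

```python
# Hypothetical guard: fail early, with all missing tiers listed,
# instead of raising KeyError on the first tables[tier] access.
from collections.abc import Mapping


def check_tier_tables(tiers_needed, tables):
    if not isinstance(tables, Mapping):
        raise TypeError("tables must map tier name -> table template")
    missing = sorted(set(tiers_needed) - set(tables))
    if missing:
        raise KeyError(f"no table template for tiers: {missing}")


# Illustrative tier/table names, not real pygama configuration.
tables = {"hit": "ch{ch}/hit", "dsp": "ch{ch}/dsp"}
check_tier_tables(["hit", "dsp"], tables)  # passes silently
try:
    check_tier_tables(["hit", "evt"], tables)
except KeyError as exc:
    print(exc)
```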
```python
runs: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
channels: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
```

The type annotation `Mapping[np.ndarray]` is invalid on Python 3.10+ and will raise at import time. Use `Mapping[str, np.ndarray]` (or similar) or remove the subscripting.

Suggested change:

```diff
-runs: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
-channels: str | ak.Array | Mapping[np.ndarray] | pd.DataFrame,
+runs: str | ak.Array | Mapping | pd.DataFrame,
+channels: str | ak.Array | Mapping | pd.DataFrame,
```
Replaced the existing flow module with one that provides a set of structured queries for accessing data. This depends on legend-exp/pylegendmeta#115 and legend-exp/legend-lh5io#9.

Note that no tests exist for these functions at this time.