[Discussion] Efficient Row Selection for Multi-Engine Support #14816

Open
alchemist51 opened this issue Feb 21, 2025 · 12 comments

Comments

@alchemist51

alchemist51 commented Feb 21, 2025

Background

We have a use case where data is stored in multiple engines/formats, with Parquet as the primary format containing all the data. Text queries are handled by an inverted-index format, while numeric queries and aggregations are processed via Parquet files. Although the file formats differ, the data is sorted and stored in the same order across all of them.

We are using DataFusion to query the Parquet files and are wondering whether the result of a query can be represented as a bit set of document positions (example below). Bit sets from the different engines can be intersected to identify the documents that meet the criteria. The resulting bit set can then be used to fetch the relevant documents from Parquet.

Example:

Assume we have the following data stored in parquet file:
colA colB
200 Autumn leaves
200 Salty breeze
100 Misty mountains
100 Misty mountains
200 Velvet curtains

For example, assume we have a query like SELECT colB WHERE colA = 100

The matching documents can be represented as the bitset 00110 (row numbers start from the left). We want to use the matching-document information collected from any underlying engine to fetch the relevant rows from the Parquet file using DataFusion.
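
As a sketch of the idea, a bitset like 00110 can be converted into alternating skip/select run lengths, which is the shape the Parquet reader's row selection works with. This is a minimal, self-contained illustration; bitset_to_runs is a hypothetical helper, not an existing DataFusion or parquet API:

```rust
/// Convert a bitset string like "00110" into alternating (selected, run_length)
/// pairs, e.g. [(false, 2), (true, 2), (false, 1)].
/// This mirrors the skip/select run-length shape used for Parquet row selection.
fn bitset_to_runs(bits: &str) -> Vec<(bool, usize)> {
    let mut runs: Vec<(bool, usize)> = Vec::new();
    for c in bits.chars() {
        let selected = c == '1';
        match runs.last_mut() {
            // Extend the current run if the flag is unchanged.
            Some((s, len)) if *s == selected => *len += 1,
            // Otherwise start a new run.
            _ => runs.push((selected, 1)),
        }
    }
    runs
}

fn main() {
    // "00110": skip rows 0-1, select rows 2-3, skip row 4.
    println!("{:?}", bitset_to_runs("00110")); // [(false, 2), (true, 2), (false, 1)]
}
```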

What we explored

We found that one way to fetch specific rows in DataFusion is to create an access plan and pass it to ParquetExec. Since the complete plan is needed up front, we can't parallelize the work and start collecting data from Parquet incrementally, which reduces overall query performance. It is also memory-inefficient, since we need to iterate over the complete stream and convert it into the AccessPlan.

Possible Solution

If there were a way to:
  1. pass the iterator directly to DataFusion, or
  2. process the matching rows in batches,
then matching rows could be converted on demand from the iterator into a RowSelection in DataFusion, improving efficiency by reducing memory overhead.
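
The on-demand conversion described above could look roughly like the following: a small helper that walks a sorted iterator of matching row indices once, coalescing skip/select runs as it goes rather than materializing the full selection first. This is a sketch under assumed names (runs_from_sorted_rows is hypothetical), not an existing DataFusion interface:

```rust
/// Convert a sorted (strictly increasing) iterator of matching row indices
/// into alternating (selected, run_length) pairs covering `total_rows` rows.
/// The iterator is consumed in a single pass, so the caller never has to
/// materialize the full set of matching rows up front.
fn runs_from_sorted_rows(
    rows: impl Iterator<Item = usize>,
    total_rows: usize,
) -> Vec<(bool, usize)> {
    let mut runs: Vec<(bool, usize)> = Vec::new();
    let mut cursor = 0;
    // Append a run, merging with the previous run when the flag matches.
    let push = |runs: &mut Vec<(bool, usize)>, selected: bool, len: usize| {
        if len == 0 {
            return;
        }
        match runs.last_mut() {
            Some((s, l)) if *s == selected => *l += len,
            _ => runs.push((selected, len)),
        }
    };
    for row in rows {
        push(&mut runs, false, row - cursor); // skip the gap before this match
        push(&mut runs, true, 1);             // select the matching row
        cursor = row + 1;
    }
    push(&mut runs, false, total_rows - cursor); // trailing skip, if any
    runs
}

fn main() {
    // Matching rows 2 and 3 out of 5 -> skip 2, select 2, skip 1.
    println!("{:?}", runs_from_sorted_rows([2usize, 3].into_iter(), 5));
}
```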

Questions

  1. Are there existing mechanisms in DataFusion to handle external iterators or row sources?
  2. What are the best practices for integrating DataFusion with external data sources in a streaming or batched manner?
  3. Are there any plans or ongoing work in the DataFusion project that might address this use case?
  4. Any alternative approaches or design patterns that might help us achieve efficient row selection in our multi-engine implementation?

@alchemist51
Author

@alamb @andygrove please provide your opinion on this use case!

@alamb
Contributor

alamb commented Feb 22, 2025

Are there existing mechanisms in DataFusion to handle external iterators or row sources?

There is a PR we are currently working on related to metadata columns (which could provide row ids perhaps)

What are the best practices for integrating DataFusion with external data sources in a streaming or batched manner?

Are there any plans or ongoing work in the DataFusion project that might address this use case?

Any alternative approaches or design patterns that might help us achieve efficient row selection in our multi-engine implementation?

I think you should check out https://github.com/datafusion-contrib/datafusion-federation which has a variety of items that are used for building a federated query engine

@philippemnoel may also have ideas / suggestions for this

@alamb
Contributor

alamb commented Feb 22, 2025

We are using DataFusion to query the Parquet files and are wondering whether the result of a query can be represented as a bit set of document positions (example below). Bit sets from the different engines can be intersected to identify the documents that meet the criteria. The resulting bit set can then be used to fetch the relevant documents from Parquet.

I think there are two parts to your question:

  1. Representing the results as a bitset: I think you would have to implement a custom "pivot"-type operation that takes row ids somehow and creates a bitset from them

  2. Fetching only relevant documents from parquet: the current reader is efficiently set up to fetch large contiguous blocks of values (RowSelection). @XiangpengHao has been thinking about a bitset representation for selected rows recently, so perhaps you can help contribute to making that happen in the parquet reader

@alchemist51
Author

Thanks for the response @alamb ! Couple of follow up questions:

There is a PR we are currently working on related to metadata columns (which could provide row ids perhaps)
#14057

Is there any way to get the row_id data for Parquet? Any suggestion to build it? @alamb @chenkovsky

Fetching only relevant documents from parquet: the current reader is efficiently set up to fetch large contiguous blocks of values (RowSelection). @XiangpengHao has been thinking about a bitset representation for selected rows recently, so perhaps you can help contribute to making that happen in the parquet reader

Will be happy to collaborate on it. @XiangpengHao any initial plan or POC you have done for it?

@chenkovsky
Contributor


@Arpit-Bandejiya

I created an example for getting row_id for parquet based on PR #14057. https://github.com/chenkovsky/datafusion/pull/3/files

@bharath-techie

Hi @chenkovsky ,
Thanks a ton for the quick POC on this. :)

The row ids seem to be specific to each batch rather than spanning the entire Parquet file - is my understanding correct?

The reason is that our use case will mainly benefit from Parquet-file-level row ids.

@chenkovsky
Contributor


@bharath-techie yes, this is just an example, not for a real situation. To meet the actual requirements I will need more time; I think I have to learn more about Parquet.

@bharath-techie

Thanks @chenkovsky for confirming.

We are new to DataFusion, but at a high level it looks like this feature will need deeper integration in the ParquetExec flow. We might also need changes to ParquetRecordBatchStream in arrow-rs, since it performs pruning, and because of that the DataFusion layer might not be able to figure out the actual row ids.

Experts can comment on this and suggest any other approaches they can think of.

@alchemist51
Author

alchemist51 commented Feb 25, 2025

Found this PR in arrow-rs: apache/arrow-rs#6624. @XiangpengHao I see the PR has been in draft for some time. Is there another way you are trying to do it? Could you please share it?

@XiangpengHao
Contributor

XiangpengHao commented Feb 25, 2025

Hi @Arpit-Bandejiya sorry I've been quite busy these days.

If you have a bitmask and want to read only the flagged rows from Parquet, you can directly use ParquetRecordBatchReaderBuilder::with_row_selection: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection

If you want DataFusion to produce a bitmask for other systems -- I'm not aware of an easy way to do this. But this sounds like a join use case, have you considered adding a row_id column to the parquet files? so that you can select the row_id as the output and join with other systems.

DataFusion has no control over the row ids read from Parquet, especially with filter pushdown, where rows are heavily filtered. Even changing ParquetRecordBatchStream, as @bharath-techie pointed out, is not enough: with concurrent reading it is possible but quite hard to determine the starting row_id of each stream.
In fact, the reader is free to emit rows in any order, as long as the result is logically equivalent.

@alchemist51
Author

alchemist51 commented Mar 7, 2025

Thanks @XiangpengHao for the response.

If you want DataFusion to produce a bitmask for other systems -- I'm not aware of an easy way to do this. But this sounds like a join use case, have you considered adding a row_id column to the parquet files? so that you can select the row_id as the output and join with other systems.

I'm trying to do it in the same fashion, using row_id; the problem arises with sparse results from the different engines. For example, if one engine's iterator is sparse while DataFusion returns almost every row, it becomes quite inefficient, because it essentially ends up loading all the data from DataFusion. The problem is aggravated further since we are now fetching one more column, the row_id, from the file. A few query engines like Lucene support an advance/seek operation, though I'm not sure whether that is possible with DataFusion or Parquet files in general.

Is there any way in DataFusion to get a separate iterator for each of the file partitions created when building the physical plan? I'm thinking of dividing the files into multiple partitions, which could help optimize the advance/seek operation. For example, if the next result lies in the next partition, we can close the ongoing stream and process the next partition to avoid reading all pages.

Sorry for the late response; I was occupied with a few other things.

@XiangpengHao
Contributor

Is there any way in DataFusion to get a separate iterator for each of the file partitions created when building the physical plan?

Not sure if this is what you want, but probably relevant: https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/opener.rs#L277

cc @mbutrovich who might also be interested in this discussion
