Description
The current model used in DataFusion for measuring memory consumption assumes that the different entities that can consume memory (accumulators, joins, etc.) are the ones owning the data. This does not exactly match how memory is managed in arrow-rs, where the underlying data buffers might be referenced more than once in different parts of DataFusion.
For example: ArrayAggAccumulator accumulates data by storing ArrayRefs, so even if it's accumulating and retaining data, the actual memory was allocated before ArrayAggAccumulator came into play, and the accumulator is only adding a reference to it:
values: Vec<ArrayRef>,
For that case, one could argue ArrayAggAccumulator is not really consuming any memory, as it's not performing new allocations, and the underlying data was there potentially even before the ArrayAggAccumulator was instantiated.
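The mismatch can be illustrated with a minimal, std-only sketch. It does not use arrow-rs: `SliceRef`, `RefAccumulator` and `accumulate_slices` are hypothetical names, and an `Arc<Vec<u8>>` is assumed as a stand-in for a shared Arrow buffer. It only shows that retaining references allocates nothing new, while the number reported depends entirely on how each reference is measured.

```rust
use std::sync::Arc;

/// Stand-in for an `ArrayRef`: a view over a shared, reference-counted buffer.
struct SliceRef {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SliceRef {
    /// The bytes this view actually covers.
    fn bytes(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Hypothetical accumulator that only retains references, like `values: Vec<ArrayRef>`.
struct RefAccumulator {
    values: Vec<SliceRef>,
}

impl RefAccumulator {
    /// Counts the whole backing buffer once per retained reference.
    fn reported_by_buffer(&self) -> usize {
        self.values.iter().map(|v| v.buffer.len()).sum()
    }

    /// Counts only the bytes each view covers.
    fn reported_by_slice(&self) -> usize {
        self.values.iter().map(|v| v.bytes().len()).sum()
    }
}

/// Retains `n` non-overlapping views of `len` bytes each over the same batch.
fn accumulate_slices(batch: &Arc<Vec<u8>>, n: usize, len: usize) -> RefAccumulator {
    let values = (0..n)
        .map(|i| SliceRef { buffer: Arc::clone(batch), offset: i * len, len })
        .collect();
    RefAccumulator { values }
}

fn main() {
    let batch = Arc::new(vec![0u8; 1024]); // the only real allocation
    let acc = accumulate_slices(&batch, 4, 16);
    println!("references to the batch: {}", Arc::strong_count(&batch));
    println!("reported by buffer: {}", acc.reported_by_buffer());
    println!("reported by slice: {}", acc.reported_by_slice());
}
```

With a single 1024-byte batch and four 16-byte views, the buffer-based count reports 4096 bytes and the slice-based count reports 64 bytes, while only 1024 bytes were ever allocated.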
There have been several attempts in the past to address memory accounting issues in the DataFusion code:
- fix: overcounting of memory in first/last. #15924
- Fix array_agg memory over use #16346
- fix: Incorrect memory accounting in array_agg function #16519
- Address memory over-accounting in array_agg #16816
Some of them involve copying/compacting just the necessary slice of data from the underlying buffer (ScalarValue::compact) so that it's actually owned by the consumer, but in certain cases that can hurt performance.
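As a rough illustration of the compaction trade-off, here is a std-only sketch. It is not the actual `ScalarValue::compact` implementation: `SliceRef` and its `compact` method are hypothetical, and an `Arc<Vec<u8>>` is assumed as a stand-in for a shared Arrow buffer. Copying the slice costs an allocation and a memcpy, but afterwards the consumer owns exactly what it reports and no longer keeps the large buffer alive.

```rust
use std::sync::Arc;

/// Stand-in for a sliced array: a view over a shared, reference-counted buffer.
struct SliceRef {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SliceRef {
    /// The bytes this view actually covers.
    fn bytes(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// Copies only the viewed bytes into a fresh buffer owned by the result.
    fn compact(&self) -> SliceRef {
        let owned = self.bytes().to_vec(); // the extra allocation + copy
        let len = owned.len();
        SliceRef { buffer: Arc::new(owned), offset: 0, len }
    }
}

fn main() {
    let batch = Arc::new(vec![7u8; 1024]);
    let observer = Arc::downgrade(&batch); // lets us see when the batch is freed

    let view = SliceRef { buffer: Arc::clone(&batch), offset: 100, len: 8 };
    let compacted = view.compact();

    // Once the view and the original handle are gone, the 1024-byte buffer
    // is released even though the compacted copy is still retained.
    drop(view);
    drop(batch);
    assert!(observer.upgrade().is_none());
    assert_eq!(compacted.buffer.len(), 8);
}
```

In this model the compacted value can be measured by its whole buffer without over-counting, at the price of copying every retained slice.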
The main point of this issue is to start a conversation about what the ideal approach to memory counting could be:
- Is copying/compacting accumulated data and calling get_array_memory_size(), or storing array references and calling get_slice_memory_size(), acceptable for measuring memory consumption?
- Should the memory counting model in DataFusion be expanded so that it takes into account not just memory consumed, but also memory retained because of references to shared buffers?
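To make the second question more concrete, one possible shape for "retained" accounting is to count each shared buffer once, keyed by its identity. The sketch below is std-only and purely illustrative: `naive_bytes` and `retained_bytes` are hypothetical helpers, an `Arc<Vec<u8>>` is assumed as a stand-in for a shared Arrow buffer, and nothing here reflects an existing DataFusion or arrow-rs API.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Sums the buffer size once per reference, so shared buffers are counted repeatedly.
fn naive_bytes(values: &[Arc<Vec<u8>>]) -> usize {
    values.iter().map(|b| b.len()).sum()
}

/// Sums the buffer size once per distinct buffer, using the allocation
/// address as the buffer's identity.
fn retained_bytes(values: &[Arc<Vec<u8>>]) -> usize {
    let mut seen = HashSet::new();
    let mut total = 0;
    for buf in values {
        let id = Arc::as_ptr(buf) as usize;
        if seen.insert(id) {
            total += buf.len();
        }
    }
    total
}

fn main() {
    let shared = Arc::new(vec![0u8; 1024]);
    let other = Arc::new(vec![0u8; 256]);
    // Three references to `shared`, one to `other`.
    let values = vec![
        Arc::clone(&shared),
        Arc::clone(&shared),
        Arc::clone(&other),
        Arc::clone(&shared),
    ];
    println!("naive: {}", naive_bytes(&values));
    println!("retained: {}", retained_bytes(&values));
}
```

Deduplicating within one consumer is the easy part. The open question raised above is how to attribute a buffer that several consumers retain at the same time, which this sketch does not attempt to answer.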