HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… #6089

ramitg254 · 2025-09-20T11:56:31Z

… batch retrieve is enabled

What changes were proposed in this pull request?

Currently, when hive.metastore.direct.sql.batch.size > 0, the query flow is:
aggrColStatsForPartitions -> columnStatisticsObjForPartitions -> columnStatisticsObjForPartitionsBatch -> aggrStatsUseJava -> getPartitionStats
In this flow, columnStatisticsObjForPartitions also batches the partition list, and the per-batch results are never merged. This batching is also unnecessary: getPartitionStats already batches the partition list and merges the results afterwards, so the list eventually returned via columnStatisticsObjForPartitions contains no redundant entries for a given column name.

Why are the changes needed?

Currently, when hive.metastore.direct.sql.batch.size > 0, the List<ColumnStatisticsObj> returned from columnStatisticsObjForPartitions contains multiple entries with the same column name that are never merged, which results in wrong stats.
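To illustrate the problem described above, here is a minimal sketch of how concatenating per-batch results yields several entries for one column. It is a simplified model, not the MetaStoreDirectSql code: the class BatchedStatsSketch, the ColStat stand-in for ColumnStatisticsObj, and both helper methods are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the reported behaviour; not Hive code. */
final class BatchedStatsSketch {

  /** Stand-in for ColumnStatisticsObj: a column name plus one aggregated counter. */
  static final class ColStat {
    final String colName;
    final long numNulls;

    ColStat(String colName, long numNulls) {
      this.colName = colName;
      this.numNulls = numNulls;
    }
  }

  /** Aggregates one batch of partitions into a single entry for the column. */
  static ColStat aggregateBatch(String colName, List<Long> nullsPerPartition) {
    long sum = 0;
    for (long n : nullsPerPartition) {
      sum += n;
    }
    return new ColStat(colName, sum);
  }

  /** Mimics the problem: per-batch results are concatenated and never merged. */
  static List<ColStat> concatenatePerBatch(String colName, List<Long> nullsPerPartition,
      int batchSize) {
    List<ColStat> result = new ArrayList<>();
    for (int i = 0; i < nullsPerPartition.size(); i += batchSize) {
      int end = Math.min(i + batchSize, nullsPerPartition.size());
      result.add(aggregateBatch(colName, nullsPerPartition.subList(i, end)));
    }
    return result;
  }

  public static void main(String[] args) {
    List<Long> nulls = List.of(1L, 2L, 3L, 4L, 5L); // five partitions of column c1
    // Batch size 2 gives three batches, hence three entries for the same column.
    System.out.println(concatenatePerBatch("c1", nulls, 2).size());  // prints 3
    // A batch size that covers all partitions gives a single entry.
    System.out.println(concatenatePerBatch("c1", nulls, 10).size()); // prints 1
  }
}
```

In this model the duplicates only appear once the number of partitions exceeds the batch size, which matches the condition reported later in this thread.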

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running tpcds tests locally with hive.metastore.direct.sql.batch.size set to 1000 in HiveConf.java and MetastoreConf.java.

ramitg254 · 2025-09-20T18:08:52Z

Dropped the batched test added earlier, because set hive.metastore.direct.sql.batch.size=1000 did not change the value in MetastoreConf during the actual run of the q test.
An individual tpcds test can still be checked with:
mvn test -pl itests/qtest -Pitests -Dtest=TestTezTPCDS30TBPerfCliDriver -Dqfile=query16.q -Dhive.metastore.direct.sql.batch.size=1000

dengzhhu653 · 2025-09-24T09:54:22Z

...tore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java

-                enableBitVector, enableKll);
-          }
-        });
+        return columnStatisticsObjForPartitionsBatch(catName, dbName, tableName, partNames, inputColNames, engine,


nit: can we just call aggrStatsUseJava directly here and remove columnStatisticsObjForPartitionsBatch?

dengzhhu653 · 2025-09-24T09:59:46Z

...tore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java

        areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
  }

-  private List<ColumnStatisticsObj> aggrStatsUseDB(String catName, String dbName,


+1 for removing aggrStatsUseDB. If we enable batch retrieval, the stats might not be aggregated per column. If we don't, we might hit the limit on the maximum number of parameters for a PreparedStatement in some DBs.

Let's see how the tests go.
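For context on the parameter-limit concern, the following is a rough sketch of how a list of partition names can be split so that each statement binds at most batchSize parameters. The class InClauseBatchingSketch, its methods, and the table and column names in the printed SQL are all made up for illustration; treating a non-positive batch size as "no batching" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper; not part of the metastore code. */
final class InClauseBatchingSketch {

  /** Splits the input into consecutive sublists of at most batchSize elements. */
  static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      // Assumption for this sketch: a non-positive size means "no batching".
      return Collections.singletonList(items);
    }
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < items.size(); i += batchSize) {
      batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
    }
    return batches;
  }

  /** Builds "?,?,...,?" with one placeholder per bound value. */
  static String placeholders(int count) {
    return String.join(",", Collections.nCopies(count, "?"));
  }

  public static void main(String[] args) {
    List<String> partNames = List.of("p=1", "p=2", "p=3", "p=4", "p=5");
    for (List<String> batch : toBatches(partNames, 2)) {
      // Table and column names are placeholders, not the real metastore schema.
      System.out.println("SELECT ... FROM some_stats_table WHERE part_name IN ("
          + placeholders(batch.size()) + ")");
    }
  }
}
```

Each statement produced this way returns its own partial aggregate, which is why the per-batch results still need to be combined per column afterwards.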

Do I get it right: we are moving stats aggregation from the backend DB to Java? What would be the impact on performance?
Someone was working on this optimization, and now we drop it?

aggrStatsUseDB can only be used if hive.stats.fetch.bitvector and metastore.stats.fetch.kll are both false. Some tests enable them via set commands in q files, or for entire test suites via hive-site, but the default is false. Aggregating the stats in the backend DB is usually faster than doing it in Java, so we can lose performance with this patch in some cases.

@ramitg254

Could you please investigate how to aggregate the results of subsequent aggrStatsUseDB calls?

It seems we have test coverage for this: TestObjectStore.java.testAggrStatsUseDB. Should it be removed along with aggrStatsUseDB?

@dengzhhu653

> If we don't, we might hit the limit on the maximum number of parameters for a PreparedStatement in some DBs.
>
> Let's see how the tests go.

Do we have tests using DBs other than Derby and the Postgres image?

@kasakrisz I added aggregation for aggrStatsUseDB earlier, up to commit b566816, but removed it following a later suggestion.
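As a rough illustration of what combining the results of successive batched calls could look like, the sketch below merges partial per-column aggregates by summing null counts and taking the min and max of the bounds. This is not the code from commit b566816; PartialAggrMergeSketch and PartialStat are hypothetical names, and statistics that cannot be combined this way (for example the number of distinct values) are deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of merging per-batch aggregates; not Hive code. */
final class PartialAggrMergeSketch {

  /** Per-batch aggregate for one column, limited to stats that merge trivially. */
  static final class PartialStat {
    final String colName;
    final long numNulls;
    final long low;
    final long high;

    PartialStat(String colName, long numNulls, long low, long high) {
      this.colName = colName;
      this.numNulls = numNulls;
      this.low = low;
      this.high = high;
    }

    /** Combines two partial aggregates of the same column. */
    PartialStat mergeWith(PartialStat other) {
      return new PartialStat(colName, numNulls + other.numNulls,
          Math.min(low, other.low), Math.max(high, other.high));
    }
  }

  /** Collapses the concatenated per-batch results into one entry per column name. */
  static List<PartialStat> mergeByColumn(List<PartialStat> partials) {
    Map<String, PartialStat> merged = new LinkedHashMap<>();
    for (PartialStat p : partials) {
      merged.merge(p.colName, p, PartialStat::mergeWith);
    }
    return new ArrayList<>(merged.values());
  }

  public static void main(String[] args) {
    List<PartialStat> perBatch = List.of(
        new PartialStat("c1", 3, 10, 20),  // batch 1
        new PartialStat("c2", 1, 5, 6),    // batch 1
        new PartialStat("c1", 4, 2, 15));  // batch 2
    for (PartialStat s : mergeByColumn(perBatch)) {
      System.out.println(s.colName + " nulls=" + s.numNulls
          + " low=" + s.low + " high=" + s.high);
    }
    // prints:
    // c1 nulls=7 low=2 high=20
    // c2 nulls=1 low=5 high=6
  }
}
```

Distinct-value counts are the hard part of such a merge, since per-batch NDVs are not additive; that is one reason a Java-side aggregation over per-partition stats may be simpler than merging DB-side partial results.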

sonarqubecloud · 2025-09-25T06:46:00Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

dengzhhu653 · 2025-09-25T09:23:40Z

+1. cc @kasakrisz @deniskuzZ @nrg4878 @saihemanth-cloudera @zhangbutao in case you have any other ideas.

zhangbutao · 2025-09-30T09:56:46Z

> How was this patch tested?
>
> By running tpcds tests locally with hive.metastore.direct.sql.batch.size set to 1000 in HiveConf.java and MetastoreConf.java.

Hi @ramitg254, would the tpcds tests fail without this PR? BTW, what is the size of the tpcds test dataset?

ramitg254 · 2025-09-30T10:14:58Z

> > How was this patch tested?
> >
> > By running tpcds tests locally with hive.metastore.direct.sql.batch.size set to 1000 in HiveConf.java and MetastoreConf.java.
>
> Hi @ramitg254, would the tpcds tests fail without this PR? BTW, what is the size of the tpcds test dataset?

Hi @zhangbutao, without this PR the tpcds tests fail whenever hive.metastore.direct.sql.batch.size > 0 and the number of partitions is greater than the batch size.
There were around 55 tpcds test failures when the property was set to 1000; one such example is query16.q.

I think the dataset size is 30 TB, as I was using -Dtest=TestTezTPCDS30TBPerfCliDriver, which picks it up from a Docker container.

asf-ci-hive added the tests pending label Sep 20, 2025

HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql…

2a93861

… batch retrieve is enabled

ramitg254 force-pushed the HIVE-29203 branch from 6171f2e to 2a93861 Compare September 20, 2025 11:58

ramitg254 changed the title ~~[WIP]HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql…~~ HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… Sep 20, 2025

asf-ci-hive added tests failed tests pending tests passed and removed tests pending tests failed labels Sep 20, 2025

dropping batched test

4e9e1b9

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Sep 20, 2025

altered behaviour for batches in case kll and bit vector both are dis…

b566816

…abled

asf-ci-hive added tests pending and removed tests passed labels Sep 24, 2025

ramitg254 added 2 commits September 24, 2025 14:09

Revert altered behaviour for batches in case kll and bit vector both …

fdfa6d5

…are disabled

removal of aggrStatsUseDB

f7fea52

asf-ci-hive added tests failed and removed tests pending labels Sep 24, 2025

dengzhhu653 reviewed Sep 24, 2025

View reviewed changes

removal of columnStatisticsObjForPartitionsBatch

288f143

asf-ci-hive added tests pending and removed tests failed labels Sep 25, 2025

asf-ci-hive removed the tests pending label Sep 25, 2025

asf-ci-hive added the tests passed label Sep 25, 2025

dengzhhu653 Sep 24, 2025
ramitg254 Sep 25, 2025
dengzhhu653 Sep 24, 2025
deniskuzZ Sep 30, 2025
kasakrisz Oct 3, 2025
ramitg254 Oct 3, 2025
