ramitg254
Contributor

@ramitg254 ramitg254 commented Sep 20, 2025

… batch retrieve is enabled

What changes were proposed in this pull request?

Currently, when hive.metastore.direct.sql.batch.size > 0, the query flows as follows:
aggrColStatsForPartitions -> columnStatisticsObjForPartitions -> columnStatisticsObjForPartitionsBatch -> aggrStatsUseJava -> getPartitionStats
In this case columnStatisticsObjForPartitions also batches the partition list, and the per-batch results are never merged. This batching is not needed: getPartitionStats already batches the partition list and merges the results afterwards, so the list eventually returned via columnStatisticsObjForPartitions has no redundant entries for a given column name.

Why are the changes needed?

Currently, when hive.metastore.direct.sql.batch.size > 0, the List<ColumnStatisticsObj> returned from columnStatisticsObjForPartitions contains multiple entries with the same column name that are never merged, resulting in wrong stats.
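
To make the symptom concrete, here is a minimal sketch (not the Hive code). The class and method names are hypothetical, and each column is reduced to its name only. It assumes each batch of partitions yields one aggregated entry per column and that the per-batch lists are concatenated without merging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnmergedBatchSketch {
  // Returns the column names that come back when every partition batch is
  // aggregated on its own and the per-batch lists are simply concatenated.
  // Hypothetical helper; a non-positive batchSize is treated as "no batching".
  public static List<String> unmergedEntries(List<String> colNames, int partitionCount, int batchSize) {
    int effectiveBatch = batchSize > 0 ? batchSize : Math.max(partitionCount, 1);
    List<String> entries = new ArrayList<>();
    for (int start = 0; start < partitionCount; start += effectiveBatch) {
      // One aggregated entry per column for this batch of partitions.
      entries.addAll(colNames);
    }
    return entries;
  }

  public static void main(String[] args) {
    List<String> cols = Arrays.asList("c1", "c2");
    System.out.println(unmergedEntries(cols, 800, 1000));  // [c1, c2]
    System.out.println(unmergedEntries(cols, 2500, 1000)); // [c1, c2, c1, c2, c1, c2]
  }
}
```

With 800 partitions and a batch size of 1000 there is a single batch, so each column appears once. With 2500 partitions there are three batches, so each column appears three times unless the entries are merged afterwards.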

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the tpcds tests locally with hive.metastore.direct.sql.batch.size set to 1000 in HiveConf.java and MetastoreConf.java.

@ramitg254 ramitg254 changed the title [WIP]HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… Sep 20, 2025
@ramitg254
Contributor Author

ramitg254 commented Sep 20, 2025

Dropped the batched test added earlier, because set hive.metastore.direct.sql.batch.size=1000 did not change the value in MetastoreConf during the actual q test run.
An individual tpcds test can still be checked with:
mvn test -pl itests/qtest -Pitests -Dtest=TestTezTPCDS30TBPerfCliDriver -Dqfile=query16.q -Dhive.metastore.direct.sql.batch.size=1000

enableBitVector, enableKll);
}
});
return columnStatisticsObjForPartitionsBatch(catName, dbName, tableName, partNames, inputColNames, engine,
Member

nit: can we just call aggrStatsUseJava directly here and remove columnStatisticsObjForPartitionsBatch?

Contributor Author

done

areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}

private List<ColumnStatisticsObj> aggrStatsUseDB(String catName, String dbName,
Member

+1 for removing aggrStatsUseDB. If we enable batch retrieval, the stats might not be aggregated per column. If we don't, we might hit the limit on the maximum number of parameters for a PreparedStatement on some DBs.

Let's see how the tests go.
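
The parameter-limit concern can be sketched as follows. This is an illustration only, not the metastore's direct SQL code: the class and method names are hypothetical, and treating a non-positive batch size as "batching disabled" is an assumption modeled on the batch.size > 0 condition in the PR description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InClauseBatchSketch {
  // Splits partition names into consecutive batches of at most batchSize names,
  // so that no single statement needs more than batchSize bind parameters.
  public static List<List<String>> toBatches(List<String> partNames, int batchSize) {
    if (batchSize <= 0) {
      // Assumption: batching disabled, everything goes into one statement.
      return Collections.singletonList(partNames);
    }
    List<List<String>> batches = new ArrayList<>();
    for (int i = 0; i < partNames.size(); i += batchSize) {
      batches.add(partNames.subList(i, Math.min(i + batchSize, partNames.size())));
    }
    return batches;
  }

  // Builds the "?, ?, ?" placeholder list for an IN (...) clause with n parameters.
  public static String placeholders(int n) {
    return String.join(", ", Collections.nCopies(n, "?"));
  }
}
```

Without batching, the number of placeholders grows with the number of partitions. With batching, each statement stays bounded, but every batch returns its own result that still has to be merged per column.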

Member

@deniskuzZ deniskuzZ Sep 30, 2025

Do I get it right: we are moving stats aggregation from the backend DB to Java? What would be the impact on performance?
Someone was working on this optimization and now we drop it?

Contributor

aggrStatsUseDB can only be used if hive.stats.fetch.bitvector and metastore.stats.fetch.kll are false. Some tests enable them via the set command in q files, or for entire test suites via hive-site, but the default is false. Aggregating the stats in the backend DB is usually faster than doing it in Java, so we can lose performance with this patch in some cases.

@ramitg254

  1. Could you please investigate how to aggregate the results of subsequent aggrStatsUseDB calls?
  2. It seems we have test coverage for this: TestObjectStore.java.testAggrStatsUseDB. Should it be removed along with aggrStatsUseDB?
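
For point 1, one possible shape of such a merge is sketched below. It is only an illustration under stated assumptions: LongAgg is a hypothetical stand-in for a per-batch aggregate of a long column, not a Hive class, and only low value, high value, and null count are covered. Distinct-value estimates are left out on purpose, because they cannot be combined exactly from per-batch aggregates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerBatchAggMergeSketch {
  // Hypothetical per-batch aggregate for one long column.
  public static final class LongAgg {
    public final String colName;
    public final long low;
    public final long high;
    public final long numNulls;

    public LongAgg(String colName, long low, long high, long numNulls) {
      this.colName = colName;
      this.low = low;
      this.high = high;
      this.numNulls = numNulls;
    }
  }

  // Merges per-batch aggregates into one entry per column name:
  // minimum of lows, maximum of highs, sum of null counts.
  public static List<LongAgg> merge(List<LongAgg> perBatch) {
    Map<String, LongAgg> byCol = new LinkedHashMap<>();
    for (LongAgg a : perBatch) {
      byCol.merge(a.colName, a, (x, y) -> new LongAgg(
          x.colName,
          Math.min(x.low, y.low),
          Math.max(x.high, y.high),
          x.numNulls + y.numNulls));
    }
    return new ArrayList<>(byCol.values());
  }
}
```

The LinkedHashMap keeps the columns in the order in which they were first seen, so the merged list is stable across runs.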

@dengzhhu653

> If we don't, we might hit the limit on the maximum number of parameters for a PreparedStatement on some DBs.
>
> Let's see how the tests go.

Do we have tests using DBs other than Derby and the Postgres image?

Contributor Author

@kasakrisz I had added aggregation for aggrStatsUseDB earlier, up to commit b566816, but removed it following a later suggestion.

@dengzhhu653
Member

+1. cc @kasakrisz @deniskuzZ @nrg4878 @saihemanth-cloudera @zhangbutao in case you have any other ideas.

@zhangbutao
Contributor

> How was this patch tested?
>
> By running the tpcds tests locally with hive.metastore.direct.sql.batch.size set to 1000 in HiveConf.java and MetastoreConf.java.

Hi @ramitg254, would the tpcds tests fail without this PR? BTW, what is the size of the tpcds test dataset?

@ramitg254
Contributor Author

ramitg254 commented Sep 30, 2025

> > How was this patch tested?
> >
> > By running the tpcds tests locally with hive.metastore.direct.sql.batch.size set to 1000 in HiveConf.java and MetastoreConf.java.
>
> Hi @ramitg254, would the tpcds tests fail without this PR? BTW, what is the size of the tpcds test dataset?

Hi @zhangbutao, without this PR the tpcds tests fail whenever hive.metastore.direct.sql.batch.size > 0 and the number of partitions is greater than the batch size.
There were around 55 tpcds test failures when the property was set to 1000; one such example is query16.q.

I think the dataset size would be 30 TB, as I was using -Dtest=TestTezTPCDS30TBPerfCliDriver and it picks the data up from a Docker container.
