-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… #6089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
… batch retrieve is enabled
6171f2e
to
2a93861
Compare
dropped the batched test added earlier as |
enableBitVector, enableKll); | ||
} | ||
}); | ||
return columnStatisticsObjForPartitionsBatch(catName, dbName, tableName, partNames, inputColNames, engine, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we just aggrStatsUseJava directly here and remove the columnStatisticsObjForPartitionsBatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner); | ||
} | ||
|
||
private List<ColumnStatisticsObj> aggrStatsUseDB(String catName, String dbName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for removing the aggrStatsUseDB
, if we enable the batch retrieval, then the stats might not be aggregated per column. If we don't, we might hit the limitation of maximum parameters for PreparedStatement for some dbs.
Let's see how the test going
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do i get it right. we are moving stats aggregation from backend db to Java? what would be the impact on performance?
someone was working on this optimization and now we drop it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aggrStatsUseDB
can only be used if hive.stats.fetch.bitvector and metastore.stats.fetch.kll are false
. Some tests enable them via set
command in q files or entire tests suits via hive-site
but the default is false. Aggregating the stats at the backend db is usually faster then doing it in java so we can lose performance with this patch in some cases.
- Could you please investigate how to aggregate the results of subsequent
aggrStatsUseDB
calls? - Seems that we have test coverage for this: TestObjectStore.java.testAggrStatsUseDB Should it be removed along with
aggrStatsUseDB
?
If we don't, we might hit the limitation of maximum parameters for PreparedStatement for some dbs.
Let's see how the test going
Do we have tests using other dbs than derby and the postgres image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kasakrisz I have added aggregation for aggrStatsUseDB
earlier upto the commit b566816 but on later suggestion I removed it
|
+1. cc @kasakrisz @deniskuzZ @nrg4878 @saihemanth-cloudera @zhangbutao in case if have any other ideas. |
Hi @ramitg254, Would the tpcds tests fail without this PR? BTW, What is the size of the tpcds test dataset? |
Hi @zhangbutao, without this pr tpcds tests will fail whenever and i think dataset size would be 30tb as I was using |
… batch retrieve is enabled
What changes were proposed in this pull request?
currently flow of query when
hive.metastore.direct.sql.batch.size>0
is:aggrColStatsForPartitions -> columnStatisticsObjForPartitions -> columnStatisticsObjForPartitionsBatch -> aggrStatsUseJava -> getPartitionStats
in this case
columnStatisticsObjForPartitions
is also apply batching on partitions list which is not merged further and also this batching for partitions list is not needed asgetPartitionStats
also applies batching on partitions list which gets merged after batching and get returned viacolumnStatisticsObjForPartitions
eventually preventing any redundant entry for a given column NameWhy are the changes needed?
currently in case
hive.metastore.direct.sql.batch.size>0
thenList<ColumnStatisticsObj>
returned fromcolumnStatisticsObjForPartitions
consists of multiple entries with similar column name on which merging is not performed resulting in wrong stats.Does this PR introduce any user-facing change?
No
How was this patch tested?
by running tpcds tests locally with setting
hive.metastore.direct.sql.batch.size
as 1000 inHiveConf.java
andMetastoreConf.java
.