Hudi 0 x snapshot mdt #26
base: master
<artifactId>hudi-common</artifactId>
<version>${dep.hudi.version}</version>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
Will hbase dependencies still be required after integration with filegroup reader (even in tests)?
We still need that because the Hudi writer needs this dependency to write HFiles and MDT. If we want to get rid of this in test dependency, we'll need to add artifacts of generated Hudi tables with MDT enabled.
I'm just worried about pushback from Trino committers. Should be OK for tests for now.
This dependency is removed in 1.0.x, tests are passing.
May I know which execution path will actually require this dependency? I can craft a test case specifically for such scenarios.
<dependency>
    <!--Used to test execution in task executor after de-serializing-->
    <groupId>com.esotericsoftware</groupId>
    <artifactId>kryo</artifactId>
If it is only needed for tests, then let's just define it in test scope. Will it still be required after integrating with the filegroup reader?
Also, should we use kryo-shaded?
I'll check why I added this.
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSnapshotPageSource.java
BigDecimal convert(int precision, int scale, Object value)
{
    Schema schema = new Schema.Parser().parse(format("{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":%d,\"scale\":%d}", precision, scale));
A new Avro schema is created on each call. If many decimals with the same precision and scale are processed, consider caching the schema to avoid repeated parsing overhead.
will fix
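One possible shape for the caching suggested above, as a minimal sketch rather than the PR's actual fix. The class name `DecimalSchemaCache` and the injected `parser` function are hypothetical; the type parameter `S` stands in for `org.apache.avro.Schema` so that the sketch compiles without Avro on the classpath.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical helper: keeps one parsed schema per (precision, scale) pair.
 * S stands in for org.apache.avro.Schema; the parser is injected by the caller.
 */
final class DecimalSchemaCache<S>
{
    private final Map<Long, S> cache = new ConcurrentHashMap<>();
    private final Function<String, S> parser;

    DecimalSchemaCache(Function<String, S> parser)
    {
        this.parser = parser;
    }

    S get(int precision, int scale)
    {
        // Pack both ints into one long so the key needs no extra class.
        long key = ((long) precision << 32) | (scale & 0xFFFFFFFFL);
        // computeIfAbsent runs the parser at most once per distinct key.
        return cache.computeIfAbsent(key, ignored -> parser.apply(schemaJson(precision, scale)));
    }

    static String schemaJson(int precision, int scale)
    {
        // Same JSON as in the reviewed convert(...) method; Locale.ROOT keeps the digits ASCII.
        return String.format(Locale.ROOT,
                "{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":%d,\"scale\":%d}",
                precision, scale);
    }
}
```

With Avro available, the parser argument could be `json -> new Schema.Parser().parse(json)`, and `convert` would call `cache.get(precision, scale)` instead of parsing on every invocation.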
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/storage/TrinoStorageConfiguration.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/storage/HudiTrinoStorage.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/storage/HudiTrinoStorage.java
@Override
public List<StoragePathInfo> listFiles(StoragePath path) throws IOException {
    FileIterator fileIterator = fileSystem.listFiles(convertToLocation(path));
This looks like a duplicate of listDirectEntries. Consider refactoring the two listing methods into a single helper method that returns the list of StoragePathInfo objects, then call that helper from both listDirectEntries and listFiles.
Btw, are these methods actually called in some code path?
Will refactor.
They are called inside Hudi's file system view so they have to be implemented correctly.
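A rough sketch of the shared helper the reviewer describes, under stated assumptions. `ListingSketch` and `IoIterator` are hypothetical stand-ins; the iterator interface only mimics an iterator whose methods may throw `IOException`, and the converter represents whatever maps a raw file-system entry to a `StoragePathInfo`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ListingSketch
{
    /** Hypothetical stand-in for a file-system iterator whose methods may throw IOException. */
    interface IoIterator<E>
    {
        boolean hasNext() throws IOException;

        E next() throws IOException;
    }

    private ListingSketch() {}

    /** Shared helper: drains the iterator and converts every entry exactly once. */
    static <E, R> List<R> collect(IoIterator<E> entries, Function<? super E, ? extends R> converter)
            throws IOException
    {
        List<R> result = new ArrayList<>();
        while (entries.hasNext()) {
            result.add(converter.apply(entries.next()));
        }
        return result;
    }
}
```

Both listFiles and listDirectEntries could then obtain their iterator and delegate to the same helper, so the conversion logic lives in one place.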
@Override
public Page getNextPage() {
    if (logRecordMap == null) {
        try (HoodieMergedLogRecordScanner logScanner = getMergedLogRecordScanner(storage, basePath, split, readerSchema)) {
There is a bug here. Because the HoodieMergedLogRecordScanner is created in the try-with-resources header, logScanner.close() is called when the try block completes (see here), which in turn calls close on the log record map object. After the try block exits, logRecordMap is therefore no longer null, but all of its entries have been removed. In my testing this is exactly what happens, and it does not seem possible for the if (logRecord != null) branch below to ever be hit currently, so the snapshot table is effectively read as a read-optimized table: it returns only the stale, last-compacted data.
This can be fixed by instantiating the log scanner inside the try block instead of in the try-with-resources header, since the log record map is closed later anyway.
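The failure mode can be reproduced in isolation with a stand-in class. In the sketch below, `FakeScanner` is hypothetical and only imitates the behavior described in the comment above (a `close()` that also empties the map the scanner handed out); it is not Hudi's scanner.

```java
import java.util.HashMap;
import java.util.Map;

final class ScannerCloseSketch
{
    /** Hypothetical stand-in: close() clears the same map that getRecords() exposes. */
    static final class FakeScanner implements AutoCloseable
    {
        private final Map<String, String> records = new HashMap<>();

        FakeScanner()
        {
            records.put("key-1", "updated-value");
        }

        Map<String, String> getRecords()
        {
            return records;
        }

        @Override
        public void close()
        {
            records.clear();
        }
    }

    private ScannerCloseSketch() {}

    /** Pattern flagged in the review: close() runs on exit, so the returned map is empty. */
    static Map<String, String> loadWithTryWithResources()
    {
        try (FakeScanner scanner = new FakeScanner()) {
            return scanner.getRecords();
        }
    }

    /** Suggested shape: no automatic close here; the map is released later by its owner. */
    static Map<String, String> loadWithoutAutoClose()
    {
        FakeScanner scanner = new FakeScanner();
        return scanner.getRecords();
    }
}
```

The first method returns a non-null but empty map, which matches the symptom reported in the comment; the second keeps the entries until the map is closed explicitly.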
public void buildRecordInPage(PageBuilder pageBuilder, IndexedRecord record,
        Map<Integer, String> partitionValueMap, boolean SkipMetaColumns) {
    pageBuilder.declarePosition();
    int startChannel = SkipMetaColumns ? HOODIE_META_COLUMNS.size() : 0;
This causes a bug (also in the other implementation below) when Hudi meta columns are present in the table schema itself and are selected. The PageBuilder blocks will contain entries for these columns, but the columns are always skipped, so the block entries and type entries fall out of sync. That can cause index-out-of-bounds errors and/or type-mismatch errors that get swallowed. The proposed solution: instead of passing a boolean skipMetaColumns, have the HudiSnapshotPageSource caller pass an int with the number of meta columns to skip, which is the total number of meta columns minus the number of meta columns among the data columns.
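The proposed count could be computed along these lines. This is a sketch of the arithmetic only: `MetaColumnSkipSketch` and `metaColumnsToSkip` are hypothetical names, and the local `HOODIE_META_COLUMNS` list is assumed to mirror the constant of the same name used in the reviewed code.

```java
import java.util.List;

final class MetaColumnSkipSketch
{
    // Assumed to mirror Hudi's HOODIE_META_COLUMNS constant (five meta column names).
    static final List<String> HOODIE_META_COLUMNS = List.of(
            "_hoodie_commit_time",
            "_hoodie_commit_seqno",
            "_hoodie_record_key",
            "_hoodie_partition_path",
            "_hoodie_file_name");

    private MetaColumnSkipSketch() {}

    /** Total number of meta columns minus the meta columns that appear among the data columns. */
    static int metaColumnsToSkip(List<String> dataColumnNames)
    {
        long selectedMetaColumns = dataColumnNames.stream()
                .distinct() // guard against a column being listed twice
                .filter(HOODIE_META_COLUMNS::contains)
                .count();
        return HOODIE_META_COLUMNS.size() - (int) selectedMetaColumns;
    }
}
```

The caller would pass this int where the boolean is passed today, and buildRecordInPage would use it directly as startChannel.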
Description
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: