
Conversation

@codope (Owner) commented Feb 25, 2025

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

<artifactId>hudi-common</artifactId>
<version>${dep.hudi.version}</version>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
@codope (Owner Author):

Will hbase dependencies still be required after integration with filegroup reader (even in tests)?

Collaborator:

We still need it because the Hudi writer needs this dependency to write HFiles and the MDT. If we want to get rid of it even as a test dependency, we'll need to add artifacts of generated Hudi tables with MDT enabled.

@codope (Owner Author):

I'm just worried about pushback from Trino committers. Should be OK for tests for now.

Collaborator:

This dependency is removed in 1.0.x, and tests are passing.
May I know which execution path will actually require this dependency? I can craft a test case specifically for such scenarios.

<dependency>
<!--Used to test execution in task executor after de-serializing-->
<groupId>com.esotericsoftware</groupId>
<artifactId>kryo</artifactId>
@codope (Owner Author):

If it is only needed for tests, then let's just define it in test scope? Will it still be required after integrating with the filegroup reader?
Also, should we use kryo-shaded?

Collaborator:

I'll check why I added this.


BigDecimal convert(int precision, int scale, Object value)
{
Schema schema = new Schema.Parser().parse(format("{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":%d,\"scale\":%d}", precision, scale));
@codope (Owner Author):

A new Avro schema is created on each call. If many decimals with the same precision and scale are processed, consider caching the schema to avoid repeated parsing overhead; a sketch follows below.
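One possible shape for that cache, as a minimal sketch: a connector-local static holder keyed by (precision, scale). The DecimalSchemas class and the long key packing are illustrative, not from this PR; only the parse call is taken from the snippet above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.avro.Schema;

import static java.lang.String.format;

final class DecimalSchemas
{
    // Cache parsed Avro decimal schemas keyed by (precision, scale).
    // Packing both ints into one long avoids allocating a key object per lookup.
    private static final Map<Long, Schema> CACHE = new ConcurrentHashMap<>();

    private DecimalSchemas() {}

    static Schema decimalSchema(int precision, int scale)
    {
        long key = ((long) precision << 32) | (scale & 0xFFFFFFFFL);
        return CACHE.computeIfAbsent(key, ignored -> new Schema.Parser().parse(
                format("{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":%d,\"scale\":%d}", precision, scale)));
    }
}
```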

Collaborator:

will fix


@Override
public List<StoragePathInfo> listFiles(StoragePath path) throws IOException {
FileIterator fileIterator = fileSystem.listFiles(convertToLocation(path));
@codope (Owner Author):

This looks like a duplicate of listDirectEntries. Consider refactoring the two listing methods into a single helper that returns the list of StoragePathInfo objects, then calling that helper from both listDirectEntries and listFiles; a sketch follows below.

btw, are these methods actually called in some path?
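A rough sketch of such a helper, assuming Trino's FileEntry accessors (location(), length(), lastModified()) and a StoragePathInfo constructor of the shape (path, length, isDirectory, blockReplication, blockSize, modificationTime); the exact Hudi constructor should be verified against the version in use.

```java
import java.io.IOException;
import java.util.List;

import com.google.common.collect.ImmutableList;
import io.trino.filesystem.FileEntry;
import io.trino.filesystem.FileIterator;
import org.apache.hudi.storage.StoragePath;
import org.apache.hudi.storage.StoragePathInfo;

private List<StoragePathInfo> toStoragePathInfos(FileIterator fileIterator)
        throws IOException
{
    ImmutableList.Builder<StoragePathInfo> result = ImmutableList.builder();
    while (fileIterator.hasNext()) {
        FileEntry entry = fileIterator.next();
        // Assumed StoragePathInfo constructor -- verify against the
        // org.apache.hudi.storage.StoragePathInfo in the Hudi version used.
        result.add(new StoragePathInfo(
                new StoragePath(entry.location().toString()),
                entry.length(),
                false, // listFiles yields only files, never directories
                (short) 0,
                0,
                entry.lastModified().toEpochMilli()));
    }
    return result.build();
}

@Override
public List<StoragePathInfo> listFiles(StoragePath path)
        throws IOException
{
    return toStoragePathInfos(fileSystem.listFiles(convertToLocation(path)));
}
```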

Collaborator:

Will refactor.

They are called inside Hudi's file system view, so they have to be implemented correctly.

@Override
public Page getNextPage() {
if (logRecordMap == null) {
try (HoodieMergedLogRecordScanner logScanner = getMergedLogRecordScanner(storage, basePath, split, readerSchema)) {
@mxmarkovics commented Aug 15, 2025:

There is a bug here: since the HoodieMergedLogRecordScanner is created in the try-with-resources header, logScanner.close() is called when the try block completes (see here), which also calls close on the log record map object. After the try block exits, logRecordMap is no longer null, but all of its entries have been removed. In my testing this is exactly what happens, and it doesn't seem possible for the if (logRecord != null) branch below to ever be hit currently, so the snapshot table is effectively read as a read-optimized table, returning only the stale/last-compacted data.

This can be fixed by moving the log scanner instantiation inside the try block instead of the try-with-resources header, as the log record map is closed later anyway; a sketch follows below.
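A minimal sketch of the proposed fix, reusing the names from the snippet above (storage, basePath, split, readerSchema, logRecordMap are this PR's fields) and assuming logRecordMap is populated from the scanner's getRecords(); error handling is simplified.

```java
if (logRecordMap == null) {
    try {
        // Instantiate inside the try body, NOT in try-with-resources:
        // logScanner.close() would also close -- and empty -- the merged
        // record map it hands out, leaving logRecordMap non-null but empty.
        HoodieMergedLogRecordScanner logScanner =
                getMergedLogRecordScanner(storage, basePath, split, readerSchema);
        logRecordMap = logScanner.getRecords();
        // logRecordMap is closed explicitly later, once this page source is done.
    }
    catch (Exception e) {
        throw new RuntimeException(e);
    }
}
```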

public void buildRecordInPage(PageBuilder pageBuilder, IndexedRecord record,
Map<Integer, String> partitionValueMap, boolean SkipMetaColumns) {
pageBuilder.declarePosition();
int startChannel = SkipMetaColumns ? HOODIE_META_COLUMNS.size() : 0;
@mxmarkovics commented Aug 15, 2025:

This causes a bug (also in the other implementation below) when there are hudi meta columns present in the table schema itself and are selected as the pagebuilder blocks will contain entries for these columns, however they will always be skipped which can cause index out of bounds errors and/or type mismatch errors that get swallowed due to the block entries and type entries being out of sync. The proposed solution is to instead of passing in a boolean of whether or not to skipMetaColumns pass in an int from HudiSnapshotPageSource caller which is the appropriate number of meta columns to be skipped, which is just the total number of meta columns that exist - number of meta columns in the data columns.
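A minimal sketch of that change, keeping the method shape from the snippet above; metaColumnsInDataColumns is a hypothetical count of Hudi meta columns that appear among the selected data columns, not a name from this PR.

```java
// Caller side (e.g. in HudiSnapshotPageSource): skip only the meta columns
// that are NOT part of the selected data columns.
int metaColumnsToSkip = HOODIE_META_COLUMNS.size() - metaColumnsInDataColumns;
buildRecordInPage(pageBuilder, record, partitionValueMap, metaColumnsToSkip);

public void buildRecordInPage(PageBuilder pageBuilder, IndexedRecord record,
        Map<Integer, String> partitionValueMap, int metaColumnsToSkip)
{
    pageBuilder.declarePosition();
    // Starting past exactly metaColumnsToSkip fields keeps the record fields
    // aligned with the page builder's blocks and declared types.
    int startChannel = metaColumnsToSkip;
    // ... build the remaining channels as before ...
}
```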

