-
Notifications
You must be signed in to change notification settings - Fork 1.2k
[core] Row lineage core support. #5935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3e9a745
to
a1af33a
Compare
@@ -1904,6 +1904,11 @@ public InlineElement getDescription() { | |||
+ "respectively. When not configured, it will automatically determine the algorithm based on the number of columns " | |||
+ "in 'sink.clustering.by-columns'. 'order' is used for 1 column, 'zorder' for less than 5 columns, " | |||
+ "and 'hilbert' for 5 or more columns."); | |||
public static final ConfigOption<Boolean> APPEND_ROW_LINEAGE_ENABLED = | |||
key("append.row.lineage.enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
row-tracking.enabled
You should check the table should be append table in SchemaValidation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
row-tracking.enabled
You should check the table should be append table in SchemaValidation.
OK
public static final DataField ROW_ID = | ||
new DataField(Integer.MAX_VALUE - 5, "_row_id", DataTypes.BIGINT()); | ||
|
||
public static final DataField SNAPSHOT_VERSION = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer using SEQUENCE_NUMBER
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then we should make SEQUENCE_NUMBER nullable
@@ -87,6 +87,12 @@ public class SpecialFields { | |||
new DataField( | |||
Integer.MAX_VALUE - 4, "rowkind", new VarCharType(VarCharType.MAX_LENGTH)); | |||
|
|||
public static final DataField ROW_ID = | |||
new DataField(Integer.MAX_VALUE - 5, "_row_id", DataTypes.BIGINT()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ROW_ID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -52,6 +54,10 @@ public class AppendCompactTask { | |||
public AppendCompactTask(BinaryRow partition, List<DataFileMeta> files) { | |||
Preconditions.checkArgument(files != null); | |||
this.partition = partition; | |||
files.sort( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to keep the row id in one file in order, this will may make "add new column function" more convenient.
new DataField(17, "_EXTERNAL_PATH", newStringType(true)))); | ||
new DataField(17, "_EXTERNAL_PATH", newStringType(true)), | ||
new DataField(18, "_ROW_ID", new BigIntType(true)), | ||
new DataField(19, "_SEQUENCE_NUMBER", new BigIntType(true)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use min max sequence number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -83,7 +83,9 @@ public class DataFileMeta { | |||
16, | |||
"_VALUE_STATS_COLS", | |||
DataTypes.ARRAY(DataTypes.STRING().notNull())), | |||
new DataField(17, "_EXTERNAL_PATH", newStringType(true)))); | |||
new DataField(17, "_EXTERNAL_PATH", newStringType(true)), | |||
new DataField(18, "_ROW_ID", new BigIntType(true)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_FIRST_ROW_ID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
813df40
to
2eec9cc
Compare
@@ -74,7 +74,7 @@ public class SpecialFields { | |||
public static final int KEY_FIELD_ID_START = SYSTEM_FIELD_ID_START; | |||
|
|||
public static final DataField SEQUENCE_NUMBER = | |||
new DataField(Integer.MAX_VALUE - 1, "_SEQUENCE_NUMBER", DataTypes.BIGINT().notNull()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just add nullable to append table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
private final long minSequenceNumber; | ||
private final long maxSequenceNumber; | ||
// As for row-lineage table, this will be reassigned while committing | ||
private long minSequenceNumber; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use copy instead making it variable.
@@ -74,6 +78,13 @@ public CommitMessage doCompact(FileStoreTable table, BaseAppendFileStoreWrite wr | |||
Preconditions.checkArgument( | |||
dvEnabled || compactBefore.size() > 1, | |||
"AppendOnlyCompactionTask need more than one file input."); | |||
if (table.coreOptions().rowTrackingEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this? Let's finish compact in next PR.
@@ -124,6 +126,8 @@ public class DataFileMeta { | |||
/** external path of file, if it is null, it is in the default warehouse path. */ | |||
private final @Nullable String externalPath; | |||
|
|||
private @Nullable Long rowIdStart; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_FIRST_ROW_ID or rowIdStart?
|
||
/** An counter that sums up {@code long} values. */ | ||
public class LongCounter implements Serializable { | ||
public class LongCounter implements SequenceNumberCounter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to change this class. This class just have 50 lines, we don't need to reuse codes for it.
import org.apache.paimon.data.InternalRow; | ||
|
||
/** Sequence number counter only generate min sequence. */ | ||
public class MinSequenceCounter implements SequenceNumberCounter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove it, it just for compact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
// assign row id for new files | ||
long start = startRowId; | ||
for (ManifestEntry entry : deltaFiles) { | ||
if (entry.file().fileSource().orElse(FileSource.COMPACT).equals(FileSource.APPEND)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert must have fileSource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
import org.apache.paimon.types.RowKind; | ||
|
||
/** Row with row lineage inject in. */ | ||
public class PartialMappingRow implements InternalRow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RowWithLineage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -81,6 +102,29 @@ public FileRecordIterator<InternalRow> readBatch() throws IOException { | |||
final ProjectedRow projectedRow = ProjectedRow.from(indexMapping); | |||
iterator = iterator.transform(projectedRow::replaceRow); | |||
} | |||
|
|||
if (rowLineageEnabled && !metaColumnIndex.isEmpty()) { | |||
GenericRow genericRow = new GenericRow(metaColumnIndex.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lineageRow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
|
||
@Override | ||
public boolean isNullAt(int pos) { | ||
for (int i = 0; i < mappings.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
/** A {@link Table} for reading row id of table. */ | ||
public class RowLineageTable implements DataTable, ReadonlyTable { | ||
|
||
public static final String WITH_METADATA = "row_lineage"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ROW_LINEAGE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
6abe0be
to
7b0f148
Compare
@@ -1911,6 +1911,11 @@ public InlineElement getDescription() { | |||
+ "respectively. When not configured, it will automatically determine the algorithm based on the number of columns " | |||
+ "in 'sink.clustering.by-columns'. 'order' is used for 1 column, 'zorder' for less than 5 columns, " | |||
+ "and 'hilbert' for 5 or more columns."); | |||
public static final ConfigOption<Boolean> ROW_TRACKING_ENABLED = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep a line above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -124,6 +126,8 @@ public class DataFileMeta { | |||
/** external path of file, if it is null, it is in the default warehouse path. */ | |||
private final @Nullable String externalPath; | |||
|
|||
private @Nullable Long firstRowId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
this.firstRowId = firstRowId; | ||
} | ||
|
||
public void setFirstRowId(@Nullable Long firstRowId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assignFirstRowId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
firstRowId); | ||
} | ||
|
||
public DataFileMeta copyWithMaxSequenceNumber(long maxSequenceNumber) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assignSequenceNumber(min, max)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
if (rowLineageEnabled && !metaColumnIndex.isEmpty()) { | ||
GenericRow lineageRow = new GenericRow(metaColumnIndex.size()); | ||
|
||
int[] fallbackToMetaRowLineageMappings = new int[tableRowType.getFieldCount()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fallbackToLineageMappings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
boolean rowLineageEnabled, | ||
@Nullable Long firstRowId, | ||
long snapshotId, | ||
Map<String, Integer> metaColumnIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
systemFields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
private final boolean rowLineageEnabled; | ||
@Nullable private final Long firstRowId; | ||
private final long snapshotId; | ||
private final Map<String, Integer> metaColumnIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
systemFields
private final FileRecordReader<InternalRow> reader; | ||
@Nullable private final int[] indexMapping; | ||
@Nullable private final PartitionInfo partitionInfo; | ||
@Nullable private final CastFieldGetter[] castMapping; | ||
private final boolean rowLineageEnabled; | ||
@Nullable private final Long firstRowId; | ||
private final long snapshotId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maxSequenceNumber
@@ -76,7 +76,8 @@ public void testCompatibilityToV4CommitV7() throws IOException { | |||
new byte[] {1, 2, 4}, | |||
FileSource.COMPACT, | |||
Arrays.asList("field1", "field2", "field3"), | |||
"hdfs://localhost:9000/path/to/file"); | |||
"hdfs://localhost:9000/path/to/file", | |||
null); | |||
List<DataFileMeta> dataFiles = Collections.singletonList(dataFile); | |||
|
|||
LinkedHashMap<String, DeletionVectorMeta> dvMetas = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -426,7 +427,10 @@ private static FunctionWithIOException<DataInputView, DataFileMeta> getFileMetaS | |||
} else if (version == 3 || version == 4) { | |||
DataFileMeta10LegacySerializer serializer = new DataFileMeta10LegacySerializer(); | |||
return serializer::deserialize; | |||
} else if (version >= 5) { | |||
} else if (version == 5 || version == 6) { | |||
DataFileMeta12LegacySerializer serializer = new DataFileMeta12LegacySerializer(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add test too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -136,7 +140,8 @@ public static DataFileMeta forAppend( | |||
@Nullable byte[] embeddedIndex, | |||
@Nullable FileSource fileSource, | |||
@Nullable List<String> valueStatsCols, | |||
@Nullable String externalPath) { | |||
@Nullable String externalPath, | |||
@Nullable Long rowIdStart) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
firstRowId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -165,8 +173,13 @@ public FormatReaderMapping build( | |||
|
|||
// extract the whole data fields in logic. | |||
List<DataField> allDataFields = fieldsExtractor.apply(dataSchema); | |||
if (CoreOptions.fromMap(dataSchema.options()).rowTrackingEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass rowTrackingEnabled
to this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Purpose
For paimon append table (bucket = -1), support new mod, row with unique row id.
This pull request introduce a new kind of table, which based on unaware bucket append table. It writes external two column in file (_row_id BIGINT, _snapshot_version BIGINT), it follows the following rules:
RELATED:
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=373886632&draftShareId=6a8aca3b-913f-4123-83e4-9106f2825736&
Tests
API and Format
Documentation