-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core, API, Spec: Metadata Row Lineage #11948
base: main
Are you sure you want to change the base?
Conversation
AtomicInteger fileNum = new AtomicInteger(0); | ||
private DataFile fileWithRows(int numRows) { | ||
return DataFiles.builder(PartitionSpec.unpartitioned()) | ||
.withRecordCount(numRows) | ||
.withFileSizeInBytes(numRows * 100) | ||
.withPath("file://file_" + fileNum.incrementAndGet() + ".parquet") | ||
.build(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this typical in core test to not order the attribute declarations, and the methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow. The ordering of the with
statements? I just added the only ones that matter for the test and the "Path" is required. I can reorder them though, what ordering would you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr: I was asking if we have any conventions to follow organising the code.
My question is aimed to understand the code conventions we are using in the core package (or in Iceberg generally for that matter). One such convention is published by Oracle: https://www.oracle.com/java/technologies/javase/codeconventions-fileorganization.html which suggests that instance variables should be declared at the top of the class.
Also in some places we put utility private methods at the bottom of the class (not in the Oracle convention though 😄)
For me, having this part of the code in the middle of the class between 2 tests was a little bit strange, but I could understand the reasoning that this private method and variable is only used in a single test, so it is ok to organise it as it is now. I just wanted to understand the patterns for my future PRs.
Sorry for the confusion 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move it to the bottom, I did just want to keep it close to this set of tests that use it, but bottom is also good
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
@nastra Very interested on your thoughts on my test suite here. I'm thinking about how to get away from relying on the Junit4 Test wrapper annotation in TestBase. I know I did something similar in the V3 Metadata change a while back and I'm wondering if you are on board with this direction. |
@@ -43,7 +43,16 @@ public void testJsonConversion() throws IOException { | |||
|
|||
Snapshot expected = | |||
new BaseSnapshot( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JonasJ-ap I believe has a WIP to add test builder classes for Metadata and Snapshot to make these sorts of changes less painful in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The builders will be added in #11947
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that builders in test code are the right path going forward. For BaseSnapshot
there's an easy way to avoid all the changes by having an overriding constructor. We could most likely do the same for TableMetadata, but let me first do a deep review of this PR to see what options we have here.
However, I'm curious what others think here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm strongly against multiple constructors because I think the use cases we have here (mostly test cases) don't actually need to specify all of the parameters. Currently we have tests like
testParameter5 () {
Snapshot = (foo, bar, x ,null, 4, ...,....,.,,..)
}
Which I think would be much cleaner as
testParameter5() {
testSnapshot().withParameter5(true) ... some beahvior
testSnapshot().withParameter5(false) .. some other behavior
}
Not only would the new tests be cleaner, but they would also be future proofed.
For the general use in the library I don't think there are any places in the code where we actually want to be using a constructor with less parameters. For Snapshot, we always want everything specified and for TableMetadata we expect users to go through the Public Builder and not the constructor.
@ParameterizedTest | ||
@FieldSource("org.apache.iceberg.TestHelpers#ALL_VERSIONS") | ||
public void testSnapshotAddition(int formatVersion) { | ||
assumeTrue(formatVersion >= TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use AssertJ assumptions: assumeThat(formatVersion).isGreaterThanOrEqualTo(...)
assertThrows( | ||
IllegalArgumentException.class, | ||
() -> TableMetadata.buildFromEmpty(formatVersion).enableRowLineage()); | ||
assertThat(notSupported.getMessage()).contains("Cannot use row lineage"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this (and other places) should also use AssertJ: assertThatThrownBy(() -> TableMetadata.buildFromEmpty(formatVersion).enableRowLineage())).isInstanceOf(IllegalArgumentException.class).hasMessageContaining(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I forgot which one we were using and hit a checkstyle bug that told me not to use junit5? I don't actually know which libraries are which
} | ||
|
||
@ParameterizedTest | ||
@FieldSource("org.apache.iceberg.TestHelpers#ALL_VERSIONS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like all of the methods are being executed with the same parameters. In that case it makes sense to do the parameterization at the class level. A simple example test is TestFastAppend
where the class is annotated with @ExtendWith(ParameterizedTestExtension.class)
and tests with @TestTemplate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something I wanted to ask you about, I am not a fan of the ExtendWith ParameterizedTestExtension, it feels like we are working against the framework we choose to use. I understand it was important for backwards compatability with our existing tests but do we really want to keep be using it in the future?
When I was working on some other tests for the schema compatibility checks, having the ExtendsWith basically blocks us from using any other Junit annotations for parameter passing. For example passing two sets of parameters. With this in mind I was wondering if it makes sense to not rely on the ParameteriazedTestExtension in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that the parameterization at the class level makes it impossible to use other annotations on a single test method, but most of the time we would still parameterize at the class level and skip running a particular test method for a certain parameter.
I don't have a strong opinion here and if you think these methods will eventually all have separate test parameters, then parameterizing at the method level is probably better.
However, currently it rather looks like the only parameters across all tests would be the format versions.
Also btw the ParameterizedTestExtension
was not only due to backwards compatibility during the migration process, but also as an intermediate solution until JUnit5 fully supports parameterization at the class level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Junit plan to support class level parameterization? I would be fully on board with it in that case, I just don't want to be stuck with the extension indefinitely.
String manifestList) { | ||
String manifestList, | ||
Long firstRowId, | ||
Long addedRows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth adding an overriding constructor that takes these two new arguments? And the existing constructor would then just pass nulls for these? That way you wouldn't have to update all the constructor calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think there is more of an argument to have multiple constructors since there isn't a builder, but I"m still hesitant to add another constructor here. What instances to we have where we want to make a new Snapshot and not have these fields explicitly specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess one alternative could be to introduce a Snapshot Builder instead of relying on the constructor. But if we decide to do that, I'd add that builder to the core module rather than the test module (as is being done by #11947).
It would be good to get some other opinions here in order to decide what the best path forward is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough I think we can also punt on this for now and have it be a cleanup later since it's not really integral to this pr
class EnableRowLineage implements MetadataUpdate { | ||
@Override | ||
public void applyTo(TableMetadata.Builder metadataBuilder) { | ||
MetadataUpdate.super.applyTo(metadataBuilder.enableRowLineage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be just metadataBuilder.enableRowLineage()
without the super call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also requires changes to MetadataUpdateParser
to properly pass the flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and we'd need to update/vote on the REST spec changes for this change once we add it to MetadataUpdateParser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah i'm not sure what happened here, I don't remember writing this, must have been Intellij auto refactoring something when I dragged it around.
@@ -282,6 +283,29 @@ public Snapshot apply() { | |||
throw new RuntimeIOException(e, "Failed to write manifest list file"); | |||
} | |||
|
|||
Long addedRows = null; | |||
Long lastRowId = null; | |||
if (base.rowLineage()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should call this rowLineageEnabled()
? It would certainly read better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
@@ -288,7 +293,9 @@ public String toString() { | |||
Map<String, SnapshotRef> refs, | |||
List<StatisticsFile> statisticsFiles, | |||
List<PartitionStatisticsFile> partitionStatisticsFiles, | |||
List<MetadataUpdate> changes) { | |||
List<MetadataUpdate> changes, | |||
boolean rowLineage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we'd also want to have a separate constructor that takes these two new flags. The existing code that don't need to set this wouldn't have to change, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we have a builder then it doesn't really make sense to have multiple constructors. All usages should be going through the builder, the test classes are the only ones (other than RewritePaths) that use the raw constructor and they do so incorrectly imho and should be using their own builder as well.
@@ -563,6 +578,14 @@ public TableMetadata withUUID() { | |||
return new Builder(this).assignUUID().build(); | |||
} | |||
|
|||
public boolean rowLineage() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rowLineageEnabled()
?
} | ||
} | ||
|
||
public Builder enableRowLineage() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels a bit weird to have both methods to effectively do the same. Maybe we should just reduce this to having a single method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops sorry, when I meant to do here is we only allow a public method that turns on RowLineage. There is no public way to turn off table lineage. So I think this is good as is. The method above this is just a private helper to determine whether or not enableRowLineage should be called. I added it when I added the method for turning on row-lineage via a table property but I don't want anyone calling this outside of this class.
|
||
private static final String TEST_LOCATION = "s3://bucket/test/location"; | ||
|
||
@TempDir protected File tableDir = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TempDir protected File tableDir = null; | |
@TempDir private File tableDir |
currently nothing is extending this class, so we can make this private
Types.NestedField.required(2, "y", Types.LongType.get(), "comment"), | ||
Types.NestedField.required(3, "z", Types.LongType.get())); | ||
|
||
private TableMetadata.Builder builderFor(int formatVersion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I feel like this method doesn't buy as a lot since it's only used in two places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is me starting to thinking about building a real builder for test metadata objects then giving up. I think Jonas's code is the right way forward.
"Cannot add a snapshot with a null `addedRows` field when row lineage is enabled"); | ||
} | ||
|
||
private AtomicInteger fileNum = new AtomicInteger(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be final
|
||
private AtomicInteger fileNum = new AtomicInteger(0); | ||
|
||
private DataFile fileWithRows(int numRows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private DataFile fileWithRows(int numRows) { | |
private DataFile fileWithRows(long numRows) { |
nit: you might want to move this method to the top/bottom with the fileNum
field, as otherwise it seems a bit lost between test methods
Long addedRows = null; | ||
Long lastRowId = null; | ||
if (base.rowLineageEnabled()) { | ||
addedRows = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extract line 289-306 to a utility method
@@ -388,4 +388,6 @@ private TableProperties() {} | |||
public static final int ENCRYPTION_DEK_LENGTH_DEFAULT = 16; | |||
|
|||
public static final int ENCRYPTION_AAD_LENGTH_DEFAULT = 16; | |||
|
|||
public static final String ROW_LINEAGE = "row-lineage"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: existing properties have enabled
for boolean flags. maybe row-lineage-enabled
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my only quibble here would be that we defined the field as row-lineage, so I think the property should match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we defined the field as row-lineage
you meant in the spec? if yes, agree that we should keep them the same.
@ParameterizedTest | ||
@FieldSource("org.apache.iceberg.TestHelpers#ALL_VERSIONS") | ||
public void testRowLineageSupported(int formatVersion) { | ||
if (formatVersion == TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the condition probably should be >=
to work for future versions like V4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, mistake on my part
@@ -262,6 +265,8 @@ public String toString() { | |||
private volatile Map<Long, Snapshot> snapshotsById; | |||
private volatile Map<String, SnapshotRef> refs; | |||
private volatile boolean snapshotsLoaded; | |||
private final Boolean rowLineageEnabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be made a primitive boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me double check, currently the issue is I don't want to add a row-lineage = false field to a file that hasn't set row-lineage. I guess we don't really care about that though so maybe I should just use the boolean here.
Basically the question is do we want 3 states
- No row-lineage field
- row-lineage = false
- row-lineage = true
Or just 2
- row-lineage = false
- row-lineage = true
I can do the second option there, but then I think I should be omitting the field for V2 and V1 tables? I am not sure what our prior art is on this so I'll have to check ... My guiding principal here was to not write in the field at all if it's not explicitly mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be easier to just leave null as an option rather than special casing write behavior in case of Versions 1 or 2. I think we've noted previously we want to be strict in what we write so I want to be sure we aren't writing down fields that aren't valid for older table versions and I think this is the easiest route to that.
@@ -615,10 +638,15 @@ public TableMetadata replaceProperties(Map<String, String> rawProperties) { | |||
int newFormatVersion = | |||
PropertyUtil.propertyAsInt(rawProperties, TableProperties.FORMAT_VERSION, formatVersion); | |||
|
|||
Boolean newRowLineage = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, I think this could be primitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be null if it's not in the properties and this would indicate that the user isn't changing row-lineage or set it via a non-table-properties route. In that case we have to leave the value the same in table metadata.
So three options in this spot
- row-lineage is not set (do nothing) : null
- row-lineage is set to false : Do nothing if we are already false, fail if row-lineage is enabled
- row-lineage is set to true : enableRow lineage if it isn't already enabled
I currently put all that logic in the helper function setRowLineageEnabled which I kept private since it's only used by this location. I could just inline that all here though if that is clearer
snapshot.firstRowId() >= lastRowId, | ||
"Cannot add a snapshot whose first-row-id (%s) is less than the metadata `last-used-id` (%s) because this will end up generating duplicate row_ids.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be strictly greater?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah no, if my first row ID is 100 and I added 40 records (it'd be 100-139 inclusive), the next snapshot should be able to start at 140 (which would be the lastRowId in this case). I think this is good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as jonas noted, I copied the wrong field name. I looked at the design docs instead of at the spec. It's much clearer with "next-row-id" changing now
public void testSnapshotAddition(int formatVersion) { | ||
assumeThat(formatVersion).isGreaterThanOrEqualTo(TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE); | ||
|
||
Long newRows = 30L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: primitive?
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
123d1a4
to
f048cc3
Compare
Adds required metadata fields and operations for working with Row Lineage. Does not cover implementation within manifest entries or propagation.
Major Changes
Metadata -
*~~ Adds Snapshot field addedRows for explictly tracking the number of added rows in the snapshot This was not in the design doc ~~ Done in a separate PR Spec: Add added-rows field to Snapshot #11976
API-
Implementation
added-rows
and setfirst-row-id