Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core, API, Spec: Metadata Row Lineage #11948

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

RussellSpitzer
Copy link
Member

@RussellSpitzer RussellSpitzer commented Jan 10, 2025

Adds required metadata fields and operations for working with Row Lineage. Does not cover implementation within manifest entries or propagation.

Major Changes

Metadata -

  • Adds field last-row-id for tracking the last used row-id in the table
  • Adds Snapshot field first-row-id for tracking the first row-id assigned within the snapshot
    *~~ Adds Snapshot field addedRows for explictly tracking the number of added rows in the snapshot This was not in the design doc ~~ Done in a separate PR Spec: Add added-rows field to Snapshot #11976

API-

  • adds TableMetadata operations for enabling row lineage
  • adds getters for all of the above fields

Implementation

  • Changes "addSnapshot" in Table Metadata to use the addedRows from the snapshot to increment last-row-id
  • Changed SnapshotProducer to populate added-rows and set first-row-id

@github-actions github-actions bot added API core Specification Issues that may introduce spec changes. labels Jan 10, 2025
Comment on lines 152 to 126
AtomicInteger fileNum = new AtomicInteger(0);
private DataFile fileWithRows(int numRows) {
return DataFiles.builder(PartitionSpec.unpartitioned())
.withRecordCount(numRows)
.withFileSizeInBytes(numRows * 100)
.withPath("file://file_" + fileNum.incrementAndGet() + ".parquet")
.build();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this typical in core test to not order the attribute declarations, and the methods?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow. The ordering of the with statements? I just added the only ones that matter for the test and the "Path" is required. I can reorder them though, what ordering would you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr: I was asking if we have any conventions to follow organising the code.

My question is aimed to understand the code conventions we are using in the core package (or in Iceberg generally for that matter). One such convention is published by Oracle: https://www.oracle.com/java/technologies/javase/codeconventions-fileorganization.html which suggests that instance variables should be declared at the top of the class.
Also in some places we put utility private methods at the bottom of the class (not in the Oracle convention though 😄)

For me, having this part of the code in the middle of the class between 2 tests was a little bit strange, but I could understand the reasoning that this private method and variable is only used in a single test, so it is ok to organise it as it is now. I just wanted to understand the patterns for my future PRs.

Sorry for the confusion 😢

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move it to the bottom, I did just want to keep it close to this set of tests that use it, but bottom is also good

@RussellSpitzer
Copy link
Member Author

@nastra Very interested on your thoughts on my test suite here. I'm thinking about how to get away from relying on the Junit4 Test wrapper annotation in TestBase. I know I did something similar in the V3 Metadata change a while back and I'm wondering if you are on board with this direction.

@RussellSpitzer RussellSpitzer changed the title Metadata Row Lineage Core, API, Spec: Metadata Row Lineage Jan 14, 2025
@@ -43,7 +43,16 @@ public void testJsonConversion() throws IOException {

Snapshot expected =
new BaseSnapshot(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JonasJ-ap I believe has a WIP to add test builder classes for Metadata and Snapshot to make these sorts of changes less painful in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builders will be added in #11947

Copy link
Contributor

@nastra nastra Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that builders in test code are the right path going forward. For BaseSnapshot there's an easy way to avoid all the changes by having an overriding constructor. We could most likely do the same for TableMetadata, but let me first do a deep review of this PR to see what options we have here.
However, I'm curious what others think here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm strongly against multiple constructors because I think the use cases we have here (mostly test cases) don't actually need to specify all of the parameters. Currently we have tests like

testParameter5  () {
  Snapshot = (foo, bar, x ,null, 4, ...,....,.,,..)
}

Which I think would be much cleaner as

testParameter5() {
   testSnapshot().withParameter5(true) ... some beahvior
   testSnapshot().withParameter5(false) .. some other behavior
}

Not only would the new tests be cleaner, but they would also be future proofed.

For the general use in the library I don't think there are any places in the code where we actually want to be using a constructor with less parameters. For Snapshot, we always want everything specified and for TableMetadata we expect users to go through the Public Builder and not the constructor.

@ParameterizedTest
@FieldSource("org.apache.iceberg.TestHelpers#ALL_VERSIONS")
public void testSnapshotAddition(int formatVersion) {
assumeTrue(formatVersion >= TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use AssertJ assumptions: assumeThat(formatVersion).isGreaterThanOrEqualTo(...)

assertThrows(
IllegalArgumentException.class,
() -> TableMetadata.buildFromEmpty(formatVersion).enableRowLineage());
assertThat(notSupported.getMessage()).contains("Cannot use row lineage");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this (and other places) should also use AssertJ: assertThatThrownBy(() -> TableMetadata.buildFromEmpty(formatVersion).enableRowLineage())).isInstanceOf(IllegalArgumentException.class).hasMessageContaining(...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I forgot which one we were using and hit a checkstyle bug that told me not to use junit5? I don't actually know which libraries are which

}

@ParameterizedTest
@FieldSource("org.apache.iceberg.TestHelpers#ALL_VERSIONS")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like all of the methods are being executed with the same parameters. In that case it makes sense to do the parameterization at the class level. A simple example test is TestFastAppend where the class is annotated with @ExtendWith(ParameterizedTestExtension.class) and tests with @TestTemplate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something I wanted to ask you about, I am not a fan of the ExtendWith ParameterizedTestExtension, it feels like we are working against the framework we choose to use. I understand it was important for backwards compatability with our existing tests but do we really want to keep be using it in the future?

When I was working on some other tests for the schema compatibility checks, having the ExtendsWith basically blocks us from using any other Junit annotations for parameter passing. For example passing two sets of parameters. With this in mind I was wondering if it makes sense to not rely on the ParameteriazedTestExtension in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that the parameterization at the class level makes it impossible to use other annotations on a single test method, but most of the time we would still parameterize at the class level and skip running a particular test method for a certain parameter.
I don't have a strong opinion here and if you think these methods will eventually all have separate test parameters, then parameterizing at the method level is probably better.
However, currently it rather looks like the only parameters across all tests would be the format versions.

Also btw the ParameterizedTestExtension was not only due to backwards compatibility during the migration process, but also as an intermediate solution until JUnit5 fully supports parameterization at the class level

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Junit plan to support class level parameterization? I would be fully on board with it in that case, I just don't want to be stuck with the extension indefinitely.

String manifestList) {
String manifestList,
Long firstRowId,
Long addedRows) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth adding an overriding constructor that takes these two new arguments? And the existing constructor would then just pass nulls for these? That way you wouldn't have to update all the constructor calls

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think there is more of an argument to have multiple constructors since there isn't a builder, but I"m still hesitant to add another constructor here. What instances to we have where we want to make a new Snapshot and not have these fields explicitly specified?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one alternative could be to introduce a Snapshot Builder instead of relying on the constructor. But if we decide to do that, I'd add that builder to the core module rather than the test module (as is being done by #11947).
It would be good to get some other opinions here in order to decide what the best path forward is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough I think we can also punt on this for now and have it be a cleanup later since it's not really integral to this pr

class EnableRowLineage implements MetadataUpdate {
@Override
public void applyTo(TableMetadata.Builder metadataBuilder) {
MetadataUpdate.super.applyTo(metadataBuilder.enableRowLineage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be just metadataBuilder.enableRowLineage() without the super call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also requires changes to MetadataUpdateParser to properly pass the flag

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we'd need to update/vote on the REST spec changes for this change once we add it to MetadataUpdateParser

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i'm not sure what happened here, I don't remember writing this, must have been Intellij auto refactoring something when I dragged it around.

@@ -282,6 +283,29 @@ public Snapshot apply() {
throw new RuntimeIOException(e, "Failed to write manifest list file");
}

Long addedRows = null;
Long lastRowId = null;
if (base.rowLineage()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should call this rowLineageEnabled()? It would certainly read better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

@@ -288,7 +293,9 @@ public String toString() {
Map<String, SnapshotRef> refs,
List<StatisticsFile> statisticsFiles,
List<PartitionStatisticsFile> partitionStatisticsFiles,
List<MetadataUpdate> changes) {
List<MetadataUpdate> changes,
boolean rowLineage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we'd also want to have a separate constructor that takes these two new flags. The existing code that don't need to set this wouldn't have to change, wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we have a builder then it doesn't really make sense to have multiple constructors. All usages should be going through the builder, the test classes are the only ones (other than RewritePaths) that use the raw constructor and they do so incorrectly imho and should be using their own builder as well.

@@ -563,6 +578,14 @@ public TableMetadata withUUID() {
return new Builder(this).assignUUID().build();
}

public boolean rowLineage() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rowLineageEnabled()?

}
}

public Builder enableRowLineage() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels a bit weird to have both methods to effectively do the same. Maybe we should just reduce this to having a single method?

Copy link
Member Author

@RussellSpitzer RussellSpitzer Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry, when I meant to do here is we only allow a public method that turns on RowLineage. There is no public way to turn off table lineage. So I think this is good as is. The method above this is just a private helper to determine whether or not enableRowLineage should be called. I added it when I added the method for turning on row-lineage via a table property but I don't want anyone calling this outside of this class.


private static final String TEST_LOCATION = "s3://bucket/test/location";

@TempDir protected File tableDir = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@TempDir protected File tableDir = null;
@TempDir private File tableDir

currently nothing is extending this class, so we can make this private

Types.NestedField.required(2, "y", Types.LongType.get(), "comment"),
Types.NestedField.required(3, "z", Types.LongType.get()));

private TableMetadata.Builder builderFor(int formatVersion) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel like this method doesn't buy as a lot since it's only used in two places

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is me starting to thinking about building a real builder for test metadata objects then giving up. I think Jonas's code is the right way forward.

"Cannot add a snapshot with a null `addedRows` field when row lineage is enabled");
}

private AtomicInteger fileNum = new AtomicInteger(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be final


private AtomicInteger fileNum = new AtomicInteger(0);

private DataFile fileWithRows(int numRows) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private DataFile fileWithRows(int numRows) {
private DataFile fileWithRows(long numRows) {

nit: you might want to move this method to the top/bottom with the fileNum field, as otherwise it seems a bit lost between test methods

format/spec.md Outdated Show resolved Hide resolved
Long addedRows = null;
Long lastRowId = null;
if (base.rowLineageEnabled()) {
addedRows =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extract line 289-306 to a utility method

@@ -388,4 +388,6 @@ private TableProperties() {}
public static final int ENCRYPTION_DEK_LENGTH_DEFAULT = 16;

public static final int ENCRYPTION_AAD_LENGTH_DEFAULT = 16;

public static final String ROW_LINEAGE = "row-lineage";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: existing properties have enabled for boolean flags. maybe row-lineage-enabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my only quibble here would be that we defined the field as row-lineage, so I think the property should match?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we defined the field as row-lineage

you meant in the spec? if yes, agree that we should keep them the same.

@ParameterizedTest
@FieldSource("org.apache.iceberg.TestHelpers#ALL_VERSIONS")
public void testRowLineageSupported(int formatVersion) {
if (formatVersion == TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the condition probably should be >= to work for future versions like V4

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, mistake on my part

@@ -262,6 +265,8 @@ public String toString() {
private volatile Map<Long, Snapshot> snapshotsById;
private volatile Map<String, SnapshotRef> refs;
private volatile boolean snapshotsLoaded;
private final Boolean rowLineageEnabled;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be made a primitive boolean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me double check, currently the issue is I don't want to add a row-lineage = false field to a file that hasn't set row-lineage. I guess we don't really care about that though so maybe I should just use the boolean here.

Basically the question is do we want 3 states

  1. No row-lineage field
  2. row-lineage = false
  3. row-lineage = true

Or just 2

  1. row-lineage = false
  2. row-lineage = true

I can do the second option there, but then I think I should be omitting the field for V2 and V1 tables? I am not sure what our prior art is on this so I'll have to check ... My guiding principal here was to not write in the field at all if it's not explicitly mentioned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be easier to just leave null as an option rather than special casing write behavior in case of Versions 1 or 2. I think we've noted previously we want to be strict in what we write so I want to be sure we aren't writing down fields that aren't valid for older table versions and I think this is the easiest route to that.

@@ -615,10 +638,15 @@ public TableMetadata replaceProperties(Map<String, String> rawProperties) {
int newFormatVersion =
PropertyUtil.propertyAsInt(rawProperties, TableProperties.FORMAT_VERSION, formatVersion);

Boolean newRowLineage =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, I think this could be primitive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be null if it's not in the properties and this would indicate that the user isn't changing row-lineage or set it via a non-table-properties route. In that case we have to leave the value the same in table metadata.

So three options in this spot

  1. row-lineage is not set (do nothing) : null
  2. row-lineage is set to false : Do nothing if we are already false, fail if row-lineage is enabled
  3. row-lineage is set to true : enableRow lineage if it isn't already enabled

I currently put all that logic in the helper function setRowLineageEnabled which I kept private since it's only used by this location. I could just inline that all here though if that is clearer

Comment on lines 1270 to 1271
snapshot.firstRowId() >= lastRowId,
"Cannot add a snapshot whose first-row-id (%s) is less than the metadata `last-used-id` (%s) because this will end up generating duplicate row_ids.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be strictly greater?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no, if my first row ID is 100 and I added 40 records (it'd be 100-139 inclusive), the next snapshot should be able to start at 140 (which would be the lastRowId in this case). I think this is good

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as jonas noted, I copied the wrong field name. I looked at the design docs instead of at the spec. It's much clearer with "next-row-id" changing now

public void testSnapshotAddition(int formatVersion) {
assumeThat(formatVersion).isGreaterThanOrEqualTo(TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE);

Long newRows = 30L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: primitive?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API core Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants