Very weird behavior with merge + checkpoints + optimization #3133
Comments
@aldder, to help me narrow this down and understand it better: can you try this against some older versions and let me know if you also see that behaviour?
@ion-elgreco Sure! Thank you so much for the support :)
@ion-elgreco yes, you're right: with 0.19.0 we have the error too.
@ion-elgreco

dt.merge(
    source=df,
    predicate='t.id = s.id and t.validity_to = s.validity_to',
    target_alias='t',
    source_alias='s'
).when_matched_update(
    updates={'validity_to': 's.validity_from'},
    predicate='t.value != s.value'
).when_matched_insert_all(
    predicate='t.value != s.value'
).when_not_matched_insert_all(
).execute()

I'm fairly new to deltalake, so I don't know if the transaction protocol allows for more than one when_matched clause.
@JonasDev1 do you perhaps have time to have a look at this? It's likely due to the min/max filtering.
To me it looks like an inconsistency in the state after checkpointing a delta table. With version 0.19.0, we rely on the min/max statistics and the predicate of the transactions.
@JonasDev1 right, then it's actually related to this: #3057, which I tried to address in this PR #3064, but Robert suggested it might be better to refresh the table. @aldder, can you force doing DeltaTable("path") each time before doing a merge and compact?
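For reference, forcing a fresh handle before every operation would look roughly like this (a sketch only; 'delta' is the local table path used elsewhere in this thread):

from deltalake import DeltaTable

# rebuild the table handle from the log on disk before each operation,
# instead of reusing a long-lived DeltaTable object
dt = DeltaTable('delta')
# ... run the merge from the snippet above against this fresh `dt` ...

dt = DeltaTable('delta')   # refresh again before housekeeping
dt.optimize.compact()
dt.create_checkpoint()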
@ion-elgreco back on 0.24.0: having dt = DeltaTable('delta') before each merge and optimization operation still produces the error, but this time all rows are involved:
Also, running more tests, I focused on the timestamp fields (recall point 3: "If I use different types for the validity_start/validity_end fields (like monotonic increasing integers or timestamps converted to strings) the error does not occur.")

df = pd.DataFrame([{
    'id': 'A',
    'value': float(i),
    'validity_from': pd.Timestamp.utcnow().round('us'),
    'validity_to': pd.Timestamp('9999-12-31 23:59:59.999', tz='UTC')  # it works with that
}])

maybe because the
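For what it's worth, the working sentinel only differs from a microsecond-precision one below the millisecond, which is exactly the kind of detail a lossy stats round-trip could drop. A tiny illustration (assuming a pandas version that accepts these far-future timestamps, as the snippet above already does; the microsecond variant is hypothetical):

import pandas as pd

inf_us = pd.Timestamp('9999-12-31 23:59:59.999999', tz='UTC')  # microsecond-precision "infinity" (hypothetical default)
inf_ms = pd.Timestamp('9999-12-31 23:59:59.999', tz='UTC')     # millisecond-precision value that reportedly works
print(inf_us, inf_ms)
print(inf_us - inf_ms)  # 999 microseconds: sub-millisecond detail that must survive the stats round-trip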
Maybe the min/max stats are corrupted?
Things I've figured out so far: the problem occurs during the first merge (which should update an existing row) if there was an optimize and a checkpoint beforehand. If that's the case, the merge finds no matches and isn't applied, because it skips the file during the scan. The second merge works as expected.
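A quick way to make the "finds no matches" case visible from Python (a sketch: it assumes execute() returns the merge metrics dict, as it does in recent releases as far as I know, and it reuses the table path and a simplified version of the merge quoted earlier in this thread):

from deltalake import DeltaTable

dt = DeltaTable('delta')
metrics = dt.merge(
    source=df,  # the incoming single-row DataFrame from the example above
    predicate='t.id = s.id and t.validity_to = s.validity_to',
    target_alias='t',
    source_alias='s',
).when_matched_update(
    updates={'validity_to': 's.validity_from'},
    predicate='t.value != s.value',
).when_not_matched_insert_all().execute()

# right after optimize + checkpoint, the matched/updated counters should show 0
# even though a matching row exists in the table
print(metrics)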
@JonasDev1 I am seeing |
So, my observations until now: the log files correctly show the timestamp values, and the parquet checkpoint does as well. And @aldder noticed this: "I noticed that if I change the default validity_to field to milliseconds instead of microseconds the error doesn't occur." This might indicate some rounding error when the statistics are parsed from the checkpoint parquet; still diving into this, though.
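One way to cross-check that is to dump the per-file statistics the scan uses for skipping, before and after create_checkpoint() (a sketch; get_add_actions(flatten=True) is in the Python bindings, and I'm assuming the flattened stats columns carry the column name, e.g. something like 'min.validity_to'):

from deltalake import DeltaTable

dt = DeltaTable('delta')
actions = dt.get_add_actions(flatten=True).to_pandas()
# compare the validity_to min/max values here with the ones in the JSON log
print(actions.filter(like='validity_to'))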
@roeap do you have any ideas? I didn't get any further while debugging it yesterday.
Interesting case!! @ion-elgreco, from what I can tell, I would follow the reasoning you folks established. Could this somehow be related to type coercion somewhere, maybe? Since microsecond is technically not supported, we need to cast? I guess microsecond values in stats would always lose precision, but we would need to be careful with the min/max stuff ... Just a quick thought, but I will try to spend some more time on this this weekend if we don't know by then ...
@roeap did you mean milli- or nanoseconds? The primitive type should be microseconds, right? At least according to the protocol.
My bad - yes, microseconds are the supported type. Since there is a difference when using millis, maybe still worth looking into?
Definitely! I was looking yesterday at the stats coming out of the parquet, but I didn't see anything odd. It could also perhaps be in the literal parsing in Arrow/DataFusion :s
Environment
Delta-rs version: 0.24.0
Binding: python
Environment:
Bug
What happened:
I am trying to reproduce with deltalake a situation analogous to temporal/system-versioned tables on SQL, i.e. a table with fields:
id, value, validity_from, validity_to
where, when we enter data for which a value already exists and the new value is different from the last one entered for the same key, the previous entry's validity_to field is updated to the validity_from of the new entry, and the new entry is inserted with validity_to equal to infinity (https://sqlspreads.com/blog/temporal-tables-in-sql-server/).

Example:
- at t=2021-01-01 we get value=10 for key id=A
- at t=2021-01-02 we get value=11 for key id=A
- at t=2021-01-03 we get value=11 for key id=A
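For concreteness, this is the final state I would expect after those three inserts under the rule above (the third insert repeats the value, so it should change nothing); a sketch with string dates just for illustration:

import pandas as pd

expected = pd.DataFrame([
    {'id': 'A', 'value': 10.0, 'validity_from': '2021-01-01', 'validity_to': '2021-01-02'},
    {'id': 'A', 'value': 11.0, 'validity_from': '2021-01-02', 'validity_to': '9999-12-31'},  # "infinity"
])
print(expected)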
How to reproduce it:
This is the code to build this and to reproduce the error:
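(The original code block did not survive in this page capture. A rough reconstruction, based on the schema and merge quoted in the comments above plus the optimize/checkpoint sequence discussed later in the thread, would look something like the sketch below; the path, the loop, and the sentinel value are assumptions.)

import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = 'delta'  # assumed local table path

def make_row(value):
    # one incoming record; validity_to starts out at a far-future "infinity"
    return pd.DataFrame([{
        'id': 'A',
        'value': float(value),
        'validity_from': pd.Timestamp.utcnow().round('us'),
        'validity_to': pd.Timestamp('9999-12-31 23:59:59.999999', tz='UTC'),
    }])

write_deltalake(path, make_row(0))  # seed the table

for i in range(1, 6):
    dt = DeltaTable(path)
    dt.merge(
        source=make_row(i),
        predicate='t.id = s.id and t.validity_to = s.validity_to',
        target_alias='t',
        source_alias='s',
    ).when_matched_update(
        updates={'validity_to': 's.validity_from'},
        predicate='t.value != s.value',
    ).when_not_matched_insert_all().execute()

    dt = DeltaTable(path)   # fresh handle before housekeeping
    dt.optimize.compact()
    dt.create_checkpoint()

print(DeltaTable(path).to_pandas().sort_values('validity_from'))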
What you expected to happen:
I expect that each entry's validity_to field is updated with the validity_from of the subsequent entry (except for the last entry). Instead I get this:
More details:
It is as if, from a certain moment, the match condition in the merge stopped working, bringing the data into an inconsistent state.
I made a lot of different tests and I noticed (among other things) that if I use different types for the validity_start/validity_end fields (like monotonic increasing integers or timestamps converted to strings), the error does not occur.
I hope I've managed to explain myself properly :)