Pyarrow writer not encoding correct URL for partitions in delta table #2978
Comments
I tried to install version 0.18.3, but it was reported as not available, so I installed 0.19.0 and tried to optimize. The table was written with version 0.17.4.
|
Please use the latest version; this is already resolved. |
@ion-elgreco With deltalake = "0.21.0", I even tried to optimize without Z-order,
but I still see partitions like this after optimize/vacuum. |
With optimize, I also do not see the number of partition files reduced; in fact, with the new partitions (with spaces), the file count has increased. Is this expected? |
@ion-elgreco It only works sometimes. Because broken partitions are created, optimize sometimes fails with the error below:
|
Try recreating the table with the latest version. |
@ion-elgreco Yes, as I mentioned in the previous comment, both the write to the table and the optimize/vacuum are done with the latest version (0.21.0), which still breaks because of spaces in the partitions. When I write to the table, there are no spaces in the partitions, but after optimize the spaces appear. My partition column is DayHour, with values like (2024-1-09 21:00:00). Are the spaces created because of this during optimize? Should we not have the date and hour together as the partition column? Is there an alternative we can use for this? |
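One possible alternative to a combined DayHour value, not confirmed anywhere in this thread to avoid the problem, would be to partition on separate date and hour columns so that no partition value contains a space or colon. A minimal sketch using only the standard library; the helper name `split_day_hour` and the resulting column layout are assumptions for illustration:

```python
from datetime import datetime
from typing import Tuple


def split_day_hour(day_hour: str) -> Tuple[str, str]:
    """Split a 'YYYY-M-DD HH:MM:SS' string into a date part and an hour part.

    Hypothetical helper: the two parts could be used as separate partition
    columns, neither of which contains a space or a colon.
    """
    parsed = datetime.strptime(day_hour, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d"), parsed.strftime("%H")


date_part, hour_part = split_day_hour("2024-1-09 21:00:00")
print(date_part, hour_part)  # 2024-01-09 21
```

Whether this sidesteps the optimize behaviour described here would still need to be tested against the actual table.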
@gprashmi are you on Windows by any chance? |
@thomasfrederikhoeck I have a Windows laptop, but I run these in a Kubeflow experiment on a Databricks cluster. |
Okay. It was just because a similar issue (apache/arrow-rs#5592) has been fixed upstream but I don't think |
Yeah |
I guess if this PR fixes it, then it should also fix this: #2843 |
@thomasfrederikhoeck Thank you for the update. Can you please let me know when delta-rs will be updated to use object-store=0.10.2? @ion-elgreco Based on the comment from @thomasfrederikhoeck, it looks like this would be fixed once delta-rs uses the updated object-store=0.10.2 version. Can you please let me know whether there is a plan to update delta-rs to the latest object-store version? |
Feel free to create a PR for it |
Maybe fixed by #2994 |
I'm not 100% sure this fixes this case so maybe leave it open @ion-elgreco ? |
@thomasfrederikhoeck @ion-elgreco This did not fix the issue. I installed delta-rs as a Python package, updated object_store = 0.10.2 in the Cargo.toml, and tested the delta write, optimize, and vacuum. It still shows spaces in the URL. Sample code to reproduce:
This resulted in spaces in the URL encoding after optimize, as below. Can you please re-open this ticket? I am unable to reopen it from my end. |
@ion-elgreco Thank you for re-opening. So I guess the updated version of object_store did not help with optimize here. Please let me know if there are any other suggestions or alternatives we can use. |
Edit: |
@thomasfrederikhoeck I think it encodes the colon, but in some URLs it does not encode the space between the date and the hour in the DayHour column. And yes, this is on AWS. |
@gprashmi Does Spark encode the space in all instances (write/merge/etc.)? |
@thomasfrederikhoeck Yes, when we do a write/merge on the table without an optimize operation, the URL encoding works fine and spaces are encoded as %20. But if we use optimize, it fails and creates some duplicate partitions with spaces. |
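For reference, the encoding described above (space as %20, colon as %3A) can be reproduced with the Python standard library. This is only an illustration of percent-encoding a Hive-style partition directory name; the helper name `encode_partition_dir` is hypothetical, and the sketch does not claim to mirror what delta-rs or Spark do internally:

```python
from urllib.parse import quote, unquote


def encode_partition_dir(column: str, value: str) -> str:
    """Build a 'column=value' directory name with a percent-encoded value.

    Hypothetical helper for illustration; not part of delta-rs or Spark.
    """
    # safe="" means reserved characters such as space and colon are encoded
    return f"{column}={quote(value, safe='')}"


raw_value = "2024-1-09 21:00:00"
encoded = encode_partition_dir("DayHour", raw_value)
print(encoded)  # DayHour=2024-1-09%2021%3A00%3A00

# Decoding the value part gives back the original string
assert unquote(encoded.split("=", 1)[1]) == raw_value
```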
As in, Spark is not consistent in the encoding? |
@thomasfrederikhoeck Without optimize it is consistent; with optimize it is not. |
@thomasfrederikhoeck do you suggest any other database we can use or any other alternatives to optimize the table? |
@gprashmi If it is Spark that is not being consistent, should this be raised in the Java lib (https://github.com/delta-io/delta) instead? |
@thomasfrederikhoeck I created an issue on the Delta GitHub repository (delta-io/delta#3892). |
Maybe related to #2308 |
Environment
Delta-rs version: 0.19.0
What happened:
We write data to a Delta table using delta-rs with the PyArrow engine, with DayHour as the partition column.
I ran the optimize command on the Delta table using the Spark SQL query below.
After optimize, it creates partitions with spaces and does not properly encode the partition URLs, as shown in the image below; that is, it creates new partition URLs with spaces (.zstd.parquet).
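To spot affected files in a listing, a check along the following lines could be used. It is a minimal sketch: the helper name `find_unencoded_partition_paths` and the two example paths are made up for illustration and are not taken from the actual table.

```python
from typing import List


def find_unencoded_partition_paths(paths: List[str]) -> List[str]:
    """Return paths whose directory part contains a raw (unencoded) space."""
    flagged = []
    for path in paths:
        # Inspect only the directory part; the file name is the last segment
        directory = path.rsplit("/", 1)[0] if "/" in path else ""
        if " " in directory:
            flagged.append(path)
    return flagged


# Hypothetical example paths, invented for this sketch
example_paths = [
    "DayHour=2024-1-09%2021%3A00%3A00/part-0.zstd.parquet",
    "DayHour=2024-1-09 21%3A00%3A00/part-1.zstd.parquet",
]
print(find_unencoded_partition_paths(example_paths))
```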
@ion-elgreco Can you please let me know how we can run optimize.compact without ending up with partitions that contain spaces?
A similar issue was raised in June (#2634), where it was mentioned that it is fixed in version 0.18.3, but I still see the same issue when I optimize now. To clarify, I use the PyArrow engine and not Rust, in case that is what causes the broken partitions.