What happened:
I'm storing large but highly compressible strings in a delta table. If I write the data frame to parquet directly with pandas, the resulting parquet file is very small (2 kB for a 1 MB input string containing only the letter "a").
If I write the same data frame to a delta table, the resulting file is 2.0 MB.
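To put the size difference in perspective, here is a minimal sketch of how compressible such a string is. It assumes "1 MB" means 1,000,000 bytes and uses `zlib` from the standard library as a stand-in for the Parquet codec, so the exact numbers will differ from what pandas or delta-rs produce.

```python
import zlib

# Assumption: "1 MB" is taken as 1,000,000 bytes of the letter "a".
payload = ("a" * 1_000_000).encode("utf-8")

# zlib is only a stand-in here; the Parquet writer may use a different codec.
compressed = zlib.compress(payload)

print(f"raw: {len(payload)} bytes, compressed: {len(compressed)} bytes")
```

A general-purpose compressor reduces this input to a small fraction of its raw size, which is roughly what the pandas-written file shows and what the delta-written file does not.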
What you expected to happen:
I expect the delta parquet files to have a similar size as the normal parquet files.
I explicitly requested that the files be created without large statistics (truncated to 16 characters), and this seems to be true at the row group statistics level, as the output of parquet meta suggests.
However, when inspecting the file with a hex editor I still see the uncompressed strings. Is there a column index written, and if so, how do I turn it off?
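The hex editor check can also be done programmatically. The sketch below is not part of delta-rs or pyarrow: `longest_run` and `looks_uncompressed` are hypothetical helper names, and the threshold of 100,000 bytes is an arbitrary assumption. It only looks for a long run of one repeated byte in the raw file, which would suggest the string was stored uncompressed somewhere in it.

```python
import re
from pathlib import Path


def longest_run(data: bytes, value: bytes = b"a") -> int:
    """Return the length of the longest run of the single byte `value` in `data`."""
    pattern = re.compile(re.escape(value) + b"+")
    return max((len(m.group()) for m in pattern.finditer(data)), default=0)


def looks_uncompressed(path: str, value: bytes = b"a", threshold: int = 100_000) -> bool:
    """Heuristic: True if the file contains at least `threshold` repeated bytes in a row."""
    return longest_run(Path(path).read_bytes(), value) >= threshold
```

Calling `looks_uncompressed` on the path of each of the two parquet files would show whether only the delta-written file contains the raw string.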
More details:
parquet meta output on the two files is also different:
on the small one:
Row group 0: count: 1 299.00 B records start: 4 total(compressed): 299 B total(uncompressed): 1.000 MB
--------------------------------------------------------------------------------
            type    encodings  count  avg size  nulls  min / max
name        BINARY  Z _ R      1      82.00 B   0      "vin1" / "vin1"
date        BINARY  Z _ R      1      100.00 B  0      "2022-01-01" / "2022-01-01"
large_data  BINARY  Z _ R      1      117.00 B  0
on the delta table one:
Row group 0: count: 1 2.000 MB records start: 4 total(compressed): 2.000 MB total(uncompressed): 3.000 MB
--------------------------------------------------------------------------------
            type    encodings  count  avg size  nulls  min / max
name        BINARY  Z RB_      1      56.00 B          "vin1" / "vin1"
date        BINARY  Z RB_      1      74.00 B          "2022-01-01" / "2022-01-01"
large_data  BINARY  Z RB_      1      2.000 MB         "aaaaaaaaaaaaaaaa" / "aaaaaaaaaaaaaaab"
We are passing all the options correctly through to the ArrowWriter. I suggest you check with the arrow-rs committers which option you need to set to get the behavior you want: https://github.com/apache/arrow-rs/issues
It looks like max_statistics_size is deprecated in Apache Arrow as of 54.0.0 and is unused (#2033, docs); statistics_enabled should be used instead.
Environment
Delta-rs version: deltalake 0.20.2
Binding: python
Environment: