[python] File size too large - maybe stats related #2965

Open
convoi opened this issue Oct 29, 2024 · 4 comments
Labels
question Further information is requested

Comments

@convoi
convoi commented Oct 29, 2024

Environment

Delta-rs version: deltalake 0.20.2

Binding: python

Environment:

  • Cloud provider:
  • OS: Mac OS Sonoma (Apple Silicon)
  • Other:

Bug

What happened:
I'm storing large but highly compressible strings in a Delta table. If I write the DataFrame directly to Parquet with pandas, the resulting file is very small (2 KB for a 1 MB input string containing just the letter "a").
If I write the same DataFrame to a Delta table, the resulting file is 2.0 MB.
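
To put a number on "highly compressible", here is a minimal sketch using only the standard library. It uses zlib as a stand-in codec (an assumption for illustration; the issue itself uses ZSTD) and only shows that 1 MiB of a single repeated character shrinks to a tiny payload, which is why a file of a few KB is the expected order of magnitude.

```python
import zlib

# 1 MiB of the letter "a", the same payload as in the report.
one_mb = ("a" * 1024 * 1024).encode("utf-8")

# zlib is only a stand-in for ZSTD here (assumption for illustration).
compressed = zlib.compress(one_mb, level=9)

print(f"raw bytes:        {len(one_mb)}")
print(f"compressed bytes: {len(compressed)}")
print(f"ratio:            {len(one_mb) / len(compressed):.0f}x")
```

Any general-purpose codec should behave similarly on this input, so a 2.0 MB output file suggests the column data is being stored uncompressed somewhere in the file.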

What you expected to happen:
I expect the Delta Parquet files to have a similar size to the plain Parquet files.
I explicitly configured the writer to avoid large statistics (truncating them to 16 characters), and this seems to take effect at the row-group statistics level, as the parquet meta output suggests.
However, when inspecting the file with a hex editor, I still see the uncompressed strings. Is a column index being written, and if so, how do I turn it off?
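
The hex-editor check can also be done programmatically. The following is a rough sketch, not part of the original report: `longest_run` is a hypothetical helper that reads a file and returns the length of the longest run of a given byte. A run close to 1 MiB of "a" would confirm that the string is stored uncompressed.

```python
import re


def longest_run(path, byte=b"a"):
    """Return the length of the longest consecutive run of `byte` in a file.

    Hypothetical helper for illustration; reads the whole file into memory,
    which is fine for files of a few MB.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    pattern = re.compile(re.escape(byte) + b"+")
    return max((len(m.group()) for m in pattern.finditer(data)), default=0)


if __name__ == "__main__":
    import glob

    # Assumes the table from the reproduction was written to "test_delta".
    for parquet_path in glob.glob("test_delta/*.parquet"):
        print(parquet_path, longest_run(parquet_path))
```

This only tells you that the raw bytes are present, not which Parquet structure (data page, statistics, page index) contains them.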

How to reproduce it:

import deltalake as dt
import pandas as pd
import pyarrow as pa
# ColumnProperties and BloomFilterProperties were used below without being imported
from deltalake import BloomFilterProperties, ColumnProperties

# 1 MiB string made of a single repeated character (highly compressible)
one_mb = "a" * 1024 * 1024
df = pd.DataFrame({
    "name": ["vin1"],
    "date": ["2022-01-01"],
    "large_data": [one_mb],
})
dt.write_deltalake(
    "test_delta", df, engine="rust", mode="overwrite",
    writer_properties=dt.WriterProperties(
        compression="ZSTD",
        statistics_truncate_length=16,
        default_column_properties=ColumnProperties(dictionary_enabled=False, max_statistics_size=1),
        column_properties={
            # Same settings for the large column, with the bloom filter disabled as well
            "large_data": ColumnProperties(
                dictionary_enabled=False,
                max_statistics_size=1,
                bloom_filter_properties=BloomFilterProperties(set_bloom_filter_enabled=False),
            ),
        },
    ),
)

More details:
The parquet meta output for the two files also differs.
On the small one:

Row group 0:  count: 1  299.00 B records  start: 4  total(compressed): 299 B total(uncompressed):1.000 MB 
--------------------------------------------------------------------------------
            type      encodings count     avg size   nulls   min / max
name        BINARY    Z _ R     1         82.00 B    0       "vin1" / "vin1"
date        BINARY    Z _ R     1         100.00 B   0       "2022-01-01" / "2022-01-01"
large_data  BINARY    Z _ R     1         117.00 B   0       

On the Delta table one:

Row group 0:  count: 1  2.000 MB records  start: 4  total(compressed): 2.000 MB total(uncompressed):3.000 MB 
--------------------------------------------------------------------------------
            type      encodings count     avg size   nulls   min / max
name        BINARY    Z RB_     1         56.00 B            "vin1" / "vin1"
date        BINARY    Z RB_     1         74.00 B            "2022-01-01" / "2022-01-01"
large_data  BINARY    Z RB_     1         2.000 MB           "aaaaaaaaaaaaaaaa" / "aaaaaaaaaaaaaaab"

convoi added the bug label on Oct 29, 2024
ion-elgreco added the question label and removed the bug label on Nov 24, 2024
@ion-elgreco
Collaborator

We are passing all the options through to the ArrowWriter correctly. I suggest you check with the arrow-rs committers which option you need to set to get the behavior you want: https://github.com/apache/arrow-rs/issues

@convoi
Author

convoi commented Dec 9, 2024

This was using the rust engine, so the ArrowWriter should not be used, right?

@ion-elgreco
Collaborator

ion-elgreco commented Dec 9, 2024

ArrowWriter is a Rust struct. It's not the same as the PyArrow writer, which is C++.

@maxitg
Contributor

maxitg commented Jan 13, 2025

It looks like max_statistics_size is deprecated in Apache Arrow as of 54.0.0 and is unused (#2033, docs); statistics_enabled should be used instead.
