-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default to ZSTD compression when writing Parquet #981
Conversation
… write_parquet method
python/datafusion/dataframe.py
Outdated
""" | ||
path (str | pathlib.Path): The file path to write the Parquet file. | ||
compression (str): The compression algorithm to use. Default is "ZSTD". | ||
compression_level (int | None): The compression level to use. For ZSTD, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document that the compression level is different per algorithm. It's only zstd that has a 1-22 range IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean like
compression_level (int | None): The compression level to use. For ZSTD, the
recommended range is 1 to 22, with the default being 3. Higher levels
provide better compression but slower speed.
python/datafusion/dataframe.py
Outdated
# default compression level to 3 for ZSTD | ||
if compression == "ZSTD": | ||
if compression_level is None: | ||
compression_level = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to dig into what DuckDB's defaults are: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.
I used the default compression level in the manual from Facebook (author of zstd) - https://facebook.github.io/zstd/zstd_manual.html
I could not find a default in DuckDB's documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @kylebarron ,
Shall we adopt delta-rs' default, and use 4 as the default ZSTD compression level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I have amended the default to 4.
In delta-rs we have the default to use "snappy" compression, except our optimize operation which uses ZSTD(4) |
python/datafusion/dataframe.py
Outdated
recommended range is 1 to 22, with the default being 4. Higher levels | ||
provide better compression but slower speed. | ||
""" | ||
if compression == "ZSTD": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to reuse this code I added in deltalake a while ago:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ion-elgreco ,
I added the Compression Enum but omitted the check_valid_levels because these are already implemented in Rust DataFrame eg
datafusion-python/src/dataframe.rs
Lines 490 to 493 in 63b13da
"zstd" => Compression::ZSTD( | |
ZstdLevel::try_new(verify_compression_level(compression_level)? as i32) | |
.map_err(|e| PyValueError::new_err(format!("{e}")))?, | |
), |
Compression levels are tested in:
datafusion-python/python/tests/test_dataframe.py
Lines 1093 to 1106 in 63b13da
@pytest.mark.parametrize( | |
"compression, compression_level", | |
[("gzip", 12), ("brotli", 15), ("zstd", 23), ("wrong", 12)], | |
) | |
def test_write_compressed_parquet_wrong_compression_level( | |
df, tmp_path, compression, compression_level | |
): | |
path = tmp_path | |
with pytest.raises(ValueError): | |
df.write_parquet( | |
str(path), | |
compression=compression, | |
compression_level=compression_level, |
…s' write_parquet method
…t writing method doc
…t compression levels
4d3fd8d
to
e7ec09b
Compare
e7ec09b
to
41e1742
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is a very nice addition. It looks like you have a slight adjustment that ruff
is complaining about to fix the CI. My comments here are all minor.
python/datafusion/dataframe.py
Outdated
@@ -620,17 +679,34 @@ def write_csv(self, path: str | pathlib.Path, with_header: bool = False) -> None | |||
def write_parquet( | |||
self, | |||
path: str | pathlib.Path, | |||
compression: str = "uncompressed", | |||
compression: str = Compression.ZSTD.value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have this take as the type for compression std | Compression
and do a quick check and get the value passed a Compression
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
python/datafusion/dataframe.py
Outdated
if compression_enum in {Compression.GZIP, Compression.BROTLI, Compression.ZSTD}: | ||
if compression_level is None: | ||
compression_level = compression_enum.get_default_level() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than doing the checking here it would be slightly more ergonomic to just call compression_enum.get_default_level()
and have it return None rather than raise an error. But I could also see how some would see calling get_default_level
on the others as invalid. I'm not married to this idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This passes the None handling to Rust.
No tests broken, so this is a good ergonomic suggestion.
python/datafusion/dataframe.py
Outdated
"""Convert a string to a Compression enum value. | ||
|
||
Args: | ||
value (str): The string representation of the compression type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since the type hint indicates a str
you shouldn't have to repeat here, per the google code design spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good nit 😄
python/datafusion/dataframe.py
Outdated
"""Get the default compression level for the compression type. | ||
|
||
Returns: | ||
int: The default compression level. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: int
not required since it's in the hint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good nit 😄
It looks like some minor difference in ruff versions probably caused yours to pass and the CI to fail. I pushed a correction to this branch. |
Thank you for another great addition @kosiew ! |
Which issue does this PR close?
Closes #978.
Rationale for this change
Currently, the write_parquet method defaults to "uncompressed" Parquet files, which can lead to inefficient storage and slower performance during I/O operations. This change sets the default compression method to "ZSTD", a modern compression algorithm that provides an excellent balance of compression speed and ratio. Additionally, it introduces a default compression level of 3 for ZSTD, which is optimal for many use cases.
What changes are included in this PR?
Updated the default compression parameter in the write_parquet method from "uncompressed" to "ZSTD".
Introduced a default compression level of 3 for ZSTD if no level is specified.
Added validation to ensure the compression level for ZSTD falls within the valid range (1 to 22) and raises a ValueError otherwise.
Updated the docstring to clarify the default values and provide guidance for users on compression levels.
Are there any user-facing changes?
Yes:
The default behavior of write_parquet now compresses output files using ZSTD with a default compression level of 3, instead of leaving files uncompressed.
Users specifying an invalid compression level for ZSTD will now encounter a ValueError.