
Default to ZSTD compression when writing Parquet #981

Merged · 16 commits · Jan 11, 2025

Conversation

kosiew
Contributor

@kosiew kosiew commented Dec 24, 2024

Which issue does this PR close?

Closes #978.

Rationale for this change

Currently, the write_parquet method defaults to "uncompressed" Parquet files, which can lead to inefficient storage and slower performance during I/O operations. This change sets the default compression method to "ZSTD", a modern compression algorithm that provides an excellent balance of compression speed and ratio. Additionally, it introduces a default compression level of 3 for ZSTD, a reasonable trade-off for many use cases.

What changes are included in this PR?

Updated the default compression parameter in the write_parquet method from "uncompressed" to "ZSTD".
Introduced a default compression level of 3 for ZSTD if no level is specified.
Added validation to ensure the compression level for ZSTD falls within the valid range (1 to 22) and raises a ValueError otherwise.
Updated the docstring to clarify the default values and provide guidance for users on compression levels.

Are there any user-facing changes?

Yes:

The default behavior of write_parquet now compresses output files using ZSTD with a default compression level of 3, instead of leaving files uncompressed.
Users specifying an invalid compression level for ZSTD will now encounter a ValueError.

"""
path (str | pathlib.Path): The file path to write the Parquet file.
compression (str): The compression algorithm to use. Default is "ZSTD".
compression_level (int | None): The compression level to use. For ZSTD, the
Contributor

We should document that the compression level is different per algorithm. It's only zstd that has a 1-22 range IIRC.

Contributor Author

Do you mean like

```python
compression_level (int | None): The compression level to use. For ZSTD, the
    recommended range is 1 to 22, with the default being 3. Higher levels
    provide better compression but slower speed.
```

```python
# default compression level to 3 for ZSTD
if compression == "ZSTD":
    if compression_level is None:
        compression_level = 3
```
Contributor

3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.

Contributor

It might be nice to dig into what DuckDB's defaults are: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files

Contributor Author
@kosiew kosiew Dec 27, 2024

> 3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.

I used the default compression level in the manual from Facebook (author of zstd) - https://facebook.github.io/zstd/zstd_manual.html

I could not find a default in DuckDB's documentation.

Contributor Author

hi @kylebarron ,

Shall we adopt delta-rs' default, and use 4 as the default ZSTD compression level?

Contributor

Sure, that sounds good to me.

Contributor Author
@kosiew kosiew Jan 7, 2025

Thanks.
I have amended the default to 4.

@ion-elgreco
Contributor

In delta-rs the default is "snappy" compression, except for our optimize operation, which uses ZSTD(4)

```python
    recommended range is 1 to 22, with the default being 4. Higher levels
    provide better compression but slower speed.
    """
    if compression == "ZSTD":
```
Contributor Author
@kosiew kosiew Jan 8, 2025

Thanks @ion-elgreco ,

I added the Compression enum but omitted the check_valid_levels because these checks are already implemented in the Rust DataFrame, e.g.:

```rust
"zstd" => Compression::ZSTD(
    ZstdLevel::try_new(verify_compression_level(compression_level)? as i32)
        .map_err(|e| PyValueError::new_err(format!("{e}")))?,
),
```

Compression levels are tested in:

```python
@pytest.mark.parametrize(
    "compression, compression_level",
    [("gzip", 12), ("brotli", 15), ("zstd", 23), ("wrong", 12)],
)
def test_write_compressed_parquet_wrong_compression_level(
    df, tmp_path, compression, compression_level
):
    path = tmp_path
    with pytest.raises(ValueError):
        df.write_parquet(
            str(path),
            compression=compression,
            compression_level=compression_level,
        )
```
@kosiew kosiew force-pushed the parquet-default-compression branch from 4d3fd8d to e7ec09b Compare January 8, 2025 06:02
@kosiew kosiew force-pushed the parquet-default-compression branch from e7ec09b to 41e1742 Compare January 8, 2025 06:03
Contributor
@timsaucer timsaucer left a comment

Overall this is a very nice addition. It looks like there is a slight adjustment that ruff is complaining about; fixing that should fix the CI. My comments here are all minor.

```diff
@@ -620,17 +679,34 @@ def write_csv(self, path: str | pathlib.Path, with_header: bool = False) -> None
 def write_parquet(
     self,
     path: str | pathlib.Path,
-    compression: str = "uncompressed",
+    compression: str = Compression.ZSTD.value,
```
Contributor

It would be nice to have this take str | Compression as the type for compression, do a quick check, and convert the passed value to a Compression.

Contributor Author

Good point!

Comment on lines 705 to 707

```python
if compression_enum in {Compression.GZIP, Compression.BROTLI, Compression.ZSTD}:
    if compression_level is None:
        compression_level = compression_enum.get_default_level()
```
Contributor

Rather than doing the checking here it would be slightly more ergonomic to just call compression_enum.get_default_level() and have it return None rather than raise an error. But I could also see how some would see calling get_default_level on the others as invalid. I'm not married to this idea.

Contributor Author

This passes the None handling to Rust.
No tests broken, so this is a good ergonomic suggestion.
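The ergonomic version discussed here, with get_default_level returning None for codecs that take no level, might look like this (a sketch; only the ZSTD default of 4 is established by this PR, so other codecs simply return None and let the Rust layer handle it):

```python
from enum import Enum
from typing import Optional


class Compression(Enum):
    """Illustrative enum; only ZSTD's default level is fixed by this PR."""

    UNCOMPRESSED = "uncompressed"
    SNAPPY = "snappy"
    GZIP = "gzip"
    BROTLI = "brotli"
    ZSTD = "zstd"

    def get_default_level(self) -> Optional[int]:
        # Return None for codecs without a documented default here;
        # the Rust layer handles a None compression_level downstream.
        if self is Compression.ZSTD:
            return 4
        return None


def pick_level(compression: Compression, compression_level: Optional[int]) -> Optional[int]:
    # No membership check needed: get_default_level is safe to call on any member.
    if compression_level is not None:
        return compression_level
    return compression.get_default_level()
```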

"""Convert a string to a Compression enum value.

Args:
value (str): The string representation of the compression type.
Contributor

nit: since the type hint indicates a str you shouldn't have to repeat here, per the google code design spec.
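i.e., per the Google Python style guide, the annotation already carries the type, so the Args entry drops it (hypothetical rendering of the fix; the body is elided since only the docstring matters here):

```python
def from_str(value: str) -> "Compression":
    """Convert a string to a Compression enum value.

    Args:
        value: The string representation of the compression type.
    """
    raise NotImplementedError  # body elided; only the docstring matters here
```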

Contributor Author

Good nit 😄

"""Get the default compression level for the compression type.

Returns:
int: The default compression level.
Contributor

nit: int not required since it's in the hint

Contributor Author

Good nit 😄

@kosiew
Contributor Author

kosiew commented Jan 9, 2025

Does anyone know how to fix this CI error:

[screenshot of the CI failure]

```
ruff check --output-format=github python/
  ruff format --check python/
  shell: /usr/bin/bash -e {0}
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.11.11/x64
    PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.11.11/x64/lib/pkgconfig
    Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.11/x64
    Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.11/x64
    Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.11.11/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.11.11/x64/lib
Would reformat: python/tests/test_dataframe.py
1 file would be reformatted, 35 files already formatted
Error: Process completed with exit code 1.
```

I tried these commands and they complete without error on my machine:

```shell
ruff check --output-format=github python/
ruff format --check python/

ruff format python/tests/test_dataframe.py
```

@timsaucer
Contributor

It looks like some minor difference in ruff versions probably caused yours to pass and the CI to fail. I pushed a correction to this branch.

@timsaucer timsaucer merged commit 2d8b1d3 into apache:main Jan 11, 2025
15 checks passed
@timsaucer
Contributor

Thank you for another great addition @kosiew !


Successfully merging this pull request may close these issues.

Default to some compression when writing Parquet
5 participants