
Error accessing an iceberg table hosted in Azure blob storage #194

Open
jhatcher1 opened this issue Jan 10, 2025 · 5 comments
Labels
bug (Something isn't working) · priority-low (Low priority issue) · user-request (This issue was directly requested by a user)

Comments

@jhatcher1

What happens?

We are trying to connect to an iceberg table in Azure blob storage, using the iceberg foreign data wrapper. When creating the foreign table, we observe the error:

ERROR:  Invalid Input Error: The provided connection string does not match the storage account named iceberg@mystorageaccount

We do not see this error when using duckdb to access the iceberg table directly, or when using the parquet foreign data wrapper to access the iceberg table's parquet files directly.

To Reproduce

These are the steps performed to reproduce the error:

$ docker run --name paradedb -e POSTGRES_PASSWORD=password paradedb/paradedb
$ docker exec -it paradedb psql -U postgres
CREATE FOREIGN DATA WRAPPER iceberg_wrapper HANDLER iceberg_fdw_handler VALIDATOR iceberg_fdw_validator;
CREATE SERVER iceberg_server FOREIGN DATA WRAPPER iceberg_wrapper;

CREATE USER MAPPING FOR postgres
SERVER iceberg_server
OPTIONS (
  type 'AZURE',
  connection_string 'DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<access-key>;EndpointSuffix=core.windows.net'
);

CREATE FOREIGN TABLE iceberg_table ()
SERVER iceberg_server
OPTIONS (
    files 'abfss://iceberg/path/to/table/metadata/<id>.metadata.json',
    skip_schema_inference 'true'
);
ERROR:  Invalid Input Error: The provided connection string does not match the storage account named iceberg@mystorageaccount
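
One variant we have not tried (a sketch only; it assumes DuckDB's Azure extension also accepts a fully-qualified abfss path with the storage account in the host, which we have not verified through the FDW):

CREATE FOREIGN TABLE iceberg_table ()
SERVER iceberg_server
OPTIONS (
    files 'abfss://mystorageaccount.dfs.core.windows.net/iceberg/path/to/table/metadata/<id>.metadata.json',
    skip_schema_inference 'true'
);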

We do not see this error when using the duckdb CLI directly:

$ duckdb
INSTALL azure;
LOAD azure;

INSTALL iceberg;
LOAD iceberg;

CREATE SECRET mysecret (
    TYPE AZURE,
    CONNECTION_STRING 'DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<access-key>;EndpointSuffix=core.windows.net'
);

SELECT *
FROM iceberg_scan(
    'abfss://iceberg/path/to/table/metadata/<id>.metadata.json',
    skip_schema_inference = true
);
# Results are displayed from the table

We also do not see this error when using the parquet foreign data wrapper:

CREATE FOREIGN DATA WRAPPER parquet_wrapper HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;
CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

CREATE USER MAPPING FOR postgres
SERVER parquet_server
OPTIONS (
  type 'AZURE',
  connection_string 'DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<access-key>;EndpointSuffix=core.windows.net'
);

CREATE FOREIGN TABLE parquet_table ()
SERVER parquet_server
OPTIONS (files 'abfss://iceberg/path/to/table/data/*.parquet');
SELECT * FROM parquet_table;
# Results are displayed from the table

OS:

macOS (aarch64)

ParadeDB Version:

v0.13.2

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Docker Image

Full Name:

Jordan Hatcher

Affiliation:

MindBridge AI

Did you include all relevant data sets for reproducing the issue?

N/A - The reproduction does not require a data set

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have
@jhatcher1 added the bug label on Jan 10, 2025
@philippemnoel
Collaborator

Hi @jhatcher1. Thank you for reporting. Could you try querying this Iceberg metadata file from a local copy rather than from Azure Blob Storage and let us know whether that works? It would make it easier for us to narrow down the issue.
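
In case it's useful, a rough sketch of one way to pull the table down for a local test (assumes azcopy is installed and a SAS token is available; the account, container, paths, and token below are placeholders):

$ azcopy copy 'https://mystorageaccount.blob.core.windows.net/iceberg/path/to/table?<sas-token>' '/var/iceberg/permanent/myicebergtable' --recursive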

@philippemnoel added the good first issue, priority-medium, and user-request labels on Jan 10, 2025
@jhatcher1
Author

Hi @philippemnoel, I'll try to test this with a local iceberg table, but it won't be the exact same metadata since the metadata contains references to the data files in Azure blob storage.

@philippemnoel
Collaborator

> Hi @philippemnoel, I'll try to test this with a local iceberg table, but it won't be the exact same metadata since the metadata contains references to the data files in Azure blob storage.

Makes sense. It's still informative, as I'm trying to see whether the issue is in the iceberg extension's integration with Azure or in the iceberg extension itself.

@jhatcher1
Author

I've tried testing with a local iceberg table:

CREATE FOREIGN DATA WRAPPER iceberg_wrapper
HANDLER iceberg_fdw_handler
VALIDATOR iceberg_fdw_validator;

CREATE SERVER iceberg_server
FOREIGN DATA WRAPPER iceberg_wrapper;

CREATE FOREIGN TABLE iceberg_table ()
SERVER iceberg_server
OPTIONS (
    files '/var/iceberg/permanent/myicebergtable/metadata/00000-57d6ae62-6922-4ec7-aab2-6c0481fa6397.metadata.json',
    skip_schema_inference 'true'
);
ERROR:  IO Error: Cannot open file "file:/var/iceberg/permanent/myicebergtable/metadata/snap-5202555973838810904-1-d76bb6c6-337c-4a45-b423-066c491c19df.avro": No such file or directory

That error seems to be caused by an upstream issue in duckdb where it doesn't handle the file: protocol: duckdb/duckdb#13669

However, I think it shows that the extension was at least able to read the metadata.json file to determine the list of snapshot files to read.
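
For reference, a quick way to confirm which URIs the metadata points at (a sketch; assumes jq is installed and that the metadata follows the standard Iceberg layout, where each snapshot records a manifest-list path):

$ jq '.snapshots[]."manifest-list"' /var/iceberg/permanent/myicebergtable/metadata/00000-57d6ae62-6922-4ec7-aab2-6c0481fa6397.metadata.json
# expected to print something like "file:/var/iceberg/permanent/myicebergtable/metadata/snap-5202555973838810904-1-d76bb6c6-337c-4a45-b423-066c491c19df.avro"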

@philippemnoel
Collaborator

> I've tried testing with a local iceberg table: [...] However, I think it shows that the extension was at least able to read the metadata.json file to determine the list of snapshot files to read.

Got it. Thank you for testing. Iceberg support in DuckDB is still rather limited, and unfortunately not seeing as much movement as we'd like. We may need to wait until it improves.

@philippemnoel added the priority-low label and removed the good first issue and priority-medium labels on Jan 21, 2025