🚀 Csv2Parquet is a Java-based library designed to simplify the conversion of CSV files into Parquet format, dynamically generate Avro schemas, and perform comprehensive Parquet file analysis. This tool is optimized for performance and scalability, making it ideal for processing large datasets. This library was first developed for the InfiniteStack of SciCrop and is now open-sourced as a SciCrop Academy initiative.
- CSV to Parquet Conversion: Convert CSV files to optimized and compressed Parquet files effortlessly.
- Dynamic Schema Inference: Automatically infer Avro schemas from CSV headers and sample data.
- Parquet File Analysis: Analyze Parquet files, including record counts and column statistics.
- Schema Generator: Create Avro schemas programmatically from Java `Map` objects.
- JUnit 5 Test Suite: Comprehensive unit tests to validate the library's functionality.
- `GenericCSVToParquet`: Handles CSV to Parquet conversion with dynamic schema inference.
- `ParquetFileAnalyzer`: Provides tools to analyze Parquet files, including record inspection and statistics generation.
- `SchemaGenerator`: Dynamically generates Avro schemas based on field definitions.
- `Csv2ParquetTest`: JUnit 5 test class to validate the main functionalities of the library.
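How `GenericCSVToParquet` infers a schema from headers and sample data is not described in detail here. As a rough illustration only, the sketch below shows one way column types could be inferred from a header row and a few sample rows using just the JDK. The class `InferenceSketch`, its method names, the `long`/`double`/`boolean`/`string` fallback order, and the nullable-union output are all assumptions made for this example, not the library's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: infer Avro-style field types from CSV sample rows. */
final class InferenceSketch {

    /** Returns the narrowest Avro primitive name that fits every non-empty sample. */
    static String inferType(List<String> samples) {
        boolean allLong = true, allDouble = true, allBoolean = true, sawValue = false;
        for (String raw : samples) {
            String value = raw == null ? "" : raw.trim();
            if (value.isEmpty()) continue; // empty cells do not influence the type
            sawValue = true;
            // Up to 18 digits always fits in a 64-bit long.
            if (!value.matches("[+-]?\\d{1,18}")) allLong = false;
            if (!value.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)([eE][+-]?\\d+)?")) allDouble = false;
            if (!value.equalsIgnoreCase("true") && !value.equalsIgnoreCase("false")) allBoolean = false;
        }
        if (!sawValue) return "string"; // no evidence: fall back to string
        if (allLong) return "long";
        if (allDouble) return "double";
        if (allBoolean) return "boolean";
        return "string";
    }

    /**
     * Builds an Avro record schema as JSON, one nullable field per header column.
     * Assumes header names are already valid Avro names (letters, digits, underscore).
     */
    static String buildSchemaJson(String recordName, String[] header, List<String[]> rows) {
        List<String> fields = new ArrayList<>();
        for (int col = 0; col < header.length; col++) {
            List<String> samples = new ArrayList<>();
            for (String[] row : rows) {
                samples.add(col < row.length ? row[col] : ""); // short rows count as empty
            }
            fields.add("{\"name\":\"" + header[col].trim() + "\",\"type\":[\"null\",\""
                    + inferType(samples) + "\"],\"default\":null}");
        }
        return "{\"type\":\"record\",\"name\":\"" + recordName + "\",\"fields\":["
                + String.join(",", fields) + "]}";
    }

    public static void main(String[] args) {
        String[] header = {"InvoiceNo", "Quantity", "UnitPrice"};
        List<String[]> rows = List.of(
                new String[] {"A536365", "6", "2.55"},
                new String[] {"A536366", "", "3.39"});
        System.out.println(buildSchemaJson("Invoice", header, rows));
    }
}
```

Making every field a `["null", type]` union is a conservative choice for CSV input, where any cell may be empty. A real implementation would also need to sanitize header names and decide how many rows to sample.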
- Java 17 or higher
- Maven 3.6+
The following dependencies are required and included in the `pom.xml`:

    <dependencies>
      <!-- Apache Parquet -->
      <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-avro</artifactId>
        <version>1.12.3</version>
      </dependency>
      <!-- Apache Commons CSV -->
      <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-csv</artifactId>
        <version>1.10.0</version>
      </dependency>
      <!-- Apache Avro -->
      <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>1.11.2</version>
      </dependency>
      <!-- Apache Hadoop -->
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.6</version>
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.6</version>
      </dependency>
      <!-- JUnit 5 -->
      <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter</artifactId>
        <version>5.10.0</version>
        <scope>test</scope>
      </dependency>
    </dependencies>
- Convert CSV to Parquet

  Use `GenericCSVToParquet` to process a CSV file:

      java com.scicrop.infinitestack.GenericCSVToParquet /path/to/input.csv /path/to/output.parquet ,

  Replace `,` with your CSV delimiter if different, quoting it where your shell requires (for example `';'` or `'|'`).
- Analyze Parquet Files

  Use `ParquetFileAnalyzer` to inspect and analyze Parquet files:

      java com.scicrop.infinitestack.ParquetFileAnalyzer /path/to/output.parquet
- Generate Avro Schema

  Use `SchemaGenerator` to create Avro schemas dynamically:

      java com.scicrop.infinitestack.SchemaGenerator
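The output of `ParquetFileAnalyzer` is not shown in this README. To give a feel for what "record counts and column statistics" can mean in practice, the following sketch computes a per-column summary (non-null count, null count, min, max, mean) over a few in-memory records. It uses only the JDK and does not read Parquet files. `ColumnStatsSketch`, `Stats`, `summarize`, and `row` are hypothetical names chosen for this example, and the set of statistics is an assumption rather than what the analyzer actually reports.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-column statistics; not the ParquetFileAnalyzer API. */
final class ColumnStatsSketch {

    /** Running summary for one numeric column. */
    static final class Stats {
        long count;   // non-null values seen
        long nulls;   // null values seen
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum;

        void accept(Double value) {
            if (value == null) {
                nulls++;
                return;
            }
            count++;
            sum += value;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        double mean() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    /** Folds every record (column name -> value) into one Stats object per column. */
    static Map<String, Stats> summarize(List<Map<String, Double>> records) {
        Map<String, Stats> byColumn = new LinkedHashMap<>(); // keeps first-seen column order
        for (Map<String, Double> record : records) {
            for (Map.Entry<String, Double> cell : record.entrySet()) {
                byColumn.computeIfAbsent(cell.getKey(), k -> new Stats()).accept(cell.getValue());
            }
        }
        return byColumn;
    }

    /** Builds one sample record; a mutable map is used because Map.of rejects null values. */
    static Map<String, Double> row(Double quantity, Double unitPrice) {
        Map<String, Double> record = new LinkedHashMap<>();
        record.put("Quantity", quantity);
        record.put("UnitPrice", unitPrice);
        return record;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> records = new ArrayList<>();
        records.add(row(6.0, 2.55));
        records.add(row(null, 3.39));
        records.add(row(8.0, 2.75));
        summarize(records).forEach((name, s) -> System.out.printf(
                "%s: count=%d nulls=%d min=%.2f max=%.2f mean=%.2f%n",
                name, s.count, s.nulls, s.min, s.max, s.mean()));
    }
}
```

Because the summary is updated one value at a time, the same approach works when streaming records from a large file without holding them all in memory.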
- Add this repository as a dependency or clone it locally.
- Use the classes programmatically in your application:

      import java.util.Map;
      import org.apache.avro.Schema;
      import com.scicrop.infinitestack.GenericCSVToParquet;
      import com.scicrop.infinitestack.ParquetFileAnalyzer;
      import com.scicrop.infinitestack.SchemaGenerator;

      // Convert CSV to Parquet
      GenericCSVToParquet.process("/tmp/input.csv", "/tmp/output.parquet", ',');

      // Analyze a Parquet file
      ParquetFileAnalyzer analyzer = new ParquetFileAnalyzer("/tmp/output.parquet");
      analyzer.print();

      // Generate an Avro schema
      Map<String, Class<?>> fieldMap = Map.of(
          "InvoiceNo", String.class,
          "Quantity", Integer.class
      );
      Schema schema = SchemaGenerator.getSchemaByMap(fieldMap);
      System.out.println(schema.toString(true));
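One detail worth knowing when building a field map: `Map.of` does not guarantee iteration order. If `SchemaGenerator.getSchemaByMap` creates fields by iterating the map (an assumption; this README does not say), the field order in the generated schema could differ between runs. The sketch below uses only the JDK to illustrate the general idea of mapping Java classes to Avro primitive type names, with a `LinkedHashMap` for stable ordering. `MapSchemaSketch`, `avroTypeOf`, and `toSchemaJson` are hypothetical names, and the type mapping is an assumption, not the library's actual behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch: turn a name -> Java class map into Avro record schema JSON. */
final class MapSchemaSketch {

    /** Maps a Java class to an Avro primitive type name (assumed mapping). */
    static String avroTypeOf(Class<?> type) {
        if (type == String.class) return "string";
        if (type == Integer.class || type == int.class) return "int";
        if (type == Long.class || type == long.class) return "long";
        if (type == Float.class || type == float.class) return "float";
        if (type == Double.class || type == double.class) return "double";
        if (type == Boolean.class || type == boolean.class) return "boolean";
        throw new IllegalArgumentException("No Avro mapping assumed for " + type.getName());
    }

    /** Emits fields in the map's iteration order, so pass an ordered map for stable output. */
    static String toSchemaJson(String recordName, Map<String, Class<?>> fields) {
        StringJoiner joined = new StringJoiner(",", "[", "]");
        for (Map.Entry<String, Class<?>> field : fields.entrySet()) {
            joined.add("{\"name\":\"" + field.getKey() + "\",\"type\":\""
                    + avroTypeOf(field.getValue()) + "\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + recordName + "\",\"fields\":" + joined + "}";
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, unlike Map.of.
        Map<String, Class<?>> fieldMap = new LinkedHashMap<>();
        fieldMap.put("InvoiceNo", String.class);
        fieldMap.put("Quantity", Integer.class);
        System.out.println(toSchemaJson("Invoice", fieldMap));
    }
}
```

Field order matters for Avro because record fields are serialized positionally, so a deterministic order makes schemas easier to compare and reuse across runs.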
- Navigate to the project directory.
- Run the tests with Maven:

      mvn test
This project is licensed under the Apache License 2.0.
Contributions are welcome! Feel free to submit issues and pull requests to improve the project.
- Apache Avro for schema generation.
- Apache Parquet for efficient storage.
- Apache Commons CSV for CSV parsing.
Enjoy using Csv2Parquet! 🌟 Feel free to reach out if you have any questions or feedback.