Csv2Parquet

🚀 Csv2Parquet is a Java library that converts CSV files to Parquet, generates Avro schemas dynamically, and analyzes Parquet files. It is designed with performance and scalability in mind, for workloads that process large datasets. The library was first developed for SciCrop's InfiniteStack and is now open source as a SciCrop Academy initiative.

🌟 Features

- CSV to Parquet Conversion: Convert CSV files to optimized and compressed Parquet files effortlessly.
- Dynamic Schema Inference: Automatically infer Avro schemas from CSV headers and sample data.
- Parquet File Analysis: Analyze Parquet files, including record counts and column statistics.
- Schema Generator: Create Avro schemas programmatically from Java Map objects.
- JUnit 5 Test Suite: Comprehensive unit tests to validate the library's functionality.
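
To make "dynamic schema inference" concrete, here is a minimal sketch of how column types could be guessed from a header row and a few sample rows. It is not the library's implementation: the class `SchemaInferenceSketch`, its methods, and the sample values are hypothetical, and the rule used (narrowest of Integer, Long, Double, else String) is an assumption. It uses only the JDK and produces the same `Map<String, Class<?>>` shape that the schema generator accepts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Csv2Parquet.
class SchemaInferenceSketch {

    static boolean isInt(String s) {
        try { Integer.parseInt(s); return true; } catch (NumberFormatException e) { return false; }
    }

    static boolean isLong(String s) {
        try { Long.parseLong(s); return true; } catch (NumberFormatException e) { return false; }
    }

    static boolean isDouble(String s) {
        // Note: Double.parseDouble also accepts literals such as "NaN" and "1e3".
        try { Double.parseDouble(s); return true; } catch (NumberFormatException e) { return false; }
    }

    /** Returns the narrowest of Integer, Long, Double, String that fits every non-blank sample. */
    static Class<?> inferType(List<String> samples) {
        boolean allInt = true, allLong = true, allDouble = true, sawValue = false;
        for (String raw : samples) {
            if (raw == null || raw.isBlank()) continue; // blank cells carry no type information
            String v = raw.trim();
            sawValue = true;
            allInt = allInt && isInt(v);
            allLong = allLong && isLong(v);
            allDouble = allDouble && isDouble(v);
        }
        if (!sawValue) return String.class; // nothing to go on: fall back to text
        if (allInt) return Integer.class;
        if (allLong) return Long.class;
        if (allDouble) return Double.class;
        return String.class;
    }

    /** Maps each header name to an inferred type, preserving column order. */
    static Map<String, Class<?>> inferSchema(String[] header, List<String[]> sampleRows) {
        Map<String, Class<?>> schema = new LinkedHashMap<>();
        for (int col = 0; col < header.length; col++) {
            List<String> samples = new ArrayList<>();
            for (String[] row : sampleRows) {
                if (col < row.length) samples.add(row[col]); // tolerate short rows
            }
            schema.put(header[col], inferType(samples));
        }
        return schema;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        String[] header = {"InvoiceNo", "Quantity", "UnitPrice"};
        List<String[]> rows = List.of(
            new String[] {"536365", "6", "2.55"},
            new String[] {"C536379", "8", "3.39"});
        inferSchema(header, rows)
            .forEach((name, type) -> System.out.println(name + " -> " + type.getSimpleName()));
    }
}
```

A single non-numeric value in a column is enough to make it a String here, so the quality of the guess depends on how representative the sample rows are.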

📂 Project Structure

- GenericCSVToParquet: Handles CSV to Parquet conversion with dynamic schema inference.
- ParquetFileAnalyzer: Provides tools to analyze Parquet files, including record inspection and statistics generation.
- SchemaGenerator: Dynamically generates Avro schemas based on field definitions.
- Csv2ParquetTest: JUnit 5 test class to validate the main functionalities of the library.
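
As a rough illustration of the kind of per-column statistics an analyzer can report (record count, null count, minimum, maximum), the sketch below summarizes one column held in memory. It is an assumption about what "statistics generation" might cover, not ParquetFileAnalyzer's code; `ColumnStatsSketch`, `Stats`, and `summarize` are hypothetical names, and no Parquet file is read.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Csv2Parquet.
class ColumnStatsSketch {

    /** Summary of one column: total rows, null rows, smallest and largest non-null value. */
    record Stats<T>(long count, long nulls, T min, T max) {}

    static <T extends Comparable<T>> Stats<T> summarize(List<T> column) {
        long nulls = 0;
        T min = null;
        T max = null;
        for (T value : column) {
            if (value == null) { nulls++; continue; } // nulls are counted but excluded from min/max
            if (min == null || value.compareTo(min) < 0) min = value;
            if (max == null || value.compareTo(max) > 0) max = value;
        }
        return new Stats<>(column.size(), nulls, min, max);
    }

    public static void main(String[] args) {
        // Arrays.asList is used because List.of rejects null elements.
        List<Integer> quantity = Arrays.asList(6, 8, null, 2);
        System.out.println(summarize(quantity));
    }
}
```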

🔧 Prerequisites

- Java 17 or higher
- Maven 3.6+

📦 Dependencies

The following dependencies are required and included in the pom.xml:

```
<dependencies>
    <!-- Apache Parquet -->
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-avro</artifactId>
        <version>1.12.3</version>
    </dependency>

    <!-- Apache Commons CSV -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-csv</artifactId>
        <version>1.10.0</version>
    </dependency>

    <!-- Apache Avro -->
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>1.11.2</version>
    </dependency>

    <!-- Apache Hadoop -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.6</version>
    </dependency>

    <!-- JUnit 5 -->
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter</artifactId>
        <version>5.10.0</version>
        <scope>test</scope>
    </dependency>
</dependencies>
```

🚀 How to Use

Running via Main Method

Convert CSV to Parquet
- Use GenericCSVToParquet to process a CSV file:
```
java com.scicrop.infinitestack.GenericCSVToParquet /path/to/input.csv /path/to/output.parquet ,
```
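  The final argument is the CSV delimiter. Replace , with your own delimiter if it differs, and quote it if your shell treats the character specially (for example ';').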
Analyze Parquet Files
- Use ParquetFileAnalyzer to inspect and analyze Parquet files:
```
java com.scicrop.infinitestack.ParquetFileAnalyzer /path/to/output.parquet
```
Generate Avro Schema
- Use SchemaGenerator to create Avro schemas dynamically:
```
java com.scicrop.infinitestack.SchemaGenerator
```

Integrating into Your Project

Add this repository as a dependency or clone it locally.

Use the classes programmatically in your application:

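```
import java.util.Map;
import org.apache.avro.Schema;
import com.scicrop.infinitestack.GenericCSVToParquet;
import com.scicrop.infinitestack.ParquetFileAnalyzer;
import com.scicrop.infinitestack.SchemaGenerator;

// Convert CSV to Parquet (the last argument is the delimiter character)
GenericCSVToParquet.process("/tmp/input.csv", "/tmp/output.parquet", ',');

// Analyze a Parquet file
ParquetFileAnalyzer analyzer = new ParquetFileAnalyzer("/tmp/output.parquet");
analyzer.print();

// Generate an Avro schema from field names and Java types.
// Map.of does not guarantee iteration order; use a LinkedHashMap if field order matters.
Map<String, Class<?>> fieldMap = Map.of(
    "InvoiceNo", String.class,
    "Quantity", Integer.class
);
Schema schema = SchemaGenerator.getSchemaByMap(fieldMap);
System.out.println(schema.toString(true)); // true = pretty-printed JSON
```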
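
For readers unfamiliar with Avro, the sketch below shows roughly what a record schema built from such a field map looks like as JSON. It is a JDK-only illustration, not how SchemaGenerator works internally: the class `AvroSchemaJsonSketch`, the record name `Invoice`, and the Java-to-Avro type mapping are assumptions. The Avro primitive type names themselves (`string`, `int`, `long`, `float`, `double`, `boolean`) come from the Avro specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only; not part of Csv2Parquet.
class AvroSchemaJsonSketch {

    /** Assumed mapping from boxed Java types to Avro primitive type names. */
    static String avroType(Class<?> type) {
        if (type == String.class) return "string";
        if (type == Integer.class) return "int";
        if (type == Long.class) return "long";
        if (type == Float.class) return "float";
        if (type == Double.class) return "double";
        if (type == Boolean.class) return "boolean";
        throw new IllegalArgumentException("No Avro mapping assumed for " + type.getName());
    }

    /** Avro names must start with a letter or underscore, followed by letters, digits, or underscores. */
    static String checkName(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Invalid Avro name: " + name);
        }
        return name;
    }

    /** Builds the JSON text of an Avro record schema with one field per map entry. */
    static String toAvroJson(String recordName, Map<String, Class<?>> fields) {
        StringJoiner fieldJson = new StringJoiner(",");
        for (Map.Entry<String, Class<?>> field : fields.entrySet()) {
            fieldJson.add("{\"name\":\"" + checkName(field.getKey())
                + "\",\"type\":\"" + avroType(field.getValue()) + "\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + checkName(recordName)
            + "\",\"fields\":[" + fieldJson + "]}";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, Class<?>> fieldMap = new LinkedHashMap<>();
        fieldMap.put("InvoiceNo", String.class);
        fieldMap.put("Quantity", Integer.class);
        System.out.println(toAvroJson("Invoice", fieldMap));
    }
}
```

In real code, building the schema through the Avro library is preferable to assembling JSON by hand, since the library validates names and types for you.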

🧪 Running Tests

Navigate to the project directory.
Run the tests with Maven:
```
mvn test
```

📝 License

This project is licensed under the Apache License 2.0.

❤️ Contributing

Contributions are welcome! Feel free to submit issues and pull requests to improve the project.

🌟 Acknowledgements

- Apache Avro for schema generation.
- Apache Parquet for efficient storage.
- Apache Commons CSV for CSV parsing.

Enjoy using Csv2Parquet! 🌟 Feel free to reach out if you have any questions or feedback.
