- Overview
- Architecture Diagram
- Data Source
- ETL Workflow
- Setup and Getting Started
- Components
- Prerequisites
- Running the Pipeline
- Data Visualization with Tableau Desktop
- Contribution
## Overview

This ETL pipeline processes and analyzes electric vehicle population data for Washington State. It is designed to extract raw data from an S3 bucket, transform it using Apache Spark on an EMR cluster, and load the cleaned data into Snowflake for analysis.
## Architecture Diagram

The pipeline architecture involves:
- Amazon EC2: Hosts the Airflow environment, set up using Docker and Astronomer.
- Amazon S3: Stores both raw and transformed data.
- Amazon EMR: Executes the Spark scripts for data transformation.
- Snowflake: Stores the final cleaned and structured data.
- Apache Airflow: Manages and schedules the ETL workflow (a minimal DAG sketch follows this list).
- Tableau: Visualizes the final data for analysis.
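The DAG sketched below is one way these components could be wired together using the Amazon provider's EMR operators. The DAG id, cluster configuration, S3 script path, and task names are hypothetical placeholders rather than the exact definitions used in this project.

```python
# Minimal orchestration sketch (assumes apache-airflow-providers-amazon is installed).
# All names -- DAG id, bucket, script path, cluster config -- are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {
    "Name": "ev-population-transform",          # hypothetical cluster name
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

SPARK_STEPS = [
    {
        "Name": "Transform raw EV data",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://<your-bucket>/scripts/ev_transform.py",  # hypothetical script location
            ],
        },
    }
]

with DAG(
    dag_id="ev_population_etl",    # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,                 # triggered manually
    catchup=False,
) as dag:
    # Spin up a transient EMR cluster for the Spark transformation.
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )

    # Submit the Spark script stored in S3 as an EMR step.
    add_spark_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
    )

    # Wait for the step to finish before tearing the cluster down.
    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step')[0] }}",
    )

    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
    )

    create_cluster >> add_spark_step >> wait_for_step >> terminate_cluster
```

Running the Spark job on a transient cluster keeps EMR costs bounded: the cluster exists only for the duration of the transformation step and is terminated by the final task.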
## Data Source

The raw data for this project is sourced from the Electric Vehicle Population Data available on Data.gov. This dataset provides a comprehensive record of electric vehicles registered in Washington State, including attributes such as vehicle make, model, and model year.
This data is critical for understanding trends in electric vehicle adoption, infrastructure needs, and environmental impacts. The dataset is provided in CSV format and is stored in an Amazon S3 bucket for processing within the ETL pipeline.
## ETL Workflow

- Data Extraction: Raw data is extracted from the S3 bucket.
- Data Transformation: A Spark script cleans, standardizes, and transforms the raw CSV data into a set of fact and dimension tables, which are written out in Parquet format for efficient storage and processing (a PySpark sketch follows this list).
- Data Loading: The transformed data is loaded into Snowflake.
- Data Visualization: A Tableau dashboard is created to visualize the data for analysis.
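As a rough illustration of the transformation step, the PySpark sketch below reads the raw CSV, normalizes it, and writes one dimension table and one fact table as Parquet. The bucket paths and column names (make, model, model_year) are assumptions based on the public dataset and may differ from the actual script.

```python
# Sketch of the Spark transformation, assuming it runs as an EMR step.
# Bucket paths and column names are illustrative; adjust to the actual dataset schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ev_population_transform").getOrCreate()

RAW_PATH = "s3://<your-bucket>/raw/electric_vehicle_population.csv"  # hypothetical
OUTPUT_PATH = "s3://<your-bucket>/transformed/"                      # hypothetical

# Extract: read the raw CSV with a header row and inferred types.
raw_df = spark.read.csv(RAW_PATH, header=True, inferSchema=True)

# Clean and standardize: snake_case column names, drop duplicates and empty rows.
clean_df = raw_df.toDF(*[c.strip().lower().replace(" ", "_") for c in raw_df.columns])
clean_df = clean_df.dropDuplicates().na.drop(how="all")

# Dimension table: one row per distinct make/model/model_year combination,
# with a surrogate key (column names assume the public dataset's schema).
dim_vehicle = (
    clean_df.select("make", "model", "model_year")
    .dropDuplicates()
    .withColumn("vehicle_key", F.monotonically_increasing_id())
)

# Fact table: registrations joined back to the vehicle dimension key.
fact_registrations = clean_df.join(
    dim_vehicle, on=["make", "model", "model_year"], how="left"
)

# Write the load-ready tables as Parquet for efficient downstream loading.
dim_vehicle.write.mode("overwrite").parquet(OUTPUT_PATH + "dim_vehicle/")
fact_registrations.write.mode("overwrite").parquet(OUTPUT_PATH + "fact_registrations/")

spark.stop()
```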
## Setup and Getting Started

To run this project, follow the steps below:
- Install Required Tools: Ensure you have Visual Studio Code, Docker Desktop, and the Astro CLI installed.
- Set Up the Environment: Create an EC2 instance on AWS, SSH into it, and set up the Airflow environment using Docker and Astronomer.
- Prepare Data Storage: Create an S3 bucket with the necessary directory structure for storing raw and transformed data (a possible layout is sketched below).
- Initialize Airflow: Use Astro CLI to initialize and start the Airflow environment.
- Configure Snowflake: Set up Snowflake for data loading and create the necessary tables.
- Run the ETL Pipeline: Trigger the Airflow DAG to execute the ETL process, transforming and loading the data.
- Visualize the Data: Connect Tableau to Snowflake and create a dashboard to analyze the data.
For detailed setup and execution instructions, refer to the how_to_run.docx document.
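For step 3, the bucket layout can be stubbed out ahead of time. The boto3 sketch below shows one possible structure; the bucket name, prefixes, and region are hypothetical and should be adapted to whatever how_to_run.docx specifies.

```python
# One possible S3 layout for the pipeline, created with boto3.
# The bucket name, prefixes, and region are hypothetical placeholders.
import boto3

BUCKET = "ev-population-etl-demo"   # hypothetical bucket name (must be globally unique)
PREFIXES = [
    "raw/",            # source CSV uploaded here
    "scripts/",        # Spark transformation script for EMR
    "transformed/",    # Parquet output written by the Spark job
    "logs/",           # EMR cluster logs
]

s3 = boto3.client("s3", region_name="us-east-1")   # hypothetical region
s3.create_bucket(Bucket=BUCKET)

# S3 has no real directories; empty objects ending in "/" act as folder markers.
for prefix in PREFIXES:
    s3.put_object(Bucket=BUCKET, Key=prefix)
```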
## Components

- Airflow DAG: Manages the overall workflow, including the creation and termination of EMR clusters, execution of Spark jobs, and loading data into Snowflake.
- Spark Script: Transforms the raw data, including data cleaning, standardization, and structuring it into fact and dimension tables in Parquet format.
- Snowflake SQL: Executes SQL commands to load the transformed data into Snowflake (a loading sketch follows this list).
- Tableau Dashboard: Visualizes the cleaned data stored in Snowflake.
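As a sketch of the Snowflake-loading component, the snippet below copies the Parquet output from an external S3 stage into a target table using snowflake-connector-python. The account details, stage name (ev_s3_stage), and table name are hypothetical placeholders, and the stage is assumed to already point at the transformed-data prefix in S3.

```python
# Sketch: load the Parquet output into Snowflake with snowflake-connector-python.
# All identifiers (account, database, stage, table) are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<your_account>",
    user="<your_user>",
    password="<your_password>",
    warehouse="COMPUTE_WH",
    database="EV_DB",
    schema="PUBLIC",
)

# COPY the dimension table from an external stage that points at the S3 bucket.
LOAD_SQL = """
COPY INTO dim_vehicle
FROM @ev_s3_stage/transformed/dim_vehicle/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""

with conn.cursor() as cur:
    cur.execute(LOAD_SQL)
conn.close()
```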
## Prerequisites

- Visual Studio Code: For SSH access and project management.
- Astro CLI: To manage the Airflow environment within Docker.
- Docker Desktop: To containerize the Airflow environment.
- AWS Account: Required for EC2, S3, and EMR services.
- Snowflake Account: For storing and querying the final transformed data.
- Tableau: For visualizing the final data stored in Snowflake.
## Running the Pipeline

- Environment Setup: Set up the environment as described in how_to_run.docx, including creating an EC2 instance and configuring IAM roles.
- Airflow Initialization: Initialize the Airflow environment using Astro CLI and Docker.
- Data Storage: Prepare the S3 bucket structure for storing raw and transformed data.
- Data Transformation: Set up the Spark transformation scripts to clean and convert the CSV data into fact and dimension tables in Parquet format, and run them on an EMR cluster.
- Data Loading: Configure Snowflake for data loading and execute the SQL scripts to store the Parquet data.
- Data Visualization: Create a Tableau dashboard to visualize the data stored in Snowflake.
- DAG Execution: Start the Airflow components and trigger the DAG to execute the ETL process, as sketched below.
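Besides the Airflow UI, the DAG can be triggered programmatically through Airflow's stable REST API, as in the minimal sketch below. The host, credentials, and DAG id are assumptions for a local Astro dev environment, and the API must be reachable with basic auth in your deployment.

```python
# Sketch: trigger the ETL DAG through Airflow's stable REST API.
# Host, credentials, and DAG id are hypothetical; basic-auth API access is assumed.
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"   # local Astro dev environment (assumed)
DAG_ID = "ev_population_etl"                   # hypothetical DAG id

response = requests.post(
    f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),                   # default local Astro credentials (verify yours)
    json={"conf": {}},
)
response.raise_for_status()
print("Triggered run:", response.json()["dag_run_id"])
```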
## Data Visualization with Tableau Desktop

Once the data is loaded into Snowflake:
- Install the Snowflake ODBC Driver: Required for Tableau Desktop to connect to Snowflake.
- Create an Extract Connection: Connect Tableau Desktop to Snowflake.
- Data Modeling in Tableau: Create a star schema and define relationships between the cleaned data tables.
- Create Visualizations: Use Tableau Desktop to process and visualize the data.
Check out the Tableau data visualization at Electric vehicle stats in Washington State | Tableau Public.
Check out the how_to_run.docx file for detailed steps to run this project.
## Contribution

Contributions are welcome! Please open an issue or submit a pull request with any improvements.