- Overview
- Architecture Diagram
- Data Source
- ETL Workflow
- Setup and Getting Started
- Components
- Prerequisites
- Running the Pipeline
- Data Visualization with Tableau Desktop
- Contribution
## Overview

This ETL pipeline processes and analyzes electric vehicle population data for Washington State. It is designed to extract raw data from an S3 bucket, transform it using Apache Spark on an EMR cluster, and load the cleaned data into Snowflake for analysis.
## Architecture Diagram

The pipeline architecture involves:
- Amazon EC2: Hosts the Airflow environment, set up using Docker and Astronomer.
- Amazon S3: Stores both raw and transformed data.
- Amazon EMR: Executes the Spark scripts for data transformation.
- Snowflake: Stores the final cleaned and structured data.
- Apache Airflow: Manages and schedules the ETL workflow (a minimal DAG sketch follows this list).
- Tableau: Visualizes the final data for analysis.
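The DAG sketched below is one way these components could be wired together using the Amazon provider's EMR operators. The DAG id, cluster configuration, S3 script path, and task names are hypothetical placeholders rather than the exact definitions used in this project.

```python
# Minimal orchestration sketch (assumes apache-airflow-providers-amazon is installed).
# All names -- DAG id, bucket, script path, cluster config -- are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {
    "Name": "ev-population-transform",          # hypothetical cluster name
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

SPARK_STEPS = [
    {
        "Name": "Transform raw EV data",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://<your-bucket>/scripts/ev_transform.py",  # hypothetical script location
            ],
        },
    }
]

with DAG(
    dag_id="ev_population_etl",    # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,                 # triggered manually
    catchup=False,
) as dag:
    # Spin up a transient EMR cluster for the Spark transformation.
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )

    # Submit the Spark script stored in S3 as an EMR step.
    add_spark_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
    )

    # Wait for the step to finish before tearing the cluster down.
    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step')[0] }}",
    )

    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
    )

    create_cluster >> add_spark_step >> wait_for_step >> terminate_cluster
```

Running the Spark job on a transient cluster keeps EMR costs bounded: the cluster exists only for the duration of the transformation step and is terminated by the final task.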
## Data Source

The raw data for this project is sourced from the Electric Vehicle Population Data available on Data.gov. This dataset provides a comprehensive record of electric vehicles registered in Washington State, including attributes such as vehicle make, model, and model year.
This data is critical for understanding trends in electric vehicle adoption, infrastructure needs, and environmental impacts. The dataset is provided in CSV format and is stored in an Amazon S3 bucket for processing within the ETL pipeline.
## ETL Workflow

- Data Extraction: Raw data is extracted from the S3 bucket.
- Data Transformation: A Spark script cleans, standardizes, and transforms the raw CSV data into a set of fact and dimension tables, which are written out in Parquet format for efficient storage and processing (a PySpark sketch follows this list).
- Data Loading: The transformed data is loaded into Snowflake.
- Data Visualization: A Tableau dashboard is created to visualize the data for analysis.
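As a rough illustration of the transformation step, the PySpark sketch below reads the raw CSV, normalizes it, and writes one dimension table and one fact table as Parquet. The bucket paths and column names (make, model, model_year) are assumptions based on the public dataset and may differ from the actual script.

```python
# Sketch of the Spark transformation, assuming it runs as an EMR step.
# Bucket paths and column names are illustrative; adjust to the actual dataset schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ev_population_transform").getOrCreate()

RAW_PATH = "s3://<your-bucket>/raw/electric_vehicle_population.csv"  # hypothetical
OUTPUT_PATH = "s3://<your-bucket>/transformed/"                      # hypothetical

# Extract: read the raw CSV with a header row and inferred types.
raw_df = spark.read.csv(RAW_PATH, header=True, inferSchema=True)

# Clean and standardize: snake_case column names, drop duplicates and empty rows.
clean_df = raw_df.toDF(*[c.strip().lower().replace(" ", "_") for c in raw_df.columns])
clean_df = clean_df.dropDuplicates().na.drop(how="all")

# Dimension table: one row per distinct make/model/model_year combination,
# with a surrogate key (column names assume the public dataset's schema).
dim_vehicle = (
    clean_df.select("make", "model", "model_year")
    .dropDuplicates()
    .withColumn("vehicle_key", F.monotonically_increasing_id())
)

# Fact table: registrations joined back to the vehicle dimension key.
fact_registrations = clean_df.join(
    dim_vehicle, on=["make", "model", "model_year"], how="left"
)

# Write the load-ready tables as Parquet for efficient downstream loading.
dim_vehicle.write.mode("overwrite").parquet(OUTPUT_PATH + "dim_vehicle/")
fact_registrations.write.mode("overwrite").parquet(OUTPUT_PATH + "fact_registrations/")

spark.stop()
```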
## Setup and Getting Started

To run this project, follow the steps below:
- Install Required Tools: Ensure you have Visual Studio Code, Docker Desktop, and the Astro CLI installed.
- Set Up the Environment: Create an EC2 instance on AWS, SSH into it, and set up the Airflow environment using Docker and Astronomer.
- Prepare Data Storage: Create an S3 bucket with the necessary directory structure for storing raw and transformed data (a possible layout is sketched below).
- Initialize Airflow: Use Astro CLI to initialize and start the Airflow environment.
- Configure Snowflake: Set up Snowflake for data loading and create the necessary tables.
- Run the ETL Pipeline: Trigger the Airflow DAG to execute the ETL process, transforming and loading the data.
- Visualize the Data: Connect Tableau to Snowflake and create a dashboard to analyze the data.
For detailed setup and execution instructions, refer to the how_to_run.docx document.
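For step 3, the bucket layout can be stubbed out ahead of time. The boto3 sketch below shows one possible structure; the bucket name, prefixes, and region are hypothetical and should be adapted to whatever how_to_run.docx specifies.

```python
# One possible S3 layout for the pipeline, created with boto3.
# The bucket name, prefixes, and region are hypothetical placeholders.
import boto3

BUCKET = "ev-population-etl-demo"   # hypothetical bucket name (must be globally unique)
PREFIXES = [
    "raw/",            # source CSV uploaded here
    "scripts/",        # Spark transformation script for EMR
    "transformed/",    # Parquet output written by the Spark job
    "logs/",           # EMR cluster logs
]

s3 = boto3.client("s3", region_name="us-east-1")   # hypothetical region
s3.create_bucket(Bucket=BUCKET)

# S3 has no real directories; empty objects ending in "/" act as folder markers.
for prefix in PREFIXES:
    s3.put_object(Bucket=BUCKET, Key=prefix)
```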
## Components

- Airflow DAG: Manages the overall workflow, including the creation and termination of EMR clusters, execution of Spark jobs, and loading data into Snowflake.
- Spark Script: Transforms the raw data, including data cleaning, standardization, and structuring it into fact and dimension tables in Parquet format.
- Snowflake SQL: Executes SQL commands to load the transformed data into Snowflake (a loading sketch follows this list).
- Tableau Dashboard: Visualizes the cleaned data stored in Snowflake.
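As a sketch of the Snowflake-loading component, the snippet below copies the Parquet output from an external S3 stage into a target table using snowflake-connector-python. The account details, stage name (ev_s3_stage), and table name are hypothetical placeholders, and the stage is assumed to already point at the transformed-data prefix in S3.

```python
# Sketch: load the Parquet output into Snowflake with snowflake-connector-python.
# All identifiers (account, database, stage, table) are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<your_account>",
    user="<your_user>",
    password="<your_password>",
    warehouse="COMPUTE_WH",
    database="EV_DB",
    schema="PUBLIC",
)

# COPY the dimension table from an external stage that points at the S3 bucket.
LOAD_SQL = """
COPY INTO dim_vehicle
FROM @ev_s3_stage/transformed/dim_vehicle/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""

with conn.cursor() as cur:
    cur.execute(LOAD_SQL)
conn.close()
```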
## Prerequisites

- Visual Studio Code: For SSH access and project management.
- Astro CLI: To manage the Airflow environment within Docker.
- Docker Desktop: To containerize the Airflow environment.
- AWS Account: Required for EC2, S3, and EMR services.
- Snowflake Account: For storing and querying the final transformed data.
- Tableau: For visualizing the final data stored in Snowflake.
## Running the Pipeline

- Environment Setup: Set up the environment as described in how_to_run.docx, including creating an EC2 instance and configuring IAM roles.
- Airflow Initialization: Initialize the Airflow environment using Astro CLI and Docker.
- Data Storage: Prepare the S3 bucket structure for storing raw and transformed data.
- Data Transformation: Set up the Spark transformation scripts to clean and convert the CSV data into fact and dimension tables in Parquet format, and run them on an EMR cluster.
- Data Loading: Configure Snowflake for data loading and execute the SQL scripts to store the Parquet data.
- Data Visualization: Create a Tableau dashboard to visualize the data stored in Snowflake.
- DAG Execution: Start the Airflow components and trigger the DAG to execute the ETL process, as sketched below.
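Besides the Airflow UI, the DAG can be triggered programmatically through Airflow's stable REST API, as in the minimal sketch below. The host, credentials, and DAG id are assumptions for a local Astro dev environment, and the API must be reachable with basic auth in your deployment.

```python
# Sketch: trigger the ETL DAG through Airflow's stable REST API.
# Host, credentials, and DAG id are hypothetical; basic-auth API access is assumed.
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"   # local Astro dev environment (assumed)
DAG_ID = "ev_population_etl"                   # hypothetical DAG id

response = requests.post(
    f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),                   # default local Astro credentials (verify yours)
    json={"conf": {}},
)
response.raise_for_status()
print("Triggered run:", response.json()["dag_run_id"])
```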
## Data Visualization with Tableau Desktop

Once the data is loaded into Snowflake:
- Install the Snowflake ODBC Driver: Required for Tableau Desktop to connect to Snowflake.
- Create an Extract Connection: Connect Tableau Desktop to Snowflake.
- Data Modeling in Tableau: Create a star schema and define relationships between the cleaned data tables.
- Create Visualizations: Use Tableau Desktop to process and visualize the data.
Check out the Tableau data visualization at Electric vehicle stats in Washington State | Tableau Public.
Check out the how_to_run.docx file for detailed steps to run this project.
## Contribution

Contributions are welcome! Please open an issue or submit a pull request with any improvements.