Reddit ETL Data Engineering Pipeline

Reddit ETL Data Engineering Pipeline is an automated workflow that extracts, transforms, and loads Reddit data from the "dataengineering" subreddit. It is built on Apache Airflow, which handles task scheduling and orchestration. The pipeline runs the following steps:

1. Reddit Data Extraction: A Python script pulls posts from Reddit, using the "day" time filter and capping the extraction at 300 posts. The extracted data is saved locally in a specified file format.
2. Upload to AWS S3: The saved file is uploaded to an Amazon S3 bucket for storage and further processing.
3. Load to Redshift: Once the file is in S3, the S3ToRedshiftOperator copies the data into an Amazon Redshift table, where it is available for analytics and reporting.
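
The data hand-off between the three steps can be sketched in plain Python. This is a minimal illustration, not the repository's code: the field list, the CSV format, the S3 key layout, the "public" schema, and all function names are assumptions made for the example. In a real run the `posts` iterable would come from a Reddit API client, and the returned dictionary would be passed as keyword arguments to S3ToRedshiftOperator.

```python
import csv
import io
from datetime import datetime, timezone
from itertools import islice

# Hypothetical selection of post attributes; the real script may keep others.
POST_FIELDS = ("id", "title", "score", "num_comments", "created_utc")


def extract_rows(posts, limit=300):
    """Keep at most `limit` posts and only the fields in POST_FIELDS."""
    rows = []
    for post in islice(posts, limit):
        row = {field: post[field] for field in POST_FIELDS}
        # Reddit reports creation time as a Unix timestamp; store it as ISO 8601 UTC.
        row["created_utc"] = datetime.fromtimestamp(
            post["created_utc"], tz=timezone.utc
        ).isoformat()
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text (CSV is an assumed file format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=POST_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def s3_key_for(run_date, prefix="raw"):
    """Build a date-stamped object key (hypothetical naming scheme)."""
    return f"{prefix}/reddit_{run_date:%Y%m%d}.csv"


def redshift_copy_args(bucket, key, table, schema="public"):
    """Collect keyword arguments in the shape S3ToRedshiftOperator accepts."""
    return {
        "s3_bucket": bucket,
        "s3_key": key,
        "schema": schema,
        "table": table,
        # Redshift COPY options for a CSV file with a header row.
        "copy_options": ["CSV", "IGNOREHEADER 1"],
    }
```

Keeping each step as a small function with plain inputs and outputs makes it easy to wrap them in Airflow tasks later and to test them without network or AWS access.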

The result is a pipeline that moves Reddit data from extraction through S3 storage into Redshift, where it is available for further analysis in a scalable, cloud-based environment.

Folders and files:

.astro
dags
include
tests/dags
.dockerignore
.gitignore
Dockerfile
README.md
packages.txt
requirements.txt

Undisputed-jay/Reddit_ETL_pipeline-using-Apache_Airflow
