
Open Brewery Datalake

The objective of this project is to build a data pipeline that ingests brewery data from the Open Brewery DB API, processes it, and stores it in the layers of a data lake following the Medallion Architecture. The stack includes Airflow for workflow orchestration, Spark for data processing, Docker and Docker Compose for containerization and orchestration, and Azure for cloud storage.

About the Data

The Open Brewery DB provides a public API with information on breweries, cideries, brewpubs, and bottle shops worldwide. This project uses only the /breweries endpoint, so brewery records are the only data it processes.

Sample Data

Here is an example of the data returned by the /breweries endpoint:

{
    "id": "34e8c68b-6146-453f-a4b9-1f6cd99a5ada",
    "name": "1 of Us Brewing Company",
    "brewery_type": "micro",
    "address_1": "8100 Washington Ave",
    "address_2": null,
    "address_3": null,
    "city": "Mount Pleasant",
    "state_province": "Wisconsin",
    "postal_code": "53406-3920",
    "country": "United States",
    "longitude": "-87.88336350209435",
    "latitude": "42.72010826899558",
    "phone": "262-484-7553",
    "website_url": "https://www.1ofusbrewing.com",
    "state": "Wisconsin",
    "street": "8100 Washington Ave"
}
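
For illustration only, here is a minimal Python sketch of paging through the /breweries endpoint with the requests library. The per_page and page query parameters come from the Open Brewery DB documentation; everything else (function name, page size) is an assumption, not code from this repository.

import requests

BASE_URL = "https://api.openbrewerydb.org/v1/breweries"

def fetch_breweries(per_page=200):
    """Page through /breweries until an empty page is returned."""
    breweries, page = [], 1
    while True:
        response = requests.get(
            BASE_URL,
            # check the API docs for the current maximum page size
            params={"per_page": per_page, "page": page},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break
        breweries.extend(batch)
        page += 1
    return breweries

if __name__ == "__main__":
    print(f"Fetched {len(fetch_breweries())} breweries")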

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

The things you need before installing the project.

Azure Setup

To store the data, we chose Azure for a cloud-based data lake. The only Azure resource used in this project is Azure Data Lake Storage, configured within an Azure Storage Account. To set it up, follow these steps:

  1. Create an Azure Storage Account:

    • Sign in to the Azure Portal.
    • Navigate to "Storage accounts" and click "Create".
    • Choose the appropriate subscription and resource group. If you don't have a resource group, you can create a new one at this step.
    • Provide a unique name for the storage account; this will be used to connect via Airflow.
    • Select the "Region" where you want your data to be stored.
    • Choose "Standard" for "Performance" type.
    • Click "Next" and enable anonymous access on individual containers and the hierarchical namespace (the hierarchical namespace is what makes the account a Data Lake Storage Gen2 account).
    • Choose the "Hot" access tier, as the data will be accessed frequently.
    • Click "Review + create" and then "Create" to provision the storage account.
  2. Configure Azure Data Lake Storage:

    • Once the storage account is created and deployed, navigate to it in the Azure Portal.
    • Go to the "Containers" section and click "Add container" to create three containers for your data lake: bronze, silver, and gold (use these exact names, as the pipeline expects them).
  3. Set Up Access Controls:

    • Configure access permissions for your storage account and containers using "Access control (IAM)", assigning a role such as "Storage Account Contributor" to your user.
    • Go to "Security + networking" and then "Access keys"; key1 or key2 will also be used by Airflow to access the storage account (a sketch of how these credentials are used from Spark follows after these steps).
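
The steps above are all the pipeline needs on the Azure side. For context only, here is a minimal PySpark sketch, not the repository's actual job, showing how a storage account name, an access key, and the bronze container are typically combined through the abfss:// scheme; the account name, key, and output path are placeholders.

from pyspark.sql import SparkSession

# Placeholders; in this project the real values come from the .env file.
STORAGE_ACCOUNT = "<your-storage-account>"
STORAGE_KEY = "<key1-or-key2>"

spark = SparkSession.builder.appName("adls-write-sketch").getOrCreate()

# Authenticate to Azure Data Lake Storage Gen2 with the account key
# (requires the hadoop-azure JARs to be available on the cluster).
spark.conf.set(
    f"fs.azure.account.key.{STORAGE_ACCOUNT}.dfs.core.windows.net",
    STORAGE_KEY,
)

# Write a tiny DataFrame to the bronze container; the path is illustrative.
df = spark.createDataFrame([("1 of Us Brewing Company",)], ["name"])
df.write.mode("overwrite").json(
    f"abfss://bronze@{STORAGE_ACCOUNT}.dfs.core.windows.net/breweries/"
)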

Installation

With the prerequisites in place and the Azure resources set up, follow these steps to configure the project on your machine. First, clone the repository:

$ git clone https://github.com/leonardodrigo/breweries-data-lake.git

Navigate to the project directory.

$ cd breweries-data-lake

Configure Environment Variables:

  • Locate the env.template file in the root directory of the project.
  • Rename the file to .env.
  • Open the .env file and set your Azure storage account name and key (key1 or key2), along the lines of the hypothetical example below.
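
For illustration, the .env will look something like the snippet below. The variable names here are hypothetical; use the ones actually defined in env.template.

# Hypothetical variable names; keep the ones from env.template.
AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account>
AZURE_STORAGE_ACCOUNT_KEY=<key1-or-key2>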

Now, build the Docker images defined in docker-compose.yaml (base images are pulled automatically). This process may take a few minutes on the first run.

$ docker compose build

Start the Airflow and Spark containers and configure the number of Spark workers for the cluster. The command below initializes 3 Spark workers, each with 1GB of memory and 1 CPU core. You can adjust the number of workers and their configuration based on your machine's resources.

$ docker compose up --scale spark-worker=3 -d

Airflow Connections

  1. Access Airflow Webserver:

    • Open your web browser and navigate to http://localhost:8080 (the webserver may take a few seconds to come up after the containers start).
    • Log in using the default username airflow and password airflow.
  2. Set Up Airflow Connections (a sketch of how a DAG consumes these connections follows after this list):

    • Click on "Admin" in the top menu, then select "Connections".
    • Click the "+" button to add a new connection.

    a. Open Brewery API Connection:

    • Choose "Connection Type" as "HTTP".
    • Set the "Host" to https://api.openbrewerydb.org/v1.
    • Set the "Connection Id" to open_brewery_db_api.
    • Click "Save" to store the connection.

    b. Spark Connection:

    • Click the "+" button to add another connection.
    • Choose "Connection Type" as "Spark".
    • Set the "Host" to spark://spark.
    • Set the "Port" to 7077.
    • Set "Deploy Mode" to "client".
    • Set "Spark Binary" to spark-submit.
    • Set the "Connection Id" to spark_default.
    • Click "Save" to store the connection.

Running DAG in Airflow

The brewery-pipeline DAG is turned off by default. Locate it on the Airflow home page and toggle the switch in the "Enabled" column to turn it on; it will start immediately and then run automatically on a daily schedule. The first Spark job runs may take a few minutes to complete because some dependencies and JAR files need to be installed in the cluster before the data processing starts.

Observation

In my experience, executing these containers on a Mac with Apple Silicon chips can be relatively slow. This is primarily due to the architectural differences between ARM-based processors and x86 architecture, which can lead to compatibility issues and performance overhead when running Docker containers. To enhance performance, consider deploying this project in a virtual machine (VM) or on a cloud-based platform where x86 architecture is natively supported. Alternatively, you can follow these Docker instructions to set up Rosetta.

Next Steps: Monitoring and Alerting Process

To ensure the reliability and efficiency of the Open Brewery Datalake pipeline, a monitoring and alerting process is essential. This will help identify data quality issues, pipeline failures, and other potential problems.

1. Data Quality Monitoring

We will implement a dedicated DAG called brewery-quality, which will depend on the main pipeline. This DAG will:

  • Run Spark Jobs: Submit a Spark job that includes several functions, each responsible for checking specific validation rules related to data completeness, consistency, accuracy, and integrity (one such check is sketched after this list).
  • Logging: Log validation results to a designated system (e.g., Azure Blob Storage) for historical tracking and analysis.
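
As a sketch of what one such validation function might look like (the column names, threshold, and paths are hypothetical, since this DAG is still a planned next step):

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def null_ratios(df: DataFrame, required_columns: list) -> dict:
    """Completeness check: return the fraction of null values per required column."""
    total = df.count()
    return {
        column: (df.filter(F.col(column).isNull()).count() / total if total else 1.0)
        for column in required_columns
    }

# Hypothetical usage against the silver layer:
# ratios = null_ratios(
#     spark.read.parquet("abfss://silver@<account>.dfs.core.windows.net/breweries"),
#     ["id", "name", "country"],
# )
# assert all(ratio < 0.05 for ratio in ratios.values()), f"Completeness check failed: {ratios}"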

2. Pipeline Failure Monitoring

Utilize Airflow’s built-in features to monitor DAG execution:

  • Alerts on Failure: Configure the EmailOperator to send alerts for task failures, retries, or timeouts.
  • Retries: Set retry policies for tasks to allow automatic re-execution in case of intermittent failures (see the sketch after this list).
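
For reference, these policies usually boil down to a few default_args on the DAG; the values and address below are placeholders, and e-mail alerts additionally require SMTP to be configured for Airflow.

from datetime import timedelta

# Placeholder values; tune retries and recipients to your needs.
default_args = {
    "retries": 2,                        # automatic re-execution on intermittent failures
    "retry_delay": timedelta(minutes=5),
    "email": ["data-team@example.com"],  # placeholder recipient
    "email_on_failure": True,
    "email_on_retry": True,
}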

3. Notification System

Implement a notification system to keep stakeholders informed:

  • Email Notifications: Use the EmailOperator for immediate alerts on validation failures and pipeline issues.
  • Slack Notifications: Integrate Slack via webhooks for real-time alerts, enabling quick responses from the team (a sketch follows below).
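
A minimal sketch of one way to wire the Slack alerts, using a plain incoming-webhook POST inside an Airflow on_failure_callback; the webhook URL is a placeholder and should be kept out of source control (the Slack provider package for Airflow is another option).

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder; store it as a secret

def notify_slack_failure(context):
    """Airflow on_failure_callback: post the failed task and DAG to a Slack channel."""
    ti = context["task_instance"]
    message = f":red_circle: Task `{ti.task_id}` in DAG `{ti.dag_id}` failed."
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

# Attach it per task via default_args, e.g. default_args={"on_failure_callback": notify_slack_failure}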

By adopting this monitoring and alerting process, we can improve the pipeline's reliability and data integrity, providing stakeholders with trustworthy data.

If you find this repository valuable, please consider giving it a star!

Cheers! 🍻
