Chat Moderation ML Project

(Image: Chat Moderation Technologies)

Table of Contents

  1. Introduction
  2. Problem Description
  3. Solution Approach
  4. Dataset
  5. Technologies Used
  6. Experiment Tracking and Model Registry
  7. Workflow Orchestration
  8. Model Deployment
  9. Model Monitoring
  10. Reproducibility

Introduction

In this project, we will develop a machine learning model for chat moderation. The primary goal is to classify messages as positive or negative, enabling a better user experience by filtering out negative chat content.

Problem Description

The rise of online platforms has led to an increase in user-generated content, often resulting in negative and harmful comments that can affect user experience. This project aims to provide a chat moderation solution by classifying messages into positive or negative categories. The primary goal is to ensure a healthy and friendly environment for users by automatically filtering and moderating content.

  • Why is this important?
    Online communities require moderation to maintain a safe environment, and automating this process can save time and reduce the risk of human error.

  • What problem does it solve?
    This project will automate the moderation process, ensuring timely intervention for negative comments and allowing for a seamless user experience.

Solution Approach

Our solution will involve the following steps:

  1. Data Collection and Preparation: Collect and preprocess chat data to build a robust training dataset.
  2. Model Training: Train a machine learning model to classify messages as positive or negative.
  3. Model Deployment: Deploy the model as a web service for real-time moderation.
  4. Model Monitoring: Monitor the performance and accuracy of the model post-deployment.
  5. Model Retraining: Implement an automated pipeline for retraining the model with new data when necessary.

Dataset

We use the Kaggle Science popular comment removal dataset, which contains Reddit /r/science comments labeled by whether moderators removed them; removed comments serve as negative examples and retained comments as positive ones.

Data Overview

About Dataset

Context
In the Age of the Internet, we humans as a species have become increasingly connected with each other. Unfortunately, that's not always a good thing. Sometimes we end up inadvertently connecting with people we'd really rather not talk to at all, and it ruins our day. In fact, trolls abound on the internet, and it's become increasingly difficult to find quality online discussions. Many online publishers simply do not allow commenting because of how easy it is for a few trolls to derail an otherwise illuminating discussion.

But maybe we can fix all that with the power of data science.

Content
The dataset is a CSV of about 30k Reddit comments made in /r/science between Jan 2017 and June 2018. 10k of the comments were removed by moderators; the original text for these comments was recovered using the pushshift.io API. Each comment is a top-level reply to the parent post and has a comment score of 14 or higher.

Acknowledgements
The dataset comes from Google BigQuery, Reddit, and Pushshift.io.

Thanks to Jesper Wrang of removeddit for advising on how to construct the dataset.

Thanks to Jigsaw for hosting the Toxic Comment Classification Kaggle Challenge -- from which I learned a lot about NLP.

Thanks to the participants of said challenge -- I borrow heavily from your results.

Inspiration

  • Can we help reduce moderator burnout by automating comment removal?
  • What features are most predictive of popular comments getting removed?

Technologies Used

In this project, we have utilized a combination of powerful technologies to ensure scalability, maintainability, and efficiency:

  • Cloud:

    • AWS S3 for data storage: Storing datasets and model artifacts efficiently.
    • AWS EC2 for compute resources: Running compute-intensive tasks and hosting web services.
    • AWS Lambda for serverless functions: Executing lightweight tasks without managing servers.
  • Experiment Tracking:

    • MLflow: Tracking experiments, logging parameters, and managing the model lifecycle with ease.
  • Workflow Orchestration:

    • Mage: Building and managing data pipelines and workflows with a focus on simplicity and scalability.
  • Model Deployment:

    • Docker: Containerizing applications to ensure consistency across different environments.
    • AWS ECR: Storing Docker images securely and facilitating deployment to AWS services.
  • Monitoring:

    • Grafana: Monitoring model performance and visualizing key metrics through interactive dashboards.
    • Evidently: Tracking model drift and performance metrics to ensure model reliability.
  • CI/CD:

    • GitHub Actions: Automating testing, integration, and deployment processes.
  • Infrastructure as Code (IaC):

    • Terraform: Provisioning and managing cloud resources using declarative configuration files.

This combination of technologies not only ensures a robust end-to-end machine learning workflow but also enhances collaboration, reproducibility, and scalability.

Experiment Tracking and Model Registry

Experiment Tracking with MLflow

  • Setup: We used MLflow to track experiments, log hyperparameters, and record metrics during training. This made it easy to compare models and select the best-performing one.

  • Components Tracked:

    • Model accuracy and loss
    • Training and validation datasets
    • Hyperparameters like learning rate, batch size, and epochs
  • Screenshots:

    (Screenshot: MLflow experiment tracking)
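
For illustration, here is a minimal sketch of how a training run might be logged with MLflow (the tracking URI, experiment name, and metric values are placeholders, not the project's actual configuration):

    import mlflow

    mlflow.set_tracking_uri("http://localhost:5000")  # assumed local tracking server
    mlflow.set_experiment("chat-moderation")          # hypothetical experiment name

    with mlflow.start_run():
        # The kinds of hyperparameters and metrics tracked in this project
        mlflow.log_params({"learning_rate": 1e-3, "batch_size": 32, "epochs": 5})
        mlflow.log_metric("val_accuracy", 0.91)  # placeholder value
        mlflow.log_metric("val_loss", 0.27)      # placeholder value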

Model Registry

  • Purpose: The model registry within MLflow helped us manage different versions of the model, ensuring that only the best models are used in production.
  • Process: Models were registered after each successful experiment, allowing for easy deployment and rollback if necessary.
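
A model version can then be registered from a finished run; here is a sketch using the standard MLflow API (the run ID placeholder and model name are hypothetical):

    import mlflow

    # "<run_id>" stands in for the ID of a finished training run
    mlflow.register_model(
        model_uri="runs:/<run_id>/model",
        name="chat-moderation-classifier",  # hypothetical registered model name
    )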

Workflow Orchestration

Mage for Orchestration

  • Description: Mage is used to orchestrate the data pipelines and workflow processes involved in the chat moderation project. It simplifies complex workflow creation and management.
  • Key Features:
    • Supports building pipelines using code, visual interfaces, or notebooks.
    • Provides an easy way to manage dependencies and trigger workflows.
    • Offers scalability to handle large data processing tasks.

Pipeline Setup with Mage

Mage lets you define pipelines as sequences of reusable code blocks. Here is how you might set up pipelines to preprocess data and to train and track a model (screenshots below; a code sketch follows them):

  • Data Preparation Pipeline:

    (Screenshot: data preparation pipeline)

  • Model Training and Tracking Pipeline:

    (Screenshot: model training and tracking pipeline)
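
As a concrete illustration, here is a minimal sketch of a single Mage transformer block of the kind the data preparation pipeline might contain (the block name and column names are assumptions):

    import pandas as pd

    # Mage injects its decorators at runtime; this guard mirrors Mage's block template
    if 'transformer' not in globals():
        from mage_ai.data_preparation.decorators import transformer

    @transformer
    def clean_comments(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        # Hypothetical cleaning step: normalize raw comment text before training
        df['body'] = df['body'].str.lower().str.strip()
        return df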

Model Deployment

Deployment Strategy

  • Containerization: Docker was used to containerize the application, ensuring consistency across environments.
  • AWS Lambda: The containerized application is deployed as a Lambda function built from an image stored in ECR, so it scales on demand without server management.

Deployment Process

  1. Containerize the Model:
    Using Docker, we created a container that includes the model, along with the necessary environment and dependencies.

       FROM python:3.9

       WORKDIR /app

       # Install dependencies first so this layer caches between builds
       COPY requirements.txt /app/
       RUN pip install -r requirements.txt

       # Copy the application code, including predict.py
       COPY . /app

       # The web service listens on port 9696
       EXPOSE 9696

       CMD ["gunicorn", "-b", "0.0.0.0:9696", "predict:app"]
    
  2. Model Deployment:

To deploy the containerized application on AWS Lambda, follow these steps (see AWS's guide on pushing a container image to ECR):

  1. Upload the container to ECR:

      # Build the image
      docker build -t webservice-model-moderation:v1 .

      # Create the ECR repository (the repository name must not include a tag);
      # requires AWS credentials (aws configure)
      aws ecr create-repository --repository-name webservice-model-moderation

      # Authenticate Docker to ECR, then tag and push the image
      aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
      docker tag webservice-model-moderation:v1 <account-id>.dkr.ecr.<region>.amazonaws.com/webservice-model-moderation:v1
      docker push <account-id>.dkr.ecr.<region>.amazonaws.com/webservice-model-moderation:v1
  2. Create a Lambda function from the container image in ECR.
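
Once the function exists, a quick smoke test from Python could look like this (the function name and payload shape are assumptions, not the project's actual interface):

    import json

    import boto3

    client = boto3.client("lambda")
    response = client.invoke(
        FunctionName="chat-moderation",  # hypothetical function name
        Payload=json.dumps({"message": "example chat message"}),
    )
    print(json.loads(response["Payload"].read()))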

Deployment Workflow

  1. Local Testing: Verify the model locally using Docker.
  2. Push to Registry: Push the Docker image to AWS ECR.
  3. Deploy to Lambda: Create or update the Lambda function from the pushed image.
  4. Scale and Manage: Lambda scales automatically; tune memory and concurrency limits as needed.
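
For reference, a minimal sketch of what the predict:app service named in the Dockerfile might look like (the route, model file, and label encoding are assumptions):

    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical: a scikit-learn pipeline (vectorizer + classifier) saved during training
    with open("model.bin", "rb") as f_in:
        model = pickle.load(f_in)

    @app.route("/predict", methods=["POST"])
    def predict():
        message = request.get_json()["message"]
        label = model.predict([message])[0]  # assumed encoding: 1 = negative, 0 = positive
        return jsonify({"negative": bool(label)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=9696)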

Model Monitoring

Grafana for Monitoring

Description: Grafana is used for monitoring the performance of the deployed model. It provides real-time insight into the model's predictions and resource utilization, and raises alerts for anomalies.

Key Features:

  • Interactive dashboards that visualize key metrics and logs.
  • Integration with Prometheus, Elasticsearch, and other data sources for comprehensive monitoring.
  • Custom alerting to notify the team about performance issues or data drift.
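
As an illustration of the Evidently side, here is a minimal sketch of generating a data drift report, assuming Evidently's Report API (the file names and data frames are placeholders):

    import pandas as pd
    from evidently.metric_preset import DataDriftPreset
    from evidently.report import Report

    # Placeholder frames: reference = training data, current = recent production traffic
    reference = pd.read_csv("reference.csv")
    current = pd.read_csv("current.csv")

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    report.save_html("drift_report.html")  # metrics can also be exported for Grafana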

Alert System

Not implemented yet.

Reproducibility

Ensuring that the project is reproducible was a key focus. We followed best practices to make the code easy to run for anyone who wants to reproduce our results.

Steps for Reproduction

  1. Clone the Repository:

    git clone https://github.com/quzanh1130/chat_moderation_mlops
    cd chat_moderation_mlops
  2. Deploy MLflow (experiment tracking): run an MLflow tracking server so the steps below can log runs to it.

  3. Make Baseline:

Run the baseline code to build an initial model and explore the data:

  cd baseline

  conda create -n chat python=3.8 -y
  conda activate chat
  pip install -r requirements.txt

Then run the baseline_model_chat_removal.ipynb notebook, which covers the following steps (a rough code sketch follows the list):

  • Download the data
  • Explore the data
  • Preprocess the data
  • Split into train, validation, and test sets
  • Select a model
  • Tune hyperparameters
  • Set up the AWS configuration
  • Track experiments with MLflow
  • Generate data and model reports with Evidently
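
The modeling step could look roughly like the following sketch, assuming a TF-IDF plus logistic regression baseline (the file and column names are hypothetical, not the dataset's actual schema):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical file and column names for the removed-comment dataset
    df = pd.read_csv("train.csv")
    pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    pipeline.fit(df["body"], df["removed"])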
  4. Set up Mage (Orchestration):

  5. Set up Grafana and Evidently (Monitoring):

    Not implemented yet.

  6. Set up CI/CD and Deploy Webservice:

    Not implemented yet.

Code Formatter and Linter

Not implemented yet.

Makefile

Not implemented yet.

About

Final project for MLOps Zoomcamp 2024
