- Introduction
- Problem Description
- Solution Approach
- Dataset
- Technologies Used
- Experiment Tracking and Model Registry
- Workflow Orchestration
- Model Deployment
- Model Monitoring
- Reproducibility
In this project, we will develop a machine learning model for chat moderation. The primary goal is to classify messages as positive or negative, enabling a better user experience by filtering out negative chat content.
The rise of online platforms has led to an increase in user-generated content, often resulting in negative and harmful comments that can affect user experience. This project aims to provide a chat moderation solution by classifying messages into positive or negative categories. The primary goal is to ensure a healthy and friendly environment for users by automatically filtering and moderating content.
- Why is this important?
  Online communities require moderation to maintain a safe environment, and automating this process saves time and reduces the risk of human error.
- What problem does it solve?
  This project automates the moderation process, ensuring timely intervention for negative comments and allowing for a seamless user experience.
Our solution will involve the following steps:
- Data Collection and Preparation: Collect and preprocess chat data to build a robust training dataset.
- Model Training: Train a machine learning model to classify messages as positive or negative.
- Model Deployment: Deploy the model as a web service for real-time moderation.
- Model Monitoring: Monitor the performance and accuracy of the model post-deployment.
- Model Retraining: Implement an automated pipeline for retraining the model with new data when necessary.
We will use the Kaggle r/science popular comment removal dataset, which contains Reddit comments labeled by whether moderators removed them.
Context
In the Age of the Internet, we humans as a species have become increasingly connected with each other. Unfortunately, that's not always a good thing. Sometimes we end up inadvertently connecting with people we'd really rather not talk to at all, and it ruins our day. In fact, trolls abound on the internet, and it's become increasingly difficult to find quality online discussions. Many online publishers simply do not allow commenting because of how easy it is for a few trolls to derail an otherwise illuminating discussion.
But maybe we can fix all that with the power of data science.
Content
The dataset is a CSV of about 30k Reddit comments made in /r/science between Jan 2017 and June 2018. 10k of the comments were removed by moderators; the original text for these comments was recovered using the pushshift.io API. Each comment is a top-level reply to the parent post and has a comment score of 14 or higher.
Acknowledgements
The dataset comes from Google BigQuery, Reddit, and Pushshift.io.
Thanks to Jesper Wrang of removeddit for advising on how to construct the dataset.
Thanks to Jigsaw for hosting the Toxic Comment Classification Kaggle Challenge -- from which I learned a lot about NLP.
Thanks to the participants of said challenge -- I borrow heavily from your results.
Inspiration
- Can we help reduce moderator burnout by automating comment removal?
- What features are most predictive of popular comments getting removed?
In this project, we have utilized a combination of powerful technologies to ensure scalability, maintainability, and efficiency:
- Cloud:
  - AWS S3 for data storage: Storing datasets and model artifacts efficiently.
  - AWS EC2 for compute resources: Running compute-intensive tasks and hosting web services.
  - AWS Lambda for serverless functions: Executing lightweight tasks without managing servers.
- Experiment Tracking:
  - MLflow: Tracking experiments, logging parameters, and managing the model lifecycle with ease.
- Workflow Orchestration:
  - Mage: Building and managing data pipelines and workflows with a focus on simplicity and scalability.
- Model Deployment:
  - Docker and AWS Lambda: Packaging the model as a container image and serving it as a web service (see Model Deployment below).
- Monitoring:
  - Grafana and Evidently: Dashboards and data/model reports (see Model Monitoring below).
- CI/CD:
  - GitHub Actions: Automating testing, integration, and deployment processes.
- Infrastructure as Code (IaC):
  - Terraform: Provisioning and managing cloud resources using declarative configuration files.
This combination of technologies not only ensures a robust end-to-end machine learning workflow but also enhances collaboration, reproducibility, and scalability.
- Setup: We used MLflow to track experiments, log hyperparameters, and record metrics during training. This helped us keep track of different models and select the best-performing one.
- Components Tracked:
  - Model accuracy and loss
  - Training and validation datasets
  - Hyperparameters like learning rate, batch size, and epochs
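As a rough sketch of what this tracking looks like in code (the tracking URI, experiment name, and the specific values below are placeholders, not the project's exact configuration):

```python
import mlflow

# Placeholder URI: a local server or the AWS-hosted tracking server
mlflow.set_tracking_uri('http://127.0.0.1:5000')
mlflow.set_experiment('chat-moderation')

with mlflow.start_run():
    # Hyperparameters (placeholder values)
    mlflow.log_param('learning_rate', 1e-3)
    mlflow.log_param('batch_size', 64)
    mlflow.log_param('epochs', 10)

    # Metrics recorded during and after training
    mlflow.log_metric('val_accuracy', 0.91)
    mlflow.log_metric('val_loss', 0.27)

    # Dataset versions can be attached as tags or artifacts
    mlflow.set_tag('train_data', 'data/train.parquet')
```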
- Purpose: The model registry within MLflow helped us manage different versions of the model, ensuring that only the best models are used in production.
- Process: Models were registered after each successful experiment, allowing for easy deployment and rollback if necessary.
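In MLflow, registering and promoting a model takes only a few calls; the model name and run URI here are placeholders:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Placeholder: the URI of the best run's logged model
model_uri = 'runs:/<run-id>/model'
result = mlflow.register_model(model_uri, 'chat-moderation-classifier')

# Promote the new version once it has been validated;
# older versions remain available for rollback
client = MlflowClient()
client.transition_model_version_stage(
    name='chat-moderation-classifier',
    version=result.version,
    stage='Production',
)
```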
- Description: Mage is used to orchestrate the data pipelines and workflow processes involved in the chat moderation project. It simplifies complex workflow creation and management.
- Key Features:
- Supports building pipelines using code, visual interfaces, or notebooks.
- Provides an easy way to manage dependencies and trigger workflows.
- Offers scalability to handle large data processing tasks.
Mage lets you define workflows as small, composable blocks with explicit dependencies. Here’s an example of how you might set up a pipeline to preprocess data before training and deployment:
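Below is a minimal sketch of such a pipeline's blocks, using Mage's standard block decorators. The file paths, the column name `body`, and the function names are hypothetical, not the project's actual code:

```python
# Each Mage block normally lives in its own file inside the pipeline
# directory; the three blocks are shown together here for brevity.
import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_loader
def load_chat_data(*args, **kwargs):
    # Hypothetical path: the raw comments dataset
    return pd.read_csv('data/comments.csv')


@transformer
def preprocess(df: pd.DataFrame, *args, **kwargs):
    # Drop empty rows and normalize the text column
    df = df.dropna(subset=['body'])
    df['body'] = df['body'].str.lower().str.strip()
    return df


@data_exporter
def export_data(df: pd.DataFrame, *args, **kwargs):
    # Persist the processed dataset for the training step
    df.to_parquet('data/processed_comments.parquet')
```

Mage wires these blocks together from the pipeline's dependency graph, so the transformer automatically receives the loader's output.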
- Containerization: Docker was used to containerize the application, ensuring consistency across environments.
- Kubernetes: The containerized application was deployed on Kubernetes for easy scaling and management.
- Containerize the Model:
  Using Docker, we created a container that includes the model, along with the necessary environment and dependencies.

```dockerfile
FROM python:3.9
WORKDIR /app
COPY requirements.txt /app/
RUN pip install -r requirements.txt
COPY . /app
EXPOSE 9696
CMD ["gunicorn", "-b", "0.0.0.0:9696", "predict:app"]
```
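The `CMD` line expects a Flask application object named `app` in `predict.py`. As a rough sketch (the `/predict` route, JSON schema, and `model.bin` path are assumptions, not the repository's actual code), the service might look like this:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask('chat-moderation')

# Hypothetical artifact: a pickled sklearn pipeline over raw text
with open('model.bin', 'rb') as f_in:
    model = pickle.load(f_in)


@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"message": "some comment text"}
    message = request.get_json()['message']
    pred = model.predict([message])[0]
    return jsonify({'removed': bool(pred)})


if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9696)
```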
- Model Deployment:
  To deploy the containerized application on AWS Lambda, follow these steps (see the guide for pushing a container to ECR).
- Upload the container to ECR:

```sh
# Build the Docker image
docker build -t webservice-model-moderation:v1 .

# Remember to run `aws configure` first; note that the repository
# name must not include a tag like :v1
aws ecr create-repository --repository-name webservice-model-moderation

# Tag and push the image (replace <account-id> and <region>)
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
docker tag webservice-model-moderation:v1 <account-id>.dkr.ecr.<region>.amazonaws.com/webservice-model-moderation:v1
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/webservice-model-moderation:v1
```
- Create a Lambda function from the container image in ECR.
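This can be done from the AWS console or programmatically; below is a minimal boto3 sketch, where the image URI, role ARN, and function name are placeholders you would substitute:

```python
import boto3

lambda_client = boto3.client('lambda')

# Placeholders: substitute your account ID, region, and IAM role
image_uri = '<account-id>.dkr.ecr.<region>.amazonaws.com/webservice-model-moderation:v1'
role_arn = 'arn:aws:iam::<account-id>:role/<lambda-execution-role>'

lambda_client.create_function(
    FunctionName='chat-moderation',
    PackageType='Image',           # deploy from a container image
    Code={'ImageUri': image_uri},
    Role=role_arn,
    Timeout=30,
    MemorySize=1024,
)
```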
- Local Testing: Verify the model locally using Docker (a smoke-test sketch follows this list).
- Push to Registry: Push the Docker image to a container registry.
- Deploy to Kubernetes: Use Kubernetes to deploy the application.
- Scale and Manage: Adjust replicas and resources as needed using Kubernetes.
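For the local-testing step, a quick smoke test against the running container might look like this; the `/predict` route and JSON fields match the hypothetical `predict.py` sketch above:

```python
import requests

# Assumes the container is running locally:
#   docker run -p 9696:9696 webservice-model-moderation:v1
url = 'http://localhost:9696/predict'
payload = {'message': 'This thread is full of great explanations!'}

response = requests.post(url, json=payload, timeout=10)
print(response.json())  # e.g. {'removed': False}
```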
Description: Grafana is used for monitoring the performance of the deployed model. It provides real-time insights into the model's predictions, resource utilization, and alerts for any anomalies.
Key Features:
- Interactive dashboards that visualize key metrics and logs.
- Integration with Prometheus, Elasticsearch, and other data sources for comprehensive monitoring.
- Custom alerting to notify the team about performance issues or data drift.
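Although this component was not implemented (noted below), a minimal sketch of the kind of data-drift report Evidently can feed into such dashboards might look like this; the file paths are placeholders:

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Placeholder inputs: features from training data vs. recent traffic
reference = pd.read_parquet('data/reference_features.parquet')
current = pd.read_parquet('data/current_features.parquet')

# Compare feature distributions between the two windows
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# Inspect the HTML directly, or export the metrics to a store
# (e.g. PostgreSQL) that a Grafana dashboard reads from
report.save_html('data_drift_report.html')
```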
Not implemented.
Ensuring that the project is reproducible was a key focus. We followed best practices to make the code easy to run for anyone who wants to reproduce our results.
- Clone the Repository:

```sh
git clone https://github.com/quzanh1130/chat_moderation_mlops
cd chat_moderation_mlops
```
- Deploy MLflow (Tracking experiment):
  Follow this document or YouTube video:
  - YouTube: MLOps Zoomcamp 2.6 - MLflow in practice. By DataTalksClub
  - Document: MLflow set up on AWS
- Make Baseline:
  Run the baseline code to build the initial model and explore the data:

```sh
cd baseline
conda create -n chat python=3.8 -y
conda activate chat
pip install -r requirements.txt
```

  Then run the baseline_model_chat_removal.ipynb notebook, which covers:
- Downloading the data
- Exploring the data
- Preprocessing the data
- Making train, validation, and test splits
- Selecting a model
- Tuning hyperparameters
- Setting up AWS configuration
- Tracking experiments with MLflow
- Using Evidently to generate data and model reports
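As a rough illustration of what such a baseline looks like end to end, here is a minimal sketch; the TF-IDF plus logistic regression choice, the column names `body` and `removed`, and the file path are assumptions rather than the notebook's exact code:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical columns: 'body' (comment text), 'removed' (0/1 label)
df = pd.read_csv('data/comments.csv')
X_train, X_val, y_train, y_val = train_test_split(
    df['body'], df['removed'], test_size=0.2, random_state=42
)

# A common text-classification baseline: TF-IDF features
# feeding a linear classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

auc = roc_auc_score(y_val, pipeline.predict_proba(X_val)[:, 1])
print(f'Validation ROC AUC: {auc:.3f}')
```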
- Set up Mage (Orchestration):
- Set up Grafana and Evidently (Monitoring):
  Not implemented.
- Set up CI/CD and Deploy Webservice:
  Not implemented.