Arabic OCR Experiments

Summary

This repository contains experiments on extracting text from images of Arabic text using OCR (Optical Character Recognition) tools. The work covers recognizing Arabic text in a range of input images, improving the OCR output, and evaluating the extracted text against ground truth with two standard metrics:

- Character Error Rate (CER): the number of character-level edits (insertions, deletions, substitutions) needed to turn the OCR output into the reference text, divided by the number of characters in the reference. Lower is better.
- Word Error Rate (WER): the same ratio computed over words instead of characters. Lower is better.
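
The two metrics can be sketched in a few lines of Python. This is a minimal illustration using a plain Levenshtein distance, not the repository's evaluation code; the function names `edit_distance`, `cer`, and `wer` are assumptions, and it splits words on whitespace without any text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] is the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: item of ref missing from hyp
                curr[j - 1] + 1,     # insertion: extra item in hyp
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because the denominator is the reference length, both rates can exceed 1.0 when the OCR output contains many spurious insertions.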

The repository explores three approaches:

- EasyOCR: A pretrained OCR model, used as is.
- Enhanced EasyOCR: EasyOCR with additional utility functions that post-process its output.
- Tesseract-OCR: An open-source OCR engine with a pretrained Arabic model.

Experiments Overview

| Experiment # | Model | CER | WER | Notes |
|---|---|---|---|---|
| 1 | EasyOCR | 0.28 | 0.57 | EasyOCR pretrained model with a function to sort Arabic text into reading order. |
| 2 | Enhanced EasyOCR | 0.26 | 0.49 | EasyOCR with utility functions: sorting into lines, word arrangement within lines, and numeral replacement. |
| 3 | Tesseract-OCR | 0.06 | 0.41 | Tesseract pretrained Arabic model with image preprocessing (grayscale conversion, Otsu binarization). |

Requirements

All dependencies are listed in the requirements.yaml file. To set up the environment, use the following commands:

Install Dependencies

Install Python dependencies:
```
pip install -r requirements.yaml
```
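Install Tesseract OCR and its Arabic language data:
- Windows: Download the installer from Tesseract's GitHub.
- Linux (Debian/Ubuntu): Install via the package manager. The `tesseract-ocr-ara` package provides the Arabic model:
```
sudo apt-get install tesseract-ocr tesseract-ocr-ara
```
- macOS: Use Homebrew. The `tesseract-lang` formula provides the additional language models, including Arabic:
```
brew install tesseract tesseract-lang
```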
Install EasyOCR:
```
pip install easyocr
```

Experiments Details

Experiment 1: EasyOCR Pretrained Model

Objective: Test EasyOCR's Arabic pretrained model for OCR tasks.
Features:
- Extracts text from images using EasyOCR.
- Includes a function to sort Arabic text into proper reading order.
Results:
- CER: 0.28
- WER: 0.57
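
As an illustration of this experiment, the sketch below calls EasyOCR's Arabic model and then orders the detections top-to-bottom and right-to-left. It is not the repository's sorting function: `read_arabic`, `sort_reading_order`, and the `row_height` bucket size are hypothetical names and values chosen for the example.

```python
def read_arabic(image_path):
    """Run EasyOCR's pretrained Arabic model on one image."""
    # Imported lazily so the sorting helper can be used without EasyOCR installed.
    import easyocr

    reader = easyocr.Reader(["ar"])
    # Each detection is (bounding_box, text, confidence);
    # bounding_box is a list of four [x, y] corner points.
    return reader.readtext(image_path)


def sort_reading_order(detections, row_height=20):
    """Return detected texts ordered top-to-bottom, then right-to-left.

    row_height is an assumed bucket size in pixels: detections whose top
    edges fall into the same bucket are treated as one row.
    """
    def key(detection):
        box = detection[0]
        top = min(point[1] for point in box)
        right = max(point[0] for point in box)
        # A negative right edge makes the rightmost box sort first in its row.
        return (int(top // row_height), -right)

    return [detection[1] for detection in sorted(detections, key=key)]
```

A call such as `sort_reading_order(read_arabic("page.png"))` would return the recognized strings in reading order, where `page.png` is a placeholder path. Fixed-size buckets are simple but can split a row whose boxes straddle a bucket boundary.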

Experiment 2: EasyOCR with Utilities

Objective: Enhance EasyOCR by adding utility functions for better Arabic text recognition.
Features:
- Additional utility functions applied to the OCR output:
  - Sorts text into lines.
  - Arranges words within lines.
  - Replaces Western digits (0-9) with Arabic-Indic digits (٠-٩).
Results:
- CER: 0.26
- WER: 0.49
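
The three utilities above might look roughly like the following. This is a hedged sketch rather than the repository's code: the `(text, x_right, y_center, height)` tuple layout, the function names, and the `tolerance` value are all assumptions made for the example.

```python
# Translation table from Western digits to Arabic-Indic digits.
WESTERN_TO_ARABIC_INDIC = str.maketrans("0123456789", "٠١٢٣٤٥٦٧٨٩")


def to_arabic_numerals(text):
    """Replace every Western digit in text with its Arabic-Indic counterpart."""
    return text.translate(WESTERN_TO_ARABIC_INDIC)


def arrange_lines(words, tolerance=0.5):
    """Group words into lines, then order each line right-to-left.

    words is a list of (text, x_right, y_center, height) tuples (an assumed
    layout). A word joins the current line when its vertical center lies
    within tolerance * height of the center of the line's first word.
    """
    lines = []
    for word in sorted(words, key=lambda w: w[2]):  # top to bottom
        _, _, y_center, height = word
        if lines and abs(y_center - lines[-1]["y"]) <= tolerance * height:
            lines[-1]["words"].append(word)
        else:
            lines.append({"y": y_center, "words": [word]})

    result = []
    for line in lines:
        # Arabic reads right-to-left, so the largest right edge comes first.
        ordered = sorted(line["words"], key=lambda w: w[1], reverse=True)
        result.append(" ".join(w[0] for w in ordered))
    return result
```

For example, `[to_arabic_numerals(line) for line in arrange_lines(words)]` yields one string per detected line with the digits converted.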

Experiment 3: Tesseract OCR Pretrained Model

Objective: Use Tesseract's pretrained Arabic OCR model for text recognition.
Features:
- Preprocessing steps:
  - Grayscale conversion.
  - Binarization using Otsu's thresholding.
- Extracts text using Tesseract's OCR engine.
Results:
- CER: 0.06
- WER: 0.41
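
To show what the binarization step computes, here is a dependency-free sketch of Otsu's method on a flat list of 8-bit grayscale values. It is an illustration only, not the repository's preprocessing code; `otsu_threshold` and `binarize` are hypothetical names.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance.

    pixels is a flat sequence of integer grayscale values in 0..255.
    Values at or below the threshold form one class, values above it the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the class at or below the candidate threshold
    sum0 = 0  # intensity sum of that class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is in the lower class; nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

In practice a library routine would normally do this. Assuming the scripts use OpenCV and pytesseract, `cv2.threshold` with the `cv2.THRESH_OTSU` flag selects the threshold, and `pytesseract.image_to_string(image, lang="ara")` runs Tesseract's Arabic model on the binarized image.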

Usage Instructions

Clone the Repository

git clone https://github.com/KenanSh/Arabic-OCR-Pretrained-Models.git
cd Arabic-OCR-Pretrained-Models

Run Experiments
- Navigate to the desired experiment directory (`Arabic easyocr`, `Arabic easyocr enhanced`, or `Tesseract Arabic OCR`).
- Update file paths and ground truth text in the script as needed.
- Run the corresponding Python script:
```
python main.py
```

Next Steps

- Test OCR performance on more challenging datasets.
- Explore fine-tuning pretrained models for improved Arabic text recognition.
- Implement additional preprocessing techniques (e.g., deskewing, denoising).
- Compare results with custom-trained OCR models for Arabic.

Repository: KenanSh/Arabic-OCR-Pretrained-Models
