This repository contains experiments on extracting text from images of Arabic text using OCR (Optical Character Recognition) tools. The task covers recognizing Arabic text in a variety of input images, improving OCR performance, and evaluating the accuracy of the extracted text with two standard metrics (lower is better):
- Character Error Rate (CER): The number of character-level errors (insertions, deletions, substitutions) divided by the number of characters in the ground-truth text.
- Word Error Rate (WER): The number of word-level errors (insertions, deletions, substitutions) divided by the number of words in the ground-truth text.
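Both metrics are commonly computed from the Levenshtein (edit) distance between the ground truth and the OCR output. The sketch below is one minimal way to do that in plain Python; the function names `edit_distance`, `cer`, and `wer` are illustrative and are not taken from the repository's scripts, which may use a library or a different normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character-level errors divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word-level errors divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because the denominator is the length of the ground truth, both values can exceed 1.0 when the OCR output contains many spurious insertions.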
The repository explores three approaches:
- EasyOCR: A pretrained model for OCR.
- Enhanced EasyOCR: EasyOCR with additional preprocessing and utility functions.
- Tesseract-OCR: An open-source OCR engine with a pretrained Arabic model.
| Experiment # | Model | Error rates (CER, WER) | Notes |
|---|---|---|---|
| 1 | EasyOCR | CER: 0.28, WER: 0.57 | EasyOCR pretrained model with a function to sort Arabic text. |
| 2 | Enhanced EasyOCR | CER: 0.26, WER: 0.49 | EasyOCR with preprocessing: sorting into lines, word arrangement, and numeral replacement. |
| 3 | Tesseract-OCR | CER: 0.06, WER: 0.41 | Tesseract pretrained Arabic model with grayscale conversion and Otsu binarization as preprocessing. |
All dependencies are listed in the requirements.yaml file. To set up the environment, use the following commands:
- Install Python dependencies:
  pip install -r requirements.yaml
- Install Tesseract OCR:
  - Windows: Download the installer from Tesseract's GitHub.
  - Linux: Install via package manager:
    sudo apt-get install tesseract-ocr
  - macOS: Use Homebrew:
    brew install tesseract
- Install EasyOCR:
  pip install easyocr
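After installation, it can help to confirm that the Tesseract binary is reachable and that Arabic language data is present. On Debian/Ubuntu the Arabic data typically ships in a separate package (`tesseract-ocr-ara`), so the base install alone may not be enough. The check below is a small sketch; the helper name `tesseract_has_language` is hypothetical and not part of the repository.

```python
import shutil
import subprocess


def tesseract_has_language(lang="ara"):
    """Return True if the `tesseract` CLI is on PATH and lists `lang`."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract versions print the list to stderr, so inspect both streams.
    listed = (result.stdout + result.stderr).split()
    return lang in listed
```

Calling `tesseract_has_language("ara")` before running the Tesseract experiment gives an early, readable failure instead of an error from inside the OCR call.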
- Objective: Test EasyOCR's Arabic pretrained model for OCR tasks.
- Features:
  - Extracts text from images using EasyOCR.
  - Includes a function to sort Arabic text into proper reading order.
- Results:
  - CER: 0.28
  - WER: 0.57
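As a rough illustration of this experiment, the sketch below runs EasyOCR's Arabic model and orders the detections top-to-bottom and right-to-left. The function names and the `row_height` bucket size are assumptions made for the example; the repository's own sorting function may work differently. EasyOCR's `readtext` returns a list of `(bbox, text, confidence)` tuples, where `bbox` holds four `[x, y]` corner points.

```python
def sort_arabic_reading_order(detections, row_height=20):
    """Order detections top-to-bottom, then right-to-left within a row.

    `row_height` (in pixels) is an assumed bucket size: boxes whose top
    edges fall in the same bucket are treated as one row.
    """
    def key(det):
        bbox = det[0]
        top = min(point[1] for point in bbox)
        right = max(point[0] for point in bbox)
        return (top // row_height, -right)

    return sorted(detections, key=key)


def run_easyocr(image_path):
    """Read Arabic text from an image and return it in reading order."""
    import easyocr  # imported lazily; requires `pip install easyocr`

    reader = easyocr.Reader(["ar"])
    detections = reader.readtext(image_path)
    ordered = sort_arabic_reading_order(detections)
    return " ".join(text for _, text, _ in ordered)
```

Fixed-size buckets are simple but fragile when a row straddles a bucket boundary, which is one motivation for explicit line grouping.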
- Objective: Enhance EasyOCR by adding utility functions for better Arabic text recognition.
- Features:
  - Additional preprocessing functions:
    - Sorts text into lines.
    - Arranges words within lines.
    - Replaces English (Western) numerals with Arabic numerals.
- Results:
  - CER: 0.26
  - WER: 0.49
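The three utility steps can be sketched as follows, assuming EasyOCR-style `(bbox, text, confidence)` detections. The helper names and the `tolerance` parameter are hypothetical, and mapping `0-9` to the Arabic-Indic digits U+0660 to U+0669 is an assumption about what "Arabic numerals" means here; the repository's functions may differ in detail.

```python
# Map Western digits 0-9 to Arabic-Indic digits U+0660..U+0669.
WESTERN_TO_ARABIC_DIGITS = {ord(str(d)): chr(0x0660 + d) for d in range(10)}


def to_arabic_numerals(text):
    """Replace Western digits in `text` with Arabic-Indic digits."""
    return text.translate(WESTERN_TO_ARABIC_DIGITS)


def _centre_y(bbox):
    ys = [point[1] for point in bbox]
    return (min(ys) + max(ys)) / 2


def _height(bbox):
    ys = [point[1] for point in bbox]
    return max(ys) - min(ys)


def group_into_lines(detections, tolerance=0.5):
    """Group detections into lines by vertical centre.

    A box joins the current line when its centre lies within
    `tolerance` * box height of the line's running mean centre.
    """
    lines = []
    for det in sorted(detections, key=lambda d: _centre_y(d[0])):
        cy, h = _centre_y(det[0]), _height(det[0])
        if lines and abs(cy - lines[-1]["cy"]) <= tolerance * h:
            line = lines[-1]
            line["items"].append(det)
            # Update the running mean of the line's vertical centre.
            line["cy"] += (cy - line["cy"]) / len(line["items"])
        else:
            lines.append({"cy": cy, "items": [det]})
    return [line["items"] for line in lines]


def detections_to_text(detections):
    """Build text line by line, ordering words right-to-left."""
    out = []
    for line in group_into_lines(detections):
        words = sorted(
            line,
            key=lambda d: max(point[0] for point in d[0]),
            reverse=True,  # right-most word first
        )
        out.append(" ".join(to_arabic_numerals(d[1]) for d in words))
    return "\n".join(out)
```

Scaling the tolerance by box height keeps the grouping independent of image resolution, which a fixed pixel threshold would not.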
- Objective: Use Tesseract's pretrained Arabic OCR model for text recognition.
- Features:
  - Preprocessing steps:
    - Grayscale conversion.
    - Binarization using Otsu's thresholding.
  - Extracts text using Tesseract's OCR engine.
- Results:
  - CER: 0.06
  - WER: 0.41
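A minimal sketch of this pipeline is shown below. It assumes OpenCV (`opencv-python`) for the preprocessing and the `pytesseract` wrapper for calling Tesseract; neither package is named in this README, so treat both, along with the function name `tesseract_ocr`, as illustrative choices rather than the repository's actual implementation.

```python
import os


def tesseract_ocr(image_path, lang="ara"):
    """Grayscale -> Otsu binarization -> Tesseract text extraction."""
    if not os.path.isfile(image_path):
        raise FileNotFoundError(image_path)

    # Imported lazily so this module loads without the optional dependencies.
    import cv2  # assumed dependency: pip install opencv-python
    import pytesseract  # assumed dependency: pip install pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"could not decode image: {image_path}")

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # With THRESH_OTSU the threshold value (0) is ignored; OpenCV picks it
    # automatically from the image histogram.
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return pytesseract.image_to_string(binary, lang=lang)
```

Otsu's method selects a single global threshold, so it tends to suit evenly lit scans better than photos with strong shadows.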
- Clone the Repository
  git clone https://github.com/Arabic-OCR-Pretrained-Models.git
  cd arabic-ocr-experiments
- Run Experiments
  - Navigate to the desired experiment directory.
  - Update file paths and ground truth text in the script as needed.
  - Run the corresponding Python script:
    python main.py
- Test OCR performance on more challenging datasets.
- Explore fine-tuning pretrained models for improved Arabic text recognition.
- Implement additional preprocessing techniques (e.g., deskewing, denoising).
- Compare results with custom-trained OCR models for Arabic.