Python package with GUI to extract text labels from herbarium images using different YOLO models and OCR engines

RobinMeneust/dl_project_herbarium

dl_project_herbarium

Install

  1. Use conda or virtualenv, for instance, to simplify the installation. With conda, run: conda env create -f environment.yaml
  2. Change the torch version in requirements.txt to match your CUDA version.
  3. Install dependencies (only if you did not use conda): pip install -r requirements.txt
  4. Install the package in editable mode: pip install -e .

Tesseract

  1. Install Tesseract: https://tesseract-ocr.github.io/tessdoc/Downloads.html (or remove it from requirements.txt if you don't want to use it).
  2. You might need to add the Tesseract install folder to the PATH environment variable.
  3. Tesseract also needs language data (https://tesseract-ocr.github.io/tessdoc/Data-Files.html). If an error reports a missing language file, download it and place it in the folder named in the error message.
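
To check steps 2 and 3 before running any OCR, a small script can look for the binary on PATH and list the installed language data. This is only a sketch: the helper names are hypothetical and not part of this package, and it assumes the standard tesseract --list-langs command, whose output starts with a "List of available languages" header line. Some older Tesseract versions print this list on stderr, so both streams are read.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


def parse_languages(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if line and not line.startswith("List of"):
            codes.append(line)
    return codes


def list_tesseract_languages(executable="tesseract"):
    """Run `tesseract --list-langs` and return the installed language codes."""
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams, since the list is not always printed on stdout.
    return parse_languages(result.stdout + result.stderr)
```

If find_tesseract() returns None, the install folder is probably missing from PATH. If the language you need (for example fra for French) is absent from list_tesseract_languages(), the corresponding data file still has to be downloaded.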

Datasets

French Herbarium Without Labels

The goal of this project is to extract the text labels from these images. The dataset was provided by our teacher and is not publicly available. It consists of 115 images without annotations (i.e. no bounding boxes).

MELU Object Detection

This is a labeled dataset of 4,370 images of herbarium specimens. We use it to train a YOLOv11 model. To build this dataset, run python build_dataset.py. Important: the website the script depends on was down for several days. If that happens to you as well, the script will fail and you will not be able to train a YOLOv11 model. In that case, you can use YOLOv5 directly.

MELU Object Detection - Validation Split 100 First Images

This dataset has the same purpose as "French Herbarium Without Labels". We took the first 100 images of the validation split of the MELU Object Detection dataset, so that we don't have too many images. "French Herbarium Without Labels" was useful for testing the model on a dataset that is very different from the one used for training, and for testing OCR methods on French texts. This dataset, on the other hand, is in English and has the same format as the training split, so we get better results on it.

Usage

YOLO Model

A folder named yolo_models, containing the YOLO models (.pt files), is expected in the root directory.
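
As a quick sanity check of this layout, the following sketch lists the weight files it finds. The function name list_yolo_weights is hypothetical and not part of this package; only the yolo_models folder name and the .pt extension come from the text above.

```python
from pathlib import Path


def list_yolo_weights(root="."):
    """Return the sorted .pt files found in <root>/yolo_models."""
    models_dir = Path(root) / "yolo_models"
    if not models_dir.is_dir():
        raise FileNotFoundError(f"expected a folder at {models_dir}")
    # Only files directly inside yolo_models are considered, not subfolders.
    return sorted(models_dir.glob("*.pt"))
```

Calling list_yolo_weights() from the repository root should return at least one path. An empty list means the folder exists but no model has been placed in it yet.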

YOLO Model Fine-tuning

We used the hyperparameters from https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.10395, namely:

  • Number of epochs: 200 (with early stopping, but the patience was not given)
  • Model: yolo11l.pt (YOLOv11 large)
  • Image size: 640 pixels
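
The hyperparameters listed above could be passed to a training run roughly as follows. This is a sketch that assumes the Ultralytics package is used for training. The dataset YAML path is a hypothetical placeholder, and no patience value is set because the paper does not give one, so the library default applies. The notebook train_yolov11.ipynb remains the reference for the actual training code.

```python
# Hyperparameters taken from the list above.
BASE_WEIGHTS = "yolo11l.pt"  # YOLOv11 large
TRAIN_ARGS = {
    "data": "datasets/melu/data.yaml",  # hypothetical path to the dataset config
    "epochs": 200,
    "imgsz": 640,
}


def fine_tune(weights=BASE_WEIGHTS, train_args=None):
    """Fine-tune a YOLO model with the given arguments (assumes Ultralytics)."""
    # Imported lazily so the configuration can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**(train_args or TRAIN_ARGS))
```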

As in the paper associated with the YOLOv5 version, we use 1,000 images for validation and the rest for training.
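
A split of this kind can be sketched as below. The function name is hypothetical, and the seeded random shuffle is an assumption: the text above only fixes the validation size, not how the images are selected.

```python
import random


def split_train_val(image_paths, n_val=1000, seed=0):
    """Return (train, val) lists, with exactly n_val images held out for validation."""
    if n_val > len(image_paths):
        raise ValueError("n_val is larger than the number of images")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(image_paths)
    random.Random(seed).shuffle(shuffled)  # assumption: random selection
    return shuffled[n_val:], shuffled[:n_val]
```

With the 4,370 images of the MELU dataset, this leaves 3,370 images for training.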

Note that if albumentations is installed, YOLO will automatically use it for data augmentation (blur, etc.). Since those augmentations were not used in the paper and are irrelevant for our dataset, we didn't install it.

Check the notebook train_yolov11.ipynb for more details.

Label Images Extraction

python extract_labels.py extracts the label images: it uses a YOLO model to predict bounding boxes and then crops the images with them. Check the notebook extract_labels.ipynb for more details.
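
To illustrate the cropping step, the sketch below converts one bounding box into pixel coordinates. It assumes the box is given in the normalized (x_center, y_center, width, height) form used by YOLO label files, which may differ from what extract_labels.py works with internally. The function name is hypothetical.

```python
def yolo_to_pixel_box(box, image_width, image_height):
    """Convert a normalized (x_center, y_center, width, height) box
    to integer (left, upper, right, lower) pixel coordinates."""
    x_c, y_c, w, h = box
    left = round((x_c - w / 2) * image_width)
    upper = round((y_c - h / 2) * image_height)
    right = round((x_c + w / 2) * image_width)
    lower = round((y_c + h / 2) * image_height)
    # Clamp to the image so boxes that overflow the border stay valid.
    left, right = max(0, left), min(image_width, right)
    upper, lower = max(0, upper), min(image_height, lower)
    return (left, upper, right, lower)
```

The resulting tuple uses the same (left, upper, right, lower) order as Pillow's Image.crop, so it can be passed to that method directly to obtain the label image.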

Text Extraction From Label Images

You can use either the notebooks or the OCR classes in the ocr module.

App

python -m streamlit run src/herbarium_labels_extractor/start_app.py
