dl_project_herbarium

Install

Use an isolated environment such as conda or virtualenv to simplify the installation. With conda, run conda env create -f environment.yaml
Change the torch version in requirements.txt to match your CUDA version.
If you did not use conda, install the dependencies: pip install -r requirements.txt
Install the package in editable mode from the repository root: pip install -e .

Tesseract

Install Tesseract: https://tesseract-ocr.github.io/tessdoc/Downloads.html (or remove it from requirements.txt if you don't want to use it).
You might need to add the Tesseract install folder to the PATH environment variable.
Tesseract also needs language data (https://tesseract-ocr.github.io/tessdoc/Data-Files.html). If you get an error about a missing language file, download the requested file and place it in the folder named in the error message.
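
A quick way to check both points (PATH and language data) is sketched below. It is not part of the project: the helper names find_tesseract and installed_languages, and the extra_dirs parameter, are made up for illustration. It assumes a recent Tesseract release, where tesseract --list-langs prints a header line followed by one language code per line.

```python
import os
import shutil
import subprocess


def find_tesseract(extra_dirs=()):
    """Return the path of the tesseract executable, or None if it is not found.

    extra_dirs is a hypothetical list of install folders to search in
    addition to the directories already listed in PATH.
    """
    dirs = [*extra_dirs, *os.environ.get("PATH", "").split(os.pathsep)]
    search_path = os.pathsep.join(d for d in dirs if d)
    return shutil.which("tesseract", path=search_path)


def installed_languages(executable):
    """List the language codes reported by `tesseract --list-langs`."""
    result = subprocess.run(
        [executable, "--list-langs"], capture_output=True, text=True
    )
    # Assumption: the first output line is a header, the rest are codes
    # such as "eng" or "fra".
    return [line.strip() for line in result.stdout.splitlines()[1:] if line.strip()]


executable = find_tesseract()
if executable is None:
    print("tesseract not found: add its install folder to PATH")
else:
    print("tesseract found at", executable)
    print("languages:", installed_languages(executable))
```

If the French labels are to be read, "fra" should appear in the printed list.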

Datasets

French Herbarium Without Labels

The goal of this project is to extract the specimen labels from these images. The dataset was provided by our teacher and is not publicly available. It consists of 115 images without annotations (i.e. no bounding boxes).

MELU Object Detection

This is a labeled dataset of 4,370 images of herbarium specimens. We use it to train a YOLOv11 model. To build this dataset, run python build_dataset.py. Important: the website the script relies on was down for several days. If that happens to you, the script will fail and you will not be able to train a YOLOv11 model. In that case, use the fine-tuned YOLOv5 model directly.

MELU Object Detection - Validation Split 100 First Images

This dataset has the same purpose as "French Herbarium Without Labels". We took the first 100 images of the validation split of the MELU Object Detection dataset to keep the number of images small. "French Herbarium Without Labels" was useful for testing the model on a dataset that is very different from the training data and for testing OCR methods on French text. This dataset, on the other hand, is in English and has the same format as the training split, so we get better results on it.

Usage

YOLO Model

A folder named yolo_models, containing the YOLO model weights (.pt files), is expected in the root directory.

YOLOv5 fine-tuned: https://figshare.unimelb.edu.au/articles/dataset/_strong_Data_available_for_Identification_of_herbarium_specimen_sheet_components_from_high-resolution_images_using_deep_learning_YOLOv5_Best_model_weights_for_MELU_trained_object_detection_model_strong_/23597034?file=41395557
YOLOv11 fine-tuned (ours): Use the notebook train_yolov11.ipynb to train it on the "MELU Object Detection" dataset (described above; see "YOLO Model Fine-tuning" below for the settings).

YOLO Model Fine-tuning

We used hyperparameters from https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.10395, that is:

Number of epochs: 200 (with early stopping; the paper does not report the patience)
Model: yolo11l.pt (YOLOv11 large)
Image size: 640 pixels
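
As a rough illustration, the settings above could be passed to the Ultralytics training API as follows. This is a sketch, not a copy of the notebook: it assumes melu.yaml is the dataset description file given to Ultralytics, and the names TRAIN_ARGS and train are made up for illustration.

```python
# Settings listed above. "melu.yaml" is assumed to be the dataset
# description file (an assumption; check the notebook for the real value).
TRAIN_ARGS = {
    "data": "melu.yaml",
    "epochs": 200,
    "imgsz": 640,
}


def train(weights="yolo11l.pt", **overrides):
    """Fine-tune a pretrained large YOLO model with the settings above."""
    # Imported here so the settings can be inspected without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    # Early stopping is controlled by the `patience` argument. It is left at
    # the library default because the paper does not report the value used.
    return model.train(**{**TRAIN_ARGS, **overrides})
```

Calling train() starts the run; a keyword such as train(patience=20) would override the early-stopping patience, where 20 is an arbitrary example value.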

As in the paper associated with the YOLOv5 version, we use 1,000 images for validation and the remaining 3,370 for training.
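
The split sizes can be illustrated with a short sketch. Whether the project shuffles before splitting, and with which seed, is not stated here, so the seeded shuffle below is an assumption; the function name split_train_val and the file names are hypothetical.

```python
import random


def split_train_val(image_names, n_val=1000, seed=0):
    """Shuffle the image names and hold out n_val of them for validation."""
    names = sorted(image_names)          # fixed starting order for reproducibility
    random.Random(seed).shuffle(names)   # seeded shuffle; the seed is arbitrary
    return names[n_val:], names[:n_val]  # (training, validation)


# Hypothetical file names standing in for the 4,370 dataset images.
images = [f"img_{i:04d}.jpg" for i in range(4370)]
train_set, val_set = split_train_val(images)
print(len(train_set), len(val_set))  # 3370 1000
```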

Note that if albumentations is installed, YOLO automatically uses it for data augmentation (blur, etc.). Since those augmentations were not used in the paper and are irrelevant for our dataset, we did not install it.

Check the notebook train_yolov11.ipynb for more details.

Label Images Extraction

Run python extract_labels.py to extract the label images. The script uses a YOLO model to predict bounding boxes and then crops the images to those boxes. Check the notebook extract_labels.ipynb for more details.
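
The cropping step can be sketched as below. This is not the code of extract_labels.py: clamp_box and crop_labels are hypothetical helpers, boxes are assumed to be (x1, y1, x2, y2) pixel coordinates, and Pillow is assumed to be installed for the actual cropping.

```python
import math


def clamp_box(box, width, height):
    """Convert an (x1, y1, x2, y2) box to integer bounds inside the image."""
    x1, y1, x2, y2 = box
    left = min(max(math.floor(x1), 0), width)
    top = min(max(math.floor(y1), 0), height)
    right = min(max(math.ceil(x2), left), width)
    bottom = min(max(math.ceil(y2), top), height)
    return left, top, right, bottom


def crop_labels(image_path, boxes):
    """Return one cropped image per predicted bounding box."""
    from PIL import Image  # Pillow, imported lazily (third-party dependency)

    with Image.open(image_path) as image:
        width, height = image.size
        return [image.crop(clamp_box(box, width, height)) for box in boxes]


# A box that sticks out of a 100x80 image is clipped to the image borders.
print(clamp_box((-5.2, 10.7, 120.3, 90.1), 100, 80))  # (0, 10, 100, 80)
```

With a model loaded through the ultralytics package, the boxes could come from model(image_path)[0].boxes.xyxy.tolist(); other loaders expose predictions differently.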

Text Extraction From Label Images

You can use either the notebooks or the OCR classes in the ocr module.
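
For orientation, a minimal Tesseract call on one cropped label image might look like this. It does not use the project's OCR classes, whose interface is not described here: ocr_label and normalize_text are hypothetical names, and the sketch assumes pytesseract, Pillow, and the language data for the chosen lang code are installed.

```python
def normalize_text(text):
    """Collapse runs of whitespace and drop empty lines from raw OCR output."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def ocr_label(image_path, lang="fra"):
    """Run Tesseract on one label image and return the cleaned text."""
    # Imported lazily: both packages are third-party dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image, lang=lang)
    return normalize_text(text)


print(normalize_text("  Herbier   de\n\n Paris \n"))
```

Use lang="eng" for the MELU images and lang="fra" for the French herbarium images.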

App

Start the Streamlit app with: python -m streamlit run src/herbarium_labels_extractor/start_app.py
