- Use conda or virtualenv to simplify the installation. For conda, run:

      conda env create -f environment.yaml

- Change the torch version in `requirements.txt` to match your CUDA version.
- Install the dependencies (if you didn't use conda):

      pip install -r requirements.txt

- Install the package:

      pip install -e .
- Install Tesseract: https://tesseract-ocr.github.io/tessdoc/Downloads.html (or remove it from `requirements.txt` if you don't want to use it).
- You might need to add the path to the Tesseract install folder to the `PATH` environment variable.
- Tesseract also needs language data (https://tesseract-ocr.github.io/tessdoc/Data-Files.html), so if you get an error, download the requested file and put it in the requested folder.
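As a quick sanity check before running the OCR steps, you can verify that the `tesseract` binary is reachable from `PATH` (a minimal sketch; the helper name is ours, not part of the project):

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the `tesseract` binary can be found on PATH."""
    return shutil.which("tesseract") is not None

if tesseract_available():
    print("Tesseract found at:", shutil.which("tesseract"))
else:
    print("Tesseract not found; add its install folder to PATH.")
```

If this prints "not found" even though Tesseract is installed, the install folder is missing from `PATH`.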
The goal of this project is to extract the labels from these images. The dataset was provided by our teacher and is not publicly available. It consists of 115 images without labels (i.e. no bounding boxes).
This is a labeled dataset of 4,370 images of herbarium specimens. We use it to train a YOLOv11 model. To build this dataset, run:

    python build_dataset.py

Important: The website hosting the data was down for a few days. If that happens to you, the script won't work, so you won't be able to train a YOLOv11 model. In that case, you can use the YOLOv5 model directly.
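Since the source website can be intermittently unavailable, a small retry helper can make the download step more robust. This is a sketch, not part of `build_dataset.py`; the function name and parameters are ours:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3, backoff: float = 0.5) -> T:
    """Call `fn`, retrying with exponential backoff if it raises OSError."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except OSError as error:  # network errors, timeouts
            last_error = error
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"All {attempts} attempts failed") from last_error
```

For example, a download could be wrapped as `with_retries(lambda: urllib.request.urlopen(url, timeout=30).read())`.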
This dataset has the same purpose as "French Herbarium Without Labels". We took the first 100 images of the validation split of the MELU Object Detection dataset, so that we don't have too many images. "French Herbarium Without Labels" is useful to test the model on a dataset very different from the training one, and to test OCR methods on French text. This dataset, on the other hand, is in English and has the same format as the training split, so we get better results on it.
A folder `yolo_models` containing YOLO model weights (`.pt`) is expected in the root directory.

- YOLOv5 fine-tuned: https://figshare.unimelb.edu.au/articles/dataset/_strong_Data_available_for_Identification_of_herbarium_specimen_sheet_components_from_high-resolution_images_using_deep_learning_YOLOv5_Best_model_weights_for_MELU_trained_object_detection_model_strong_/23597034?file=41395557
- YOLOv11 fine-tuned (ours): use the notebook `train_yolov11.ipynb` to train it on the "MELU Object Detection" dataset (see below).
We used the hyperparameters from https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.10395, that is:

- Number of epochs: 200 (with early stopping, but the patience was not given)
- Model: `yolo11l.pt` (YOLOv11 large)
- Image size: 640 pixels
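These hyperparameters map to an Ultralytics training call roughly like the following. This is a sketch: the dataset YAML path and the patience value are assumptions (the paper does not report the patience), not values taken from this repository:

```python
# Hyperparameters from the paper (epochs, image size); the dataset config
# path and the early-stopping patience are assumptions.
HYPERPARAMS = {
    "data": "melu_object_detection.yaml",  # hypothetical dataset config path
    "epochs": 200,   # from the paper, with early stopping
    "imgsz": 640,    # image size from the paper
    "patience": 50,  # assumed value, not given in the paper
}

# Training call (requires the `ultralytics` package and the weights file):
# from ultralytics import YOLO
# YOLO("yolo11l.pt").train(**HYPERPARAMS)
```

See `train_yolov11.ipynb` for the exact settings actually used.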
As in the paper associated with the YOLOv5 version, we take 1,000 images for validation and the rest for training.
Note that if albumentations is installed, YOLO will automatically use it for data augmentation (blur, etc.). Since those augmentations were not used in the paper and are irrelevant for our dataset, we didn't install it.
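The train/validation split described above can be sketched as follows (a minimal sketch; the function name, file pattern, and seed are ours, not taken from the notebook):

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, n_val: int = 1000, seed: int = 0):
    """Shuffle image paths and hold out `n_val` of them for validation."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    return images[n_val:], images[:n_val]  # (train, val)
```

Fixing the seed keeps the split reproducible across runs.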
Check the notebook `train_yolov11.ipynb` for more details.
Running

    python extract_labels.py

will extract the label images. It uses a YOLO model to predict label bounding boxes, then crops the images using those boxes.
Check the notebook `extract_labels.ipynb` for more details.
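The cropping step can be sketched as below, assuming the boxes come back as `(x1, y1, x2, y2)` pixel coordinates; the function name is ours, not the script's:

```python
from PIL import Image

def crop_boxes(image_path, boxes):
    """Crop one sub-image per (x1, y1, x2, y2) pixel box predicted by YOLO."""
    image = Image.open(image_path)
    return [image.crop((x1, y1, x2, y2)) for x1, y1, x2, y2 in boxes]
```

Each returned crop can then be saved or passed directly to an OCR method.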
You can either use the notebooks or the OCR classes in the `ocr` module.
    python -m streamlit run src/herbarium_labels_extractor/start_app.py