Collection Mining – Entomological Label Information Extraction

A Python package developed at the Berlin Natural History Museum

Overview

This package provides a modular framework for the semi-automated processing of entomological specimen labels. It uses artificial intelligence to perform label detection, classification, rotation correction, OCR, and clustering, laying the groundwork for comprehensive information extraction. It is designed to work in conjunction with the python-mfnb package for downstream clustering tasks.

Key Features

AI-Powered Label Classification: Three TensorFlow-based classifiers tailored to different label types.
OCR Pipeline: Supports both Tesseract and the Google Cloud Vision API.
Modular Components: For classification, preprocessing, text extraction, and postprocessing.
High Efficiency: Optimized for digitizing large-scale entomological collections.

For full methodological details, see our upcoming paper:

Margot Belot et al. (in preparation), A Semi-Automated Pipeline for High Throughput Information Extraction of Insect Specimen Labels

Or browse our online documentation.

Installation

Create a Python 3.10 environment (recommended to ensure dependency compatibility):

conda create --name ELIE python=3.10
conda activate ELIE

Clone the repository:

git clone https://github.com/MfN-Berlin/label_processing.git
cd label_processing

Install the package:

pip install .

Install Tesseract (optional; required only if you use Tesseract OCR):

Ubuntu/Debian:

sudo apt install tesseract-ocr

macOS:

brew install tesseract
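
After installation, it can be useful to confirm that the Tesseract executable is visible from the Python environment. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are hypothetical and not part of this package.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`, or an empty string."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it before using Tesseract OCR.")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```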

Input Image Guidelines

The modules work best on JPEG images that adhere to standardized imaging practices.

Recommended image specifications:

High-resolution JPEG format (300 DPI)
Clear separation between labels
Horizontal text alignment
No insects or other elements in the image
Consistent label positioning across images
Preferably black background (white is acceptable)
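
The format and resolution items above can be checked before running the pipeline. The sketch below is an illustration, not part of this package: it reads the density fields from a JFIF header using only the standard library, and the function names `jfif_density` and `meets_dpi` are hypothetical. It assumes the file starts with a JFIF APP0 segment; JPEGs that carry only Exif metadata will return `None` and need a different check. Note that the stored density is metadata and does not by itself guarantee image sharpness.

```python
import struct


def jfif_density(data: bytes):
    """Return (unit, x_density, y_density) from a JFIF header, or None.

    unit: 0 = no unit (aspect ratio only), 1 = dots per inch, 2 = dots per cm.
    """
    # A JPEG stream starts with the SOI marker FF D8.
    if len(data) < 18 or data[:2] != b"\xff\xd8":
        return None
    # In a JFIF file, the APP0 segment (marker FF E0) follows SOI directly
    # and carries the identifier "JFIF\0" after its 2-byte length field.
    if data[2:4] != b"\xff\xe0" or data[6:11] != b"JFIF\x00":
        return None
    # After the 2-byte version: unit (1 byte), Xdensity (2), Ydensity (2), big-endian.
    unit, x_density, y_density = struct.unpack(">BHH", data[13:18])
    return unit, x_density, y_density


def meets_dpi(data: bytes, minimum_dpi: int = 300) -> bool:
    """Return True if the JFIF header declares at least `minimum_dpi` on both axes."""
    density = jfif_density(data)
    if density is None:
        return False
    unit, x_density, y_density = density
    if unit == 1:  # dots per inch
        return min(x_density, y_density) >= minimum_dpi
    if unit == 2:  # dots per centimetre, converted to dots per inch
        return min(x_density, y_density) * 2.54 >= minimum_dpi
    return False  # unit 0 gives only an aspect ratio, not a resolution
```

In practice, one would read the first few dozen bytes of each image file in binary mode and pass them to `meets_dpi`.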

Google Cloud Vision Setup

To use the Google Vision API:

Create a Google Cloud account.
Follow the setup instructions here: Google Vision API setup.
Generate and download a credentials JSON file.
Pass this file as an input to the vision.py script.
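
Before passing the credentials file to the script, a quick sanity check can catch a wrong or truncated download. The sketch below assumes the downloaded file is a Google Cloud service-account key, which normally contains the fields listed in `REQUIRED_FIELDS`; if a different credential type is used, the field list will not apply. The helper names are hypothetical and not part of this package.

```python
import json

# Fields normally present in a Google Cloud service-account key file
# (assumption: the credentials JSON is of this type).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def load_credentials(path):
    """Parse the credentials JSON file and return its content as a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_credential_fields(credentials: dict):
    """Return the expected fields that are absent from the parsed credentials."""
    return [field for field in REQUIRED_FIELDS if field not in credentials]
```

An empty list from `missing_credential_fields(load_credentials(path))` suggests the file has the expected shape; it does not verify that the key is valid or that the Vision API is enabled for the project.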

Installing zbar for QR Code Recognition

To enhance QR code detection using zbar, install the following dependencies:

macOS:

brew install zbar

Linux:

sudo apt-get install libzbar0

On Windows, zbar is already bundled with the Python binaries.
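
On macOS and Linux, one way to confirm that the zbar shared library is discoverable is to ask Python's standard library to locate it. This is a minimal sketch; the helper name `find_zbar` is hypothetical. On Windows, where the library ships with the Python binaries, this lookup may return `None` even though QR code detection works.

```python
from ctypes.util import find_library


def find_zbar():
    """Return the name or path of the zbar shared library, or None if it is not found."""
    return find_library("zbar")


if __name__ == "__main__":
    library = find_zbar()
    if library is None:
        print("zbar shared library not found by the system loader.")
    else:
        print(f"zbar shared library found: {library}")
```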

Docker Compose Pipeline Execution

This repository includes Dockerfiles for each processing module, as well as a Docker Compose setup to orchestrate them.

Available Compose Modes:

Multi-label: Full pipeline including label detection.
Single-label: Runs the pipeline without label detection.

Usage:

From the root directory:

Linux:

docker compose -f .yaml up --build

macOS:

docker-compose -f .yaml up --build

Shared volumes are handled via the data/ directory.
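
When the pipeline is launched from a script rather than by hand, the two command forms above can be assembled programmatically. The sketch below is an illustration only: `compose_command` is a hypothetical helper, and the Compose file name is left as a parameter because it depends on the mode you choose.

```python
def compose_command(compose_file, use_plugin=True, build=True):
    """Build the argument list for bringing up a Compose file.

    use_plugin=True  -> `docker compose ...` (Compose as a Docker CLI plugin)
    use_plugin=False -> `docker-compose ...` (standalone binary)
    """
    base = ["docker", "compose"] if use_plugin else ["docker-compose"]
    command = base + ["-f", compose_file, "up"]
    if build:
        # Rebuild the module images before starting the containers.
        command.append("--build")
    return command


# Example (not executed here): run from the repository root.
# import subprocess
# subprocess.run(compose_command("path/to/compose-file.yaml"), check=True)
```

Passing the arguments as a list avoids shell quoting issues if the file path contains spaces.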

Contact

For questions or contributions, please contact:

Margot Belot – margot.belot@mfn.berlin
Leonardo Preuss – preuss.leonardo@mfn.berlin