Collection Mining – Entomological Label Information Extraction

A Python package developed at the Berlin Natural History Museum

Overview

This package provides a modular framework for the semi-automated processing of entomological specimen labels. It uses artificial intelligence to perform label detection, classification, rotation correction, OCR, and clustering, laying the groundwork for comprehensive information extraction. It is designed to work in conjunction with the python-mfnb package for downstream clustering tasks.

Key Features

  • AI-Powered Label Classification: Three TensorFlow-based classifiers tailored to different label types.

  • OCR Pipeline: Supports both Tesseract and the Google Cloud Vision API.

  • Modular Components: For classification, preprocessing, text extraction, and postprocessing.

  • High Efficiency: Optimized for digitizing large-scale entomological collections.

For full methodological details, see our upcoming paper:

Margot Belot et al. (in preparation), A Semi-Automated Pipeline for High Throughput Information Extraction of Insect Specimen Labels

Or browse our online documentation.

Installation

  1. Create a Python 3.10 environment (recommended to ensure dependency compatibility):

conda create --name ELIE python=3.10
conda activate ELIE
  2. Clone the repository:

git clone https://github.com/MfN-Berlin/label_processing.git
cd label_processing
  3. Install the package:

pip install .
  4. Install Tesseract (optional; required only if you use Tesseract OCR):

  • Ubuntu/Debian:

sudo apt install tesseract-ocr
  • macOS:

brew install tesseract
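
After installing, it can help to confirm that the Tesseract binary is visible from the Python environment. The following is a minimal sketch, not part of this package: the helper name tesseract_version is hypothetical, and it only assumes that the executable is called tesseract and is on the PATH.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not found.

    Hypothetical helper for illustration; not part of this package.
    """
    exe = shutil.which("tesseract")
    if exe is None:
        return None  # Tesseract is not on the PATH
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the version banner to stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version is not None else "Tesseract not found on PATH")
```

A return value of None suggests that Tesseract still needs to be installed or added to the PATH.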

Input Image Guidelines

The modules work best on JPEG images that adhere to standardized practices, such as those from:

Recommended image specifications:

  • High-resolution JPEG format (300 DPI)

  • Clear separation between labels

  • Horizontal text alignment

  • No insects or other elements in the image

  • Consistent label positioning across images

  • Preferably black background (white is acceptable)
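
Before running the modules on a large batch, a quick format check can catch files that are not actually JPEGs. The sketch below uses only the standard library and tests the three leading bytes that JPEG files start with; the function name find_non_jpeg is hypothetical. It does not check resolution, background, or label layout.

```python
from pathlib import Path

# JPEG files begin with the start-of-image marker FF D8, followed by FF.
JPEG_SIGNATURE = b"\xff\xd8\xff"


def find_non_jpeg(folder):
    """Return the sorted names of files in `folder` that lack a JPEG signature.

    Hypothetical helper for illustration; not part of this package.
    """
    rejected = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        with path.open("rb") as handle:
            if handle.read(3) != JPEG_SIGNATURE:
                rejected.append(path.name)
    return rejected
```

Calling find_non_jpeg on the input folder returns an empty list when every file looks like a JPEG.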

Google Cloud Vision Setup

To use the Google Vision API:

  1. Create a Google Cloud account.

  2. Follow the setup instructions here: Google Vision API setup.

  3. Generate and download a credentials JSON file.

  4. Pass this file as an input to the vision.py script.
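
A malformed credentials file is a common source of authentication errors, so it may be worth sanity-checking the JSON before passing it to vision.py. This sketch assumes the file is a Google Cloud service-account key, which normally contains the fields listed below; the helper name missing_credential_keys is hypothetical and the check does not contact Google.

```python
import json

# Fields normally present in a service-account key file (assumption: the
# downloaded credentials are of this type).
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def missing_credential_keys(path):
    """Return the sorted list of expected keys absent from the JSON file.

    Hypothetical helper for illustration; not part of this package.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        return sorted(REQUIRED_KEYS)  # not a JSON object at all
    return sorted(REQUIRED_KEYS - set(data))
```

An empty list means the file has the expected shape; it does not guarantee that the key is valid or that the Vision API is enabled for the project.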

Installing zbar for QR Code Recognition

To enhance QR code detection using zbar, install the following dependencies:

  • macOS:

brew install zbar
  • Linux:

sudo apt-get install libzbar0

On Windows, zbar is already bundled with the Python binaries.
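
On macOS and Linux, one way to check that the zbar shared library can be located is to ask the standard library's ctypes lookup. This is a rough sketch with a hypothetical helper name; on Windows, where the library ships with the Python binaries, this lookup may return False even though QR code detection works.

```python
from ctypes.util import find_library


def zbar_available():
    """Return True if a shared library named "zbar" can be located.

    Hypothetical helper for illustration; not part of this package.
    """
    return find_library("zbar") is not None


if __name__ == "__main__":
    print("zbar found" if zbar_available() else "zbar not found")
```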

Docker Compose Pipeline Execution

This repository includes Dockerfiles for each processing module, as well as a Docker Compose setup to orchestrate them.

Available Compose Modes:

  • Multi-label: Full pipeline including label detection.

  • Single-label: Runs the pipeline without label detection.

Usage:

From the root directory:

  • Linux:

docker compose -f .yaml up --build
  • macOS:

docker-compose -f .yaml up --build

Shared volumes are handled via the data/ directory.
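
For scripted runs, the same commands can be launched from Python. The sketch below is an illustration only: compose_command and run_pipeline are hypothetical names, the Compose file is passed in by the caller because its name depends on the chosen mode, and the legacy flag switches between the docker compose and docker-compose spellings shown above.

```python
import subprocess


def compose_command(compose_file, legacy=False):
    """Build the argument list for starting the pipeline with a rebuild.

    Hypothetical helper for illustration; not part of this package.
    """
    base = ["docker-compose"] if legacy else ["docker", "compose"]
    return base + ["-f", compose_file, "up", "--build"]


def run_pipeline(compose_file, legacy=False):
    """Run the Compose setup from the repository root; raises on failure."""
    return subprocess.run(compose_command(compose_file, legacy), check=True)
```

Building the argument list separately from running it makes the command easy to log or inspect before any containers are started.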

Contact

For questions or contributions, please contact: