Creating Custom Meta Extractors

Introduction

Meta extractors are microservices that analyze files and extract metadata. They communicate with the core system through RabbitMQ message queues, making them easy to add, remove, or modify without affecting the rest of the system.

Key Features

  • Language-agnostic: Write extractors in any language that supports RabbitMQ
  • Hot-pluggable: Add or remove extractors without system restart
  • Independent scaling: Run multiple instances of the same extractor
  • Fault-tolerant: Failed extractions don't affect the core system

Architecture Overview

Meta extractors follow a simple publish/subscribe pattern:

  1. Core system publishes file information to a RabbitMQ queue
  2. Extractors subscribe to relevant queues based on file types
  3. Extractors process files and extract metadata
  4. Results are published back to a response queue
  5. Core system processes and stores the extracted metadata
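
The five steps above can be sketched in-process with the standard library standing in for RabbitMQ. This is only an illustration of the message flow: the queue objects, the function names, and the message fields (path, tags) are assumptions, not the actual wire format used by the core system.

```python
import json
import queue

# Stand-ins for RabbitMQ queues; names and message fields are hypothetical.
file_queue = queue.Queue()      # core system -> extractor
response_queue = queue.Queue()  # extractor -> core system


def core_publish(path):
    """Step 1: the core system publishes file information."""
    file_queue.put(json.dumps({"path": path}))


def extractor_step(extract):
    """Steps 2-4: consume one message, extract metadata, publish the result."""
    message = json.loads(file_queue.get())
    tags = extract(message["path"])
    response_queue.put(json.dumps({"path": message["path"], "tags": tags}))


def core_collect():
    """Step 5: the core system reads the extracted metadata."""
    return json.loads(response_queue.get())


core_publish("notes.txt")
extractor_step(lambda path: ["text", path.rsplit(".", 1)[-1]])
result = core_collect()
print(result)
```

Because the extractor only touches the two queues, it can be replaced or run in several copies without the core system changing, which is the property the architecture relies on.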

Registration & File Type Advertisement

Before processing files, meta extractors must register themselves with the meta manager and advertise their supported file types. This allows the meta manager to route files to the appropriate extractors.

Registration Process

  1. Extractor connects to RabbitMQ
  2. Sends registration message with supported file types
  3. Meta manager acknowledges registration
  4. Meta manager starts routing matching files to the extractor
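
To make the handshake concrete, here is a minimal in-memory sketch of what the meta manager side might do with a registration message. MetaManagerStub, its methods, and the acknowledgement fields are hypothetical; only the module_id and supported_extensions keys come from the example extractor.

```python
import os


class MetaManagerStub:
    """In-memory stand-in for the meta manager's registry (hypothetical)."""

    def __init__(self):
        self.by_extension = {}  # extension -> list of module IDs

    def register(self, message):
        """Record the advertised extensions and return an acknowledgement."""
        module_id = message["module_id"]
        for ext in message["supported_extensions"]:
            ids = self.by_extension.setdefault(ext.lower(), [])
            if module_id not in ids:
                ids.append(module_id)
        return {"module_id": module_id, "status": "registered"}

    def extractors_for(self, filename):
        """Return the module IDs that advertised this file's extension."""
        ext = os.path.splitext(filename)[1].lower()
        return list(self.by_extension.get(ext, []))


manager = MetaManagerStub()
ack = manager.register({"module_id": "txt-1", "supported_extensions": [".txt"]})
print(ack)
print(manager.extractors_for("report.TXT"))
print(manager.extractors_for("photo.jpg"))
```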

Basic Structure

Starting from the examples provided in /extractors in the GitHub repository, you only need to change two things:

# Process the text file
##########################EDITME##############################
##### CHANGE THIS FUNCTION TO EXTRACT METADATA FROM THE FILE #####
##############################################################
tags = process_text_file(local_file_path)
tags = dedupe_tags(tags)

The process_text_file function is the one to change. Implement the extraction however you like, as long as it returns a list of tags for the resource.
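
As an illustration of the expected shape (a file path in, a list of tags out), the sketch below tags a text file with its most frequent words. The extraction logic, the max_tags parameter, and this version of dedupe_tags are assumptions made for the example; the repository's own helpers may differ.

```python
import os
import re
import tempfile
from collections import Counter


def process_text_file(local_file_path, max_tags=5):
    """Return the most frequent words (4+ letters) in the file as tags."""
    with open(local_file_path, encoding="utf-8", errors="replace") as handle:
        words = re.findall(r"[a-z]{4,}", handle.read().lower())
    return [word for word, _ in Counter(words).most_common(max_tags)]


def dedupe_tags(tags):
    """Drop duplicate tags while preserving first-seen order."""
    return list(dict.fromkeys(tags))


# Try it on a throwaway file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("rabbit queue rabbit message queue rabbit")

tags = dedupe_tags(process_text_file(tmp.name))
os.remove(tmp.name)
print(tags)
```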

# Register this module with its supported extensions
registration_message = {
    'module_id': module_id,
    'supported_extensions': ['.txt']  # Text extractor only supports .txt files
        #####################EDITME#############################
        ##### EDIT THIS BASED ON SUPPORTED FILE EXTENSIONS #####
        ########################################################
}

The registration_message tells the meta manager which files to send to this module. Edit supported_extensions so that it lists every file extension your extractor handles.

Optional: Change MODULE_ID to a name that describes your module. The default Python script checks whether that ID is free; if it is already taken, the script obtains another free ID from the meta manager.
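
The ID fallback can be pictured roughly as follows. This is a sketch only: claim_module_id and the numbered-suffix scheme are hypothetical, and in the real system the replacement ID comes from the meta manager rather than being computed locally.

```python
def claim_module_id(preferred, taken):
    """Return the preferred ID if free, else the first free numbered variant."""
    if preferred not in taken:
        return preferred
    suffix = 2
    while f"{preferred}-{suffix}" in taken:
        suffix += 1
    return f"{preferred}-{suffix}"


taken_ids = {"txt-extractor", "txt-extractor-2"}
free_id = claim_module_id("pdf-extractor", taken_ids)
fallback_id = claim_module_id("txt-extractor", taken_ids)
print(free_id)
print(fallback_id)
```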

Important Notes

  • Each extractor instance gets a unique ID and queue
  • Multiple extractors can support the same file types
  • Meta manager uses round-robin distribution for files with multiple supporting extractors
  • Extractors can update their supported types by re-registering
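
Round-robin distribution among extractors that support the same file type can be sketched with itertools.cycle. The make_dispatcher helper and the module IDs are hypothetical, and the meta manager's actual implementation may differ.

```python
from itertools import cycle


def make_dispatcher(module_ids):
    """Return a function that yields the next module ID in round-robin order."""
    rotation = cycle(module_ids)
    return lambda: next(rotation)


next_extractor = make_dispatcher(["txt-1", "txt-2", "txt-3"])
files = ["a.txt", "b.txt", "c.txt", "d.txt"]
assignments = {name: next_extractor() for name in files}
print(assignments)
```

With three instances and four files, the fourth file wraps around to the first instance, so load spreads evenly as instances are added.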

Implementation Guide

1. Basic Structure

Create a new Python file based on the /extractors/txt/main.py structure.

2. Install dependencies

Install the dependencies listed in /extractors/txt/requirements.txt by running pip install -r requirements.txt.

That file contains only the bare minimum; add any packages your own extractor needs to it.

3. Configuration

Create a .env file based on the .env.example from the /extractors/txt folder.

RABBIT_HOST=127.0.0.1
RABBIT_USER=dp-processor
RABBIT_PASS=dp-processor
RABBIT_VHOST=/
RABBIT_PORT=5672
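
One way these settings could be read and turned into a connection URL is sketched below. The helper names are hypothetical, the defaults simply mirror the example values above, and the example extractor may load its .env file differently. In an AMQP URI the default virtual host / has to be percent-encoded as %2F.

```python
import os
from urllib.parse import quote


def load_rabbit_config(env=None):
    """Read the RABBIT_* settings, falling back to the example values."""
    env = os.environ if env is None else env
    return {
        "host": env.get("RABBIT_HOST", "127.0.0.1"),
        "user": env.get("RABBIT_USER", "dp-processor"),
        "password": env.get("RABBIT_PASS", "dp-processor"),
        "vhost": env.get("RABBIT_VHOST", "/"),
        "port": int(env.get("RABBIT_PORT", "5672")),
    }


def amqp_url(cfg):
    """Build an AMQP URI, percent-encoding credentials and virtual host."""
    user = quote(cfg["user"], safe="")
    password = quote(cfg["password"], safe="")
    vhost = quote(cfg["vhost"], safe="")
    return f"amqp://{user}:{password}@{cfg['host']}:{cfg['port']}/{vhost}"


# Pass an explicit mapping here so the demo does not depend on the shell.
config = load_rabbit_config({"RABBIT_HOST": "127.0.0.1", "RABBIT_PORT": "5672"})
url = amqp_url(config)
print(url)
```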

4. Running the extractor

Start the extractor with python main.py.

Connecting to the System

Steps to Connect:

  1. Ensure your extractor is running and connected to RabbitMQ
  2. The system will automatically detect new extractors
  3. Test your extractor using the CLI:
om --upload test_file.jpg  # Upload a test file
om --rescan test_file.jpg  # Force metadata extraction