Creating Custom Meta Extractors

Introduction

Meta extractors are microservices that analyze files and extract metadata. They communicate with the core system through RabbitMQ message queues, making them easy to add, remove, or modify without affecting the rest of the system.

Key Features

Language-agnostic: Write extractors in any language that supports RabbitMQ
Hot-pluggable: Add or remove extractors without system restart
Independent scaling: Run multiple instances of the same extractor
Fault-tolerant: Failed extractions don't affect the core system

Architecture Overview

Meta extractors follow a simple publish/subscribe pattern:

1. The core system publishes file information to a RabbitMQ queue
2. Extractors subscribe to the relevant queues based on file types
3. Extractors process the files and extract metadata
4. Results are published back to a response queue
5. The core system processes and stores the extracted metadata
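
The round trip above can be sketched as a single handler that an extractor would call for each delivered message. This is only an illustration, not the project's actual wire format: the helper name handle_request and the JSON fields resource_id, file_path, tags, and error are assumptions made for the example.

```python
import json
from typing import Callable, List


def handle_request(body: bytes, extract: Callable[[str], List[str]]) -> bytes:
    """Turn one request message into one response message.

    The JSON field names used here are hypothetical placeholders.
    """
    request = json.loads(body)
    response = {"resource_id": request.get("resource_id")}
    try:
        response["tags"] = extract(request["file_path"])
    except Exception as exc:
        # A failed extraction is reported in the response instead of
        # crashing the extractor, so the core system is not affected.
        response["tags"] = []
        response["error"] = str(exc)
    return json.dumps(response).encode("utf-8")
```

In a running extractor, a function like this would be called from the RabbitMQ consumer callback, and the returned bytes would be published to the response queue.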

Registration & File Type Advertisement

Before processing files, meta extractors must register themselves with the meta manager and advertise their supported file types. This allows the meta manager to route files to the appropriate extractors.

Registration Process

1. Extractor connects to RabbitMQ
2. Sends registration message with supported file types
3. Meta manager acknowledges registration
4. Meta manager starts routing matching files to the extractor
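
As a rough sketch of step 2, the registration message can be built and published as shown here. The keys module_id and supported_extensions match the example extractor's registration_message; the queue name "registration", the extension normalization, and the helper names build_registration and register are assumptions for illustration. The channel argument is expected to behave like a pika channel, which provides basic_publish(exchange, routing_key, body).

```python
import json
from typing import Any, Iterable


def build_registration(module_id: str, extensions: Iterable[str]) -> bytes:
    """Serialize a registration message with normalized extensions."""
    normalized = sorted({
        ext.lower() if ext.startswith(".") else "." + ext.lower()
        for ext in extensions
    })
    message = {"module_id": module_id, "supported_extensions": normalized}
    return json.dumps(message).encode("utf-8")


def register(channel: Any, module_id: str, extensions: Iterable[str],
             queue: str = "registration") -> None:
    """Publish the registration on an already-open channel.

    The default queue name is a placeholder, not the project's real queue.
    """
    channel.basic_publish(
        exchange="",          # default exchange routes by queue name
        routing_key=queue,
        body=build_registration(module_id, extensions),
    )
```

Calling register again with a different extension list corresponds to the re-registration case: the meta manager receives a fresh message for the same module_id.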

Basic Structure

In the provided examples under /extractors in the GitHub repository, only two things need to change:

# Process the text file
##########################EDITME##############################
##### CHANGE THIS FUNCTION TO EXTRACT METADATA FROM THE FILE #####
##############################################################
tags = process_text_file(local_file_path)
tags = dedupe_tags(tags)

The process_text_file function is the one to change in order to extract metadata from the file. Its extraction logic can work any way you like, as long as it returns a list of tags for the resource.
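
For illustration, here is one possible process_text_file that uses the most frequent words as tags. The frequency heuristic, the max_tags parameter, and the tags_from_text helper are invented for this sketch, and the dedupe_tags shown is only a stand-in for the one shipped with the example extractor, whose exact behavior may differ.

```python
import re
from collections import Counter
from typing import List


def tags_from_text(text: str, max_tags: int = 10) -> List[str]:
    """Return the most frequent words of four or more letters."""
    words = re.findall(r"[a-z]{4,}", text.lower())
    return [word for word, _ in Counter(words).most_common(max_tags)]


def process_text_file(local_file_path: str) -> List[str]:
    """Read the file and return a list of tags for the resource."""
    with open(local_file_path, encoding="utf-8", errors="replace") as handle:
        return tags_from_text(handle.read())


def dedupe_tags(tags: List[str]) -> List[str]:
    """Drop repeated tags while keeping first-seen order (stand-in)."""
    return list(dict.fromkeys(tags))
```

Any other approach (keyword lists, language detection, a model call) fits the same contract, provided the function still returns a list of strings.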

# Register this module with its supported extensions
registration_message = {
    'module_id': module_id,
    #####################EDITME#############################
    ##### EDIT THIS BASED ON SUPPORTED FILE EXTENSIONS #####
    ########################################################
    'supported_extensions': ['.txt']  # Text extractor only supports .txt files
}

The registration_message dictionary tells the meta manager which files to route to this module. Edit supported_extensions to list every file extension your extractor can handle.

Optional: Change MODULE_ID to a name that describes the module. The default Python script checks whether that ID is free; if it is already taken, the script obtains another free ID from the meta manager.

Important Notes

Each extractor instance gets a unique ID and queue
Multiple extractors can support the same file types
Meta manager uses round-robin distribution for files with multiple supporting extractors
Extractors can update their supported types by re-registering
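
The routing behavior described in these notes can be modeled with a small toy class. This is not the meta manager's actual code; RoundRobinRouter and its methods are hypothetical and only illustrate extension-based routing, round-robin selection, and re-registration.

```python
import os
from collections import defaultdict
from typing import Dict, List, Optional


class RoundRobinRouter:
    """Toy model of extension-based routing with round-robin selection."""

    def __init__(self) -> None:
        self._extractors: Dict[str, List[str]] = defaultdict(list)
        self._next: Dict[str, int] = defaultdict(int)

    def register(self, module_id: str, extensions: List[str]) -> None:
        """Register or re-register an extractor, replacing its old list."""
        for ids in self._extractors.values():
            if module_id in ids:
                ids.remove(module_id)
        for ext in extensions:
            self._extractors[ext.lower()].append(module_id)

    def route(self, filename: str) -> Optional[str]:
        """Return the extractor ID that should receive the file, if any."""
        ext = os.path.splitext(filename)[1].lower()
        ids = self._extractors.get(ext, [])
        if not ids:
            return None
        chosen = ids[self._next[ext] % len(ids)]
        self._next[ext] += 1
        return chosen
```

With two extractors registered for .txt, successive .txt files alternate between them, while a file with an unsupported extension is routed to no one.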

Implementation Guide

1. Basic Structure

Create a new Python file based on the /extractors/txt/main.py structure.

2. Install dependencies

Copy /extractors/txt/requirements.txt as a starting point and install the dependencies with pip install -r requirements.txt.

The packages listed there are only the bare minimum; add any libraries your own extraction logic needs.

3. Configuration

Create a .env file based on the .env.example from the /extractors/txt folder.

RABBIT_HOST=127.0.0.1
RABBIT_USER=om-processor
RABBIT_PASS=om-processor
RABBIT_VHOST=/
RABBIT_PORT=5672
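
A minimal sketch of how these values might be read in Python, assuming they have already been loaded into the process environment (for example by the shell or by a tool such as python-dotenv). The function names load_rabbit_settings and amqp_url are hypothetical, and the fallback defaults simply mirror the example values above.

```python
import os
from typing import Mapping
from urllib.parse import quote


def load_rabbit_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the RABBIT_* variables, falling back to the example values."""
    return {
        "host": env.get("RABBIT_HOST", "127.0.0.1"),
        "port": int(env.get("RABBIT_PORT", "5672")),
        "user": env.get("RABBIT_USER", "om-processor"),
        "password": env.get("RABBIT_PASS", "om-processor"),
        "vhost": env.get("RABBIT_VHOST", "/"),
    }


def amqp_url(settings: dict) -> str:
    """Build an AMQP URI; the vhost is percent-encoded, so "/" becomes "%2F"."""
    return "amqp://{user}:{password}@{host}:{port}/{vhost}".format(
        user=quote(settings["user"], safe=""),
        password=quote(settings["password"], safe=""),
        host=settings["host"],
        port=settings["port"],
        vhost=quote(settings["vhost"], safe=""),
    )
```

If the extractor uses pika, a URL in this form can be passed to pika.URLParameters to open the connection.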

4. Running the extractor

Start the extractor with python main.py.

Connecting to the System

Steps to Connect:

1. Ensure your extractor is running and connected to RabbitMQ
2. The system will automatically detect new extractors
3. Test your extractor using the CLI:

om --upload test_file.jpg  # Upload a test file
om --rescan test_file.jpg  # Force metadata extraction