Meta extractors are microservices that analyze files and extract metadata. They communicate with the core system through RabbitMQ message queues, making them easy to add, remove, or modify without affecting the rest of the system.
Meta extractors follow a simple publish/subscribe pattern:
Before processing files, meta extractors must register themselves with the meta manager and advertise their supported file types. This allows the meta manager to route files to the appropriate extractors.
Starting from the example extractors in /extractors in the GitHub repository, you only need to change two things:
tags = process_text_file(local_file_path)
tags = dedupe_tags(tags)
The process_text_file function is where metadata is extracted from the file. Replace its extraction logic with whatever suits your file type; the only requirement is that it returns a list of tags for the resource.
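As a rough illustration of the shape this function can take, the sketch below picks the most frequent words in a text file as tags. The word-frequency logic, the `max_tags` parameter, and this body of `dedupe_tags` are assumptions made for the example, not the repository's implementation. The only contract taken from the text above is that a list of tags comes back.

```python
import re
from collections import Counter


def process_text_file(local_file_path, max_tags=10):
    """Return up to max_tags of the most frequent words in the file (illustrative logic)."""
    with open(local_file_path, "r", encoding="utf-8", errors="ignore") as handle:
        text = handle.read().lower()
    # Keep alphabetic words of four or more letters; an arbitrary choice for this sketch.
    words = re.findall(r"[a-z]{4,}", text)
    return [word for word, _ in Counter(words).most_common(max_tags)]


def dedupe_tags(tags):
    """Drop empty and repeated tags (case-insensitive), keeping first-seen order (assumed behavior)."""
    seen = set()
    unique = []
    for tag in tags:
        cleaned = tag.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

An extractor for another file type would keep the same signature and swap the body, for example reading EXIF fields for images instead of counting words.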
registration_message = {
    'module_id': module_id,
    'supported_extensions': ['.txt']  # Text extractor only supports .txt files
    #####################EDITME#############################
    ##### EDIT THIS BASED ON SUPPORTED FILE EXTENSIONS #####
    ########################################################
}
The registration_message tells the meta manager which file extensions this module handles. Replace the supported_extensions list with the extensions your extractor supports.
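For an extractor that handles several extensions, a small helper can keep the list consistent. The sketch below is only an illustration: `build_registration_message` and the module ID `image-extractor` are hypothetical names, and both the lowercase, dot-prefixed normalization and the JSON encoding are assumptions modeled on the `'.txt'` example above rather than documented requirements of the meta manager.

```python
import json


def build_registration_message(module_id, extensions):
    """Build the registration payload with normalized, de-duplicated extensions."""
    normalized = []
    for ext in extensions:
        ext = ext.strip().lower()
        if not ext:
            continue  # skip blank entries
        if not ext.startswith("."):
            ext = "." + ext  # match the '.txt' style used in the example
        if ext not in normalized:
            normalized.append(ext)
    return {"module_id": module_id, "supported_extensions": normalized}


# Hypothetical image extractor registering three extensions.
message = build_registration_message("image-extractor", ["jpg", ".JPEG", ".png", "jpg"])
body = json.dumps(message)  # JSON is an assumption about the wire format
```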
Optional: Change the MODULE_ID to a name that describes the module. The default Python script checks whether that ID is free; if it is already taken, the script requests another free ID from the meta manager.
Create a new Python file based on the /extractors/txt/main.py structure.
Install the dependencies by running pip install -r requirements.txt, using /extractors/txt/requirements.txt as a starting point.
That file lists only the bare minimum, so add any libraries your own extractor needs.
Create a .env file based on the .env.example from the /extractors/txt folder.
RABBIT_HOST=127.0.0.1
RABBIT_USER=dp-processor
RABBIT_PASS=dp-processor
RABBIT_VHOST=/
RABBIT_PORT=5672
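To show how these variables fit together, the sketch below assembles them into an AMQP connection URL. It is an illustration, not the example extractor's code: `rabbit_url_from_env` is a hypothetical helper, it assumes the variables are already present in the environment (the .env file must be loaded by some other means), and the `guest` fallbacks are only placeholder defaults. Note that the default vhost `/` has to be percent-encoded as `%2F` in the URL.

```python
import os
from urllib.parse import quote


def rabbit_url_from_env(env=None):
    """Build an amqp:// URL from the RABBIT_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("RABBIT_HOST", "127.0.0.1")
    port = int(env.get("RABBIT_PORT", "5672"))  # fails early on a non-numeric port
    # Percent-encode credentials and vhost so special characters survive in the URL.
    user = quote(env.get("RABBIT_USER", "guest"), safe="")
    password = quote(env.get("RABBIT_PASS", "guest"), safe="")
    vhost = quote(env.get("RABBIT_VHOST", "/"), safe="")
    return f"amqp://{user}:{password}@{host}:{port}/{vhost}"
```

A URL in this form can be handed to a RabbitMQ client library; with pika, for instance, it would typically go through `pika.URLParameters`.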
Start the extractor with python main.py.
om --upload test_file.jpg # Upload a test file
om --rescan test_file.jpg # Force metadata extraction