Octopii – An AI-powered Personally Identifiable Information (PII) Scanner

Octopii is an open-source, AI-powered Personally Identifiable Information (PII) scanner that looks for image assets such as government IDs, passports, photos and signatures in a directory.

Each image is imported via OpenCV and the Python Imaging Library (PIL), then cleaned, deskewed and rotated for scanning.

Best case (score >=90): The image is passed to the image classifier, which scans it for features such as the ISO/IEC 7810 card format, colors, location of text, photos, holograms etc. If it is successfully classified as a type of PII, OCR is then performed on it, looking for particular words and strings as a final check. When both checks agree, the result from Octopii is extremely reliable.

Average case (score >=50): The image is partially or incorrectly identified by the image classifier, but the OCR check finds contradicting substrings and reclassifies it.

Worst case (score >=0): The image is identified only by the image classifier; the OCR scan returns no results.

Incorrect classification: A very small model or OCR list can produce false positives, so PII may be classified incorrectly and the results may be inaccurate.

As a final verification step, images are scanned for certain strings to confirm the result from the model.
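The string check described above can be sketched roughly as follows. This is only an illustration: the `KEYWORDS` table, the class names in it and both helper functions are hypothetical and are not taken from Octopii's own definitions. It assumes the OCR step has already produced plain text.

```python
# Hypothetical keyword lists, for illustration only (not Octopii's real definitions).
KEYWORDS = {
    "German Passport": ["reisepass", "bundesrepublik deutschland"],
    "Generic ID Card": ["identity card", "date of birth"],
}


def keyword_hits(ocr_text, keywords):
    """Return the keywords found in the OCR text, ignoring case and extra whitespace."""
    normalized = " ".join(ocr_text.lower().split())
    return [kw for kw in keywords if kw in normalized]


def best_matching_class(ocr_text, table=KEYWORDS):
    """Return (class_name, hits) for the class with the most hits, or (None, []) if nothing matches."""
    best_name, best_hits = None, []
    for name, keywords in table.items():
        hits = keyword_hits(ocr_text, keywords)
        if len(hits) > len(best_hits):
            best_name, best_hits = name, hits
    return best_name, best_hits
```

If the class returned here agrees with the classifier's label, the two checks confirm each other; if it differs, the OCR result can be used to reclassify the image.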

The accuracy of the scan can be determined via the confidence scores in the output. If all the mentioned conditions are met, a score of 100.0 is returned.
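When reading the output, the score thresholds listed earlier can be mapped to their cases with a small helper. This is a minimal sketch: the function name `score_tier` and the assumption that scores lie between 0 and 100 are ours, not part of Octopii.

```python
def score_tier(score):
    """Map a confidence score to the case names used in this article.

    Assumes scores range from 0 to 100 (hypothetical helper, not Octopii code).
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "best"      # classifier and OCR check agree
    if score >= 50:
        return "average"   # OCR check corrected the classifier
    return "worst"         # classifier only, OCR found nothing
```

For example, a score of 100.0 falls in the best case, while a score just under 90 is treated as the average case.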

To train the model, data can also be fed into the model_generator.py script, and the newly improved h5 file can be used.

Octopii currently supports scanning local directories, as well as S3 directories and open directory listings via their URLs.
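To give an idea of what scanning an open directory listing involves, the sketch below extracts image links from the HTML of a listing page using only the Python standard library. It is not Octopii's implementation: the function and class names, and the set of image extensions, are assumptions made for this example, and downloading the page itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumed set of image extensions; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def image_urls_from_listing(base_url, html):
    """Return absolute URLs of the image files linked from a listing page."""
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        extension = posixpath.splitext(urlparse(absolute).path)[1].lower()
        if extension in IMAGE_EXTENSIONS:
            urls.append(absolute)
    return urls
```

Each returned URL could then be downloaded and passed through the same pipeline as a local file.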

Tip: segregate your image assets into folders with the folder name being the same as the class name. You can then drag and drop a folder into the upload dialog.

Note: Only upload images that match the class name. For example, the German Passport class must contain only German Passport pictures. Uploading data to the wrong class will confuse the machine learning algorithms.
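Before uploading, it can help to check that the folder layout matches the class names. The following sketch lists the images found under each class folder; the function name `images_by_class`, the one-folder-per-class layout and the extension list are assumptions made for this example.

```python
from pathlib import Path

# Assumed set of image extensions; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def images_by_class(dataset_root):
    """Map each sub-folder name (the class name) to the image files it contains."""
    root = Path(dataset_root)
    classes = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        classes[folder.name] = sorted(
            f.name
            for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return classes
```

A class that maps to an empty list, or a folder whose name is not one of the intended classes, is worth fixing before training.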

The images used for the model above are not visible to us since they’re in a proprietary format. You can use both dummy and actual PII. Make sure the images are roughly square.

Once you generate models using Teachable Machine, you can improve Octopii’s accuracy via OCR. To do this:

After creating or editing files via the above methods, replace the corresponding files in the models/ directory.

Submit a pull request from your forked repo and we’ll pick it up and replace our current model with it if the changes are large enough.

Note: Please take the following steps to ensure quality

(c) Copyright 2022 RedHunt Labs Private Limited

Author: Owais Shaikh


