REDACT is a novel application for the automatic, AI-powered, universal redaction of sensitive information across text, PDFs, images, audio, and video file types, complemented by a robust fine-tuning process, guardrails, and content safety mechanisms.
REDACT offers the following key features:
- Redacting text: Redact text and
.txtfiles, with offline usage supported, across 116 categories of Personally Identifiable Information (PII). Example categories include account & banking information, Personal information like names and contact information. - Redacting PDFs: Runs Optical Character Recognition (OCR) to extract all text, followed by the aforementioned redaction approach. All 116 categories are supported.
- Redacting Images: Runs OCR again to extract text, followed by the same redaction approach. All 116 categories are supported. Also runs a computer vision model to detect faces in the image, and redacts them.
- Redacting Audio: Runs Speech-to-Text, followed by the same redaction approach. All 116 categories are supported.
- Redacting Videos: Runs Azure Video Indexer to upload the submitted video to an Azure service, redacts all faces, and returns a URL for the same.
- Varying Degrees of Redaction: Different degrees of redaction are supported for text, PDFs, and images, based on the user's needs.
- Fine-tuning the system to your need: After sufficient redactions, previous classifications can be used to fine-tune the model to suit user's needs.
- An easy-to-use Frontend: The application's simple and beautiful frontend allows quick access to redaction, as well as other tools to enhance its quality.
- Text: The agent for all text-based redaction is based on a custom fine-tuned DeBERTa LLM, on token classification into 116 categories, for extracting PII.
- PDFs: Uses Azure Document Intelligence Read API (OCR), that returns text and bounding boxes. The extracted text is then run through the agent, and redaction marks are drawn on the files.
- Images: Similar to PDFs, uses Azure Document Intelligence followed by the agent to redact text. Faces are redacted in images using YOLO (You-Only-Look-Once) v8.
- Audio: Uses Azure Speech's Fast Transcription (Speech-to-Text) API, followed by the agent that selects the text to be redacted.
- Videos: Uses Azure Video Indexer API to redact faces found in videos.
- Degrees of Redaction: Three degrees of redaction are provided wherever the agent is used, that is, for text, PDFs, and images. This works by separating the 116 supported categories into three groups.
- Model Training:The agent can be further fine-tuned, using previously classified data, that is logged via Django Models.
- Content Safety: Implemented through Azure Content Safety that blocks text, PDFs, images, and audio with hate, self-harm, sexual content, and violence.
- Guardrails: Includes simple guardrails for always redacting proper nouns (through NLTK), numbers, URLs, and emails.
- Process Overview: The frontend showcases the process overview that achieved the redaction. This portrays the various roles played by the agent and the guardrails.
- Other enhancements: The user can enter regex patterns and a custom list of words to be redacted.
- Content Safety for Videos
- Threaded processes for redaction services
- API services
- HuggingFace: Runs and trains the agent for classifying PII.
- Azure Document Intelligence: Runs OCR on PDFs and images.
- Azure Speech Service: Runs Speech-to-Text on audio.
- Azure Video Indexer: Redacts faces in videos.
- Azure Content Safety: Implements content safety on text, PDFs, and images.
- Ultralytics: Runs YOLO, for redacting faces in images.
- Django: Backend framework.
- Azure Service Key Configuration: Service keys for Azure Document Intelligence, Speech Service, Video Indexer, and Content Safety must be entered in
redact/app/services/service_keys.json. Document Intelligence, Speech Service, and Content Safety only require an endpoint and a key. Video Indexer requires the name, ID, a subscription ID, and the endpoint. It also requires an Azure Storage Account. - Install all requirements in
requirements.txt - Run the Django app: Run the application via
python redact/app/manage.py runserver.
Please check TRANSPARENCY_FAQS.md for more information on responsible AI use.
├── redact
│ ├── app
│ │ ├── migrations
│ │ │ ├── __init__.py
│ │ │ ├── 0001_initial.py
│ │ │ ├── 0002_rename_classes_modeltrainingdata_label.py
│ │ ├── services
│ │ │ ├── agents.py
│ │ │ ├── db_service.py
│ │ │ ├── guardrails.py
│ │ │ ├── model_service.py
│ │ │ ├── model_training.py
│ │ │ ├── service_keys.json
│ │ │ ├── utils.py
│ │ ├── static
│ │ │ ├── images
│ │ │ │ ├── arrow-up.png
│ │ │ ├── index.css
│ │ ├── templates
│ │ │ ├── index.html
│ │ ├── __init__.py
│ │ ├── admin.py
│ │ ├── apps.py
│ │ ├── models.py
│ │ ├── tests.py
│ │ ├── urls.py
│ │ ├── views.py
│ ├── models
│ │ ├── config.json
│ │ ├── merges.txt
│ │ ├── model.safetensors
│ │ ├── special_tokens_map.json
│ │ ├── tokenizer_config.json
│ │ ├── tokenizer.json
│ │ ├── vocab.json
│ ├── redact
│ │ ├── __init__.py
│ │ ├── asgi.py
│ │ ├── settings.py
│ │ ├── urls.py
│ │ ├── wsgi.py
│ ├── yolo
│ │ ├── yolov8n_100e.pt
│ ├── manage.py
├── .gitattributes
├── .gitignore
├── LICENSE.md
├── README.md
├── TRANSPARENCY_FAQS.md
└── requirements.txt-
redact/app/services/: Contains redaction related modules.
agents.py: Manages the agent, the DeBERTa LLM used in the application.db_service.py: Handles database operations for storing classifications, that can be used to fine-tune the agent later.guardrails.py: Implements guardrails for redaction services.model_service.py: Manages the redaction services and workflows for text, PDFs, images, and videos.model_training.py: Handles fine-tuning the model.service_keys.json: Stores service keys for all Azure services.utils.py: Contains utility functions used across the application.
-
redact/models/: Contains the agent's configuration and weights.
-
redact/yolo/: Contains the model weights for YOLO.
-
redact/manage.py/: Used to run the Django application.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.