REDACT: An AI-powered Universal Redaction Service

Description

REDACT is a novel application for the automatic, AI-powered, universal redaction of sensitive information across text, PDFs, images, audio, and video file types, complemented by a robust fine-tuning process, guardrails, and content safety mechanisms.

Key Features

REDACT offers the following key features:

Redacting text: Redacts plain text and .txt files, with offline usage supported, across 116 categories of Personally Identifiable Information (PII). Example categories include account and banking information, and personal information such as names and contact details.
Redacting PDFs: Runs Optical Character Recognition (OCR) to extract all text, followed by the aforementioned redaction approach. All 116 categories are supported.
Redacting Images: Runs OCR to extract text, followed by the same redaction approach. All 116 categories are supported. Also runs a computer vision model to detect faces in the image and redact them.
Redacting Audio: Runs Speech-to-Text, followed by the same redaction approach. All 116 categories are supported.
Redacting Videos: Uploads the submitted video to Azure Video Indexer, which redacts all faces and returns a URL to the result.
Varying Degrees of Redaction: Different degrees of redaction are supported for text, PDFs, and images, based on the user's needs.
Fine-tuning the system to your needs: After sufficient redactions, previous classifications can be used to fine-tune the model to suit the user's needs.
An easy-to-use Frontend: The application's simple and beautiful frontend allows quick access to redaction, as well as other tools to enhance its quality.

Technical Approach

Text: The agent behind all text-based redaction is a DeBERTa LLM, custom fine-tuned for token classification into 116 PII categories.
PDFs: Uses the Azure Document Intelligence Read API (OCR), which returns text and bounding boxes. The extracted text is then run through the agent, and redaction marks are drawn on the files.
Images: Similar to PDFs, uses Azure Document Intelligence followed by the agent to redact text. Faces are redacted in images using YOLO (You-Only-Look-Once) v8.
Audio: Uses Azure Speech's Fast Transcription (Speech-to-Text) API, followed by the agent that selects the text to be redacted.
Videos: Uses Azure Video Indexer API to redact faces found in videos.
Degrees of Redaction: Three degrees of redaction are provided wherever the agent is used, that is, for text, PDFs, and images. This works by separating the 116 supported categories into three groups.
Model Training: The agent can be further fine-tuned using previously classified data, which is logged via Django models.
Content Safety: Implemented through Azure Content Safety, which blocks text, PDFs, images, and audio containing hate, self-harm, sexual, or violent content.
Guardrails: Includes simple guardrails for always redacting proper nouns (through NLTK), numbers, URLs, and emails.
Process Overview: The frontend shows an overview of the process that produced each redaction, including the roles played by the agent and the guardrails.
Other enhancements: The user can enter regex patterns and a custom list of words to be redacted.
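
The guardrail and custom-pattern items above can be pictured with a small sketch. This is an illustration only, not the project's guardrails.py: the regex patterns, the function names find_spans and apply_guardrails, and the [REDACTED] mask are all assumptions, and the NLTK-based proper-noun guardrail is left out to keep the example dependency-free.

```python
import re

# Illustrative patterns only; the project's real guardrails may differ.
GUARDRAIL_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"(?:https?://|www\.)\S+"),
    "NUMBER": re.compile(r"\d+(?:[.,]\d+)*"),
}


def find_spans(text, user_patterns=(), user_words=()):
    """Return sorted, merged (start, end) spans that should always be redacted."""
    patterns = list(GUARDRAIL_PATTERNS.values())
    # User-supplied regex patterns are compiled as given.
    patterns += [re.compile(p) for p in user_patterns]
    # Custom words are matched literally, case-insensitively, on word boundaries.
    patterns += [
        re.compile(rf"\b{re.escape(w)}\b", re.IGNORECASE) for w in user_words
    ]

    spans = sorted(
        m.span()
        for p in patterns
        for m in p.finditer(text)
        if m.end() > m.start()  # ignore empty matches
    )

    # Merge overlapping or touching spans so each region is masked once.
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def apply_guardrails(text, user_patterns=(), user_words=(), mask="[REDACTED]"):
    """Replace every guardrail span in `text` with `mask`."""
    out, cursor = [], 0
    for start, end in find_spans(text, user_patterns, user_words):
        out.append(text[cursor:start])
        out.append(mask)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

For example, apply_guardrails("ticket AB-123 open", user_patterns=[r"[A-Z]{2}-\d{3}"]) returns "ticket [REDACTED] open": the built-in number pattern and the user pattern overlap, and the merge step masks the combined region once.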

Future Updates

Content Safety for Videos
Threaded processes for redaction services
API services

Frameworks & Cloud Technologies Used

HuggingFace: Runs and trains the agent for classifying PII.
Azure Document Intelligence: Runs OCR on PDFs and images.
Azure Speech Service: Runs Speech-to-Text on audio.
Azure Video Indexer: Redacts faces in videos.
Azure Content Safety: Implements content safety on text, PDFs, and images.
Ultralytics: Runs YOLO, for redacting faces in images.
Django: Backend framework.
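
As a rough sketch of how a HuggingFace token-classification agent and the three degrees of redaction might fit together: the label names, the DEGREE_GROUPS mapping, the helper names, and the assumption that degree N redacts groups 1 through N are all hypothetical, since this README does not list the 116 categories or how they are grouped. The only library call used is the transformers pipeline for token classification, loaded lazily so the masking logic can be tried without the model.

```python
# Hypothetical grouping of PII labels into three degrees; the real 116
# categories and their assignment to groups are not listed in this README.
DEGREE_GROUPS = {
    1: {"ACCOUNTNUMBER", "CREDITCARDNUMBER"},
    2: {"EMAIL", "PHONENUMBER"},
    3: {"FIRSTNAME", "LASTNAME"},
}


def labels_for_degree(degree):
    """Assumption: degree N redacts groups 1..N (cumulative)."""
    labels = set()
    for level in range(1, degree + 1):
        labels |= DEGREE_GROUPS[level]
    return labels


def load_agent(model_dir="redact/models"):
    """Load a token-classification pipeline from a local model directory."""
    # Imported lazily; requires the `transformers` package and model files.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_dir,
        aggregation_strategy="simple",
    )


def redact(text, entities, degree=3):
    """Mask entity spans whose label belongs to the chosen degree.

    `entities` follows the shape the pipeline returns when aggregation is
    enabled: dicts with "entity_group", "start" and "end" keys.
    """
    allowed = labels_for_degree(degree)
    keep = [e for e in entities if e["entity_group"] in allowed]
    # Replace from the end of the text so earlier offsets stay valid.
    for e in sorted(keep, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text
```

With the model files in place, usage would look like agent = load_agent() followed by redact(text, agent(text), degree=2). Keeping the degree filter separate from the model call means the same predictions can be re-rendered at a different degree without running the model again.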

Running the App

Azure Service Key Configuration: Service keys for Azure Document Intelligence, Speech Service, Video Indexer, and Content Safety must be entered in redact/app/services/service_keys.json. Document Intelligence, Speech Service, and Content Safety only require an endpoint and a key. Video Indexer requires the name, ID, a subscription ID, and the endpoint. It also requires an Azure Storage Account.
Install requirements: Install all dependencies listed in requirements.txt.
Run the Django app: Run the application via python redact/manage.py runserver (manage.py sits directly under redact/, as shown in the directory structure below).
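
Before starting the server, it may help to confirm that the key file from the first step is complete. The sketch below is one way to do that. The section names (document_intelligence, speech_service, content_safety, video_indexer) and field names are hypothetical stand-ins based on the requirements described above; the actual layout of service_keys.json should be checked in the repository.

```python
import json

# Hypothetical section and field names; check them against the actual
# service_keys.json shipped with the repository.
REQUIRED_FIELDS = {
    "document_intelligence": {"endpoint", "key"},
    "speech_service": {"endpoint", "key"},
    "content_safety": {"endpoint", "key"},
    "video_indexer": {"name", "id", "subscription_id", "endpoint"},
}


def missing_fields(config):
    """Return {service: sorted list of missing or empty fields}."""
    problems = {}
    for service, fields in REQUIRED_FIELDS.items():
        section = config.get(service) or {}
        missing = sorted(f for f in fields if not section.get(f))
        if missing:
            problems[service] = missing
    return problems


def check_key_file(path="redact/app/services/service_keys.json"):
    """Load the key file and report any incomplete service sections."""
    with open(path, encoding="utf-8") as handle:
        return missing_fields(json.load(handle))
```

An empty dictionary from check_key_file() means every assumed field is present and non-empty.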

Transparency FAQs

Please check TRANSPARENCY_FAQS.md for more information on responsible AI use.

Directory Structure

├── redact
│   ├── app
│   │    ├── migrations
│   │    │    ├── __init__.py
│   │    │    ├── 0001_initial.py
│   │    │    ├── 0002_rename_classes_modeltrainingdata_label.py
│   │    ├── services
│   │    │    ├── agents.py
│   │    │    ├── db_service.py
│   │    │    ├── guardrails.py
│   │    │    ├── model_service.py
│   │    │    ├── model_training.py
│   │    │    ├── service_keys.json
│   │    │    ├── utils.py
│   │    ├── static
│   │    │    ├── images
│   │    │    │    ├── arrow-up.png
│   │    │    ├── index.css
│   │    ├── templates
│   │    │    ├── index.html
│   │    ├── __init__.py
│   │    ├── admin.py
│   │    ├── apps.py
│   │    ├── models.py
│   │    ├── tests.py
│   │    ├── urls.py
│   │    ├── views.py
│   ├── models
│   │    ├── config.json
│   │    ├── merges.txt
│   │    ├── model.safetensors
│   │    ├── special_tokens_map.json
│   │    ├── tokenizer_config.json
│   │    ├── tokenizer.json
│   │    ├── vocab.json
│   ├── redact
│   │    ├── __init__.py
│   │    ├── asgi.py
│   │    ├── settings.py
│   │    ├── urls.py
│   │    ├── wsgi.py
│   ├── yolo
│   │    ├── yolov8n_100e.pt
│   ├── manage.py
├── .gitattributes
├── .gitignore
├── LICENSE.md
├── README.md
├── TRANSPARENCY_FAQS.md
└── requirements.txt

Main Files and Their Purposes

redact/app/services/: Contains redaction-related modules.
- agents.py: Manages the agent, the DeBERTa LLM used in the application.
- db_service.py: Handles database operations for storing classifications, which can be used to fine-tune the agent later.
- guardrails.py: Implements guardrails for redaction services.
- model_service.py: Manages the redaction services and workflows for text, PDFs, images, and videos.
- model_training.py: Handles fine-tuning the model.
- service_keys.json: Stores service keys for all Azure services.
- utils.py: Contains utility functions used across the application.
redact/models/: Contains the agent's configuration and weights.
redact/yolo/: Contains the model weights for YOLO.
redact/manage.py: Used to run the Django application.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

shashwatsaini/REDACT
