transcript-pl

transcript-pl is a Node.js application that processes document files (such as PDFs) by converting them into images and extracting text with OCR (Tesseract and Google Vision). An optional AI step (OpenAI GPT) combines the OCR outputs into an improved transcription in the source language and can translate it into one or more target languages.

Features

PDF to image conversion
OCR with Tesseract and Google Vision
AI-powered transcription (OpenAI GPT)
Configurable via YAML file
CLI interface for flexible usage

Setup

1. Clone the repository

git clone <repository-url>
cd transcript-pl

2. Install dependencies

npm install

3. Download Tesseract traineddata files

npm run setup:tessdata

This script downloads the required .traineddata files for Tesseract OCR into resources/tessdata.

4. Configure the application

Edit the configuration file at src/config/config.yaml to set your API keys, languages, and other options.

For Google Vision, set the path to your API key JSON file.
For OpenAI, set your API key and model.

Usage

Run the application from the command line:

node src/app.js --document <path-to-document> [options]

CLI Arguments

Argument	Alias	Type	Description	Required
--document	-d	string	Path to the document file to process	Yes
--pages	-p	string	Comma-separated list of pages to convert to images and process with AI (e.g., `1,3,5`)	No
--lang	-l	string	Source language code for the document (e.g., `pl`, `en`, `fr`). Used for all OCR engines.	No
--target-langs	-t	string	Comma-separated list of target languages for AI translation (e.g., `fr,en,de`)	No
--example	-e	string	Path to an example file to improve AI transcription	No
--docType	-D	string	Type of document (e.g., `mémoire historique`, `acte de naissance`). Used for AI prompt.	Yes

Example:

node src/app.js --document input/sample.pdf --pages 1,2,3 --lang pl --target-langs fr,en --docType "mémoire historique"
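
For readers who want to see how the arguments above fit together, here is a minimal sketch of parsing them with Node's built-in `util.parseArgs`. It is not the project's actual parser, which is not shown in this README. The helper name `parseCliArgs` and the shape of the returned object are assumptions made for illustration.

```javascript
const { parseArgs } = require('node:util');

// Option table mirroring the CLI arguments documented above.
const options = {
  document: { type: 'string', short: 'd' },
  pages: { type: 'string', short: 'p' },
  lang: { type: 'string', short: 'l' },
  'target-langs': { type: 'string', short: 't' },
  example: { type: 'string', short: 'e' },
  docType: { type: 'string', short: 'D' },
};

// Split "a, b,c" into ['a', 'b', 'c']; a missing value gives an empty list.
function splitList(value) {
  return value ? value.split(',').map((s) => s.trim()).filter(Boolean) : [];
}

// Hypothetical helper: parses argv and normalizes the list-valued arguments.
function parseCliArgs(argv) {
  const { values } = parseArgs({ args: argv, options, strict: true });

  // --document and --docType are the two arguments marked as required.
  for (const required of ['document', 'docType']) {
    if (!values[required]) {
      throw new Error(`Missing required argument --${required}`);
    }
  }

  const pages = splitList(values.pages).map((p) => {
    const n = Number(p);
    if (!Number.isInteger(n) || n < 1) {
      throw new Error(`Invalid page number: ${p}`);
    }
    return n;
  });

  return {
    document: values.document,
    docType: values.docType,
    lang: values.lang, // undefined when omitted
    example: values.example, // undefined when omitted
    pages,
    targetLangs: splitList(values['target-langs']),
  };
}
```

Calling `parseCliArgs(process.argv.slice(2))` with the example command above would yield `pages` as an array of numbers and `targetLangs` as an array of language codes.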

Configuration

All main settings are in src/config/config.yaml:

lang:
- source: language code of the document (used for OCR, required)
- target: array of language codes for AI translation (optional, can be empty)
docType: not set in the YAML file; always pass it on the command line with --docType.
ai: Enable/disable AI, set OpenAI API key, model, etc.
google_vision: Enable/disable, set API key path.
tesseract: Enable/disable.
pdf_to_image: Set DPI for image conversion.
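
To make the settings above more concrete, the following sketch shows one possible in-memory shape of the parsed configuration and a small validator for the rules stated in the list. Only the top-level keys and `lang.source` / `lang.target` come from this README. The field names `enabled` and `dpi`, the value `300`, and the helper names are assumptions, so check src/config/config.yaml for the real ones.

```javascript
// Assumed shape only: `enabled` and `dpi` are hypothetical field names.
const exampleConfig = {
  lang: { source: 'pl', target: ['fr', 'en'] },
  ai: { enabled: true },
  google_vision: { enabled: false },
  tesseract: { enabled: true },
  pdf_to_image: { dpi: 300 },
};

// Hypothetical helper: returns a list of problems, empty when the config is valid.
function validateConfig(config) {
  const errors = [];
  const lang = config.lang || {};

  // The source language is required because every OCR engine uses it.
  if (typeof lang.source !== 'string' || lang.source.length === 0) {
    errors.push('lang.source is required');
  }
  // Target languages are optional, but must be an array when present.
  if (lang.target !== undefined && !Array.isArray(lang.target)) {
    errors.push('lang.target must be an array (it may be empty)');
  }
  // docType belongs on the command line, not in the YAML file.
  if ('docType' in config) {
    errors.push('docType must be passed via --docType, not in the config');
  }
  return errors;
}

// Hypothetical helper: names of the OCR engines switched on in the config.
function enabledOcrEngines(config) {
  return ['tesseract', 'google_vision'].filter((name) =>
    Boolean(config[name] && config[name].enabled)
  );
}
```

With `exampleConfig`, `validateConfig` returns an empty list and `enabledOcrEngines` returns only the Tesseract entry.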

How it works

The PDF is converted to images, and each page is processed by all enabled OCR engines (Tesseract, Google Vision).
For each page, the AI (OpenAI GPT) improves the text in the source language using all available OCR transcriptions.
If target languages are specified, the improved text is then translated by the AI into each target language.
Result:
- If no target language is set, you get only the improved text in the source language.
- If target languages are set, you get both the improved source text and AI translations for each target language.
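
The per-page flow described above can be sketched as follows. This is an illustration under stated assumptions, not the code in src/app.js: `processPage` and its `engines`, `improve`, and `translate` callbacks are hypothetical stand-ins for the Tesseract, Google Vision, and OpenAI calls.

```javascript
// Hypothetical orchestration of a single page image.
async function processPage(pageImage, settings) {
  const { engines, improve, translate, sourceLang, targetLangs = [] } = settings;

  // Run every enabled OCR engine on the same page image in parallel.
  const transcriptions = await Promise.all(
    Object.entries(engines).map(async ([name, recognize]) => ({
      engine: name,
      text: await recognize(pageImage, sourceLang),
    }))
  );

  // The AI step receives all OCR outputs and returns one improved source text.
  const improved = await improve(transcriptions, sourceLang);

  // Translate only when target languages were requested.
  const translations = {};
  for (const lang of targetLangs) {
    translations[lang] = await translate(improved, sourceLang, lang);
  }

  return { source: improved, translations };
}
```

When `targetLangs` is empty, the result carries only the improved source text and an empty `translations` object, which matches the first result case above.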

Scripts

npm run setup:tessdata — Downloads Tesseract .traineddata files for supported languages.

Dependencies

Note: Make sure you have the required API keys for Google Vision and OpenAI before running the application.
