christophe-bazin/transcript

transcript-pl

transcript-pl is a Node.js application that processes document files (such as PDFs) by converting them to images and extracting text with OCR (Tesseract and Google Vision). It can also refine and translate the OCR output with an AI model (OpenAI GPT), and it can be configured for multiple languages.

Features

  • PDF to image conversion
  • OCR with Tesseract and Google Vision
  • AI-powered transcription (OpenAI GPT)
  • Configurable via YAML file
  • Command-line interface for flexible usage

Setup

1. Clone the repository

git clone <repository-url>
cd transcript-pl

2. Install dependencies

npm install

3. Download Tesseract traineddata files

npm run setup:tessdata

This script downloads the required .traineddata files for Tesseract OCR into resources/tessdata.

4. Configure the application

Edit the configuration file at src/config/config.yaml to set your API keys, languages, and other options.

  • For Google Vision, set the path to your API key JSON file.
  • For OpenAI, set your API key and model.

Usage

Run the application from the command line:

node src/app.js --document <path-to-document> [options]

CLI Arguments

| Argument | Alias | Type | Description | Required |
|---|---|---|---|---|
| --document | -d | string | Path to the document file to process | Yes |
| --pages | -p | string | List of pages to process via AI and image generation (e.g., 1,3,5) | No |
| --lang | -l | string | Source language code for the document (e.g., pl, en, fr). Used for all OCR engines. | No |
| --target-langs | -t | string | Comma-separated list of target languages for AI translation (e.g., fr,en,de) | No |
| --example | -e | string | Path to an example file to improve AI transcription | No |
| --docType | -D | string | Type of document (e.g., mémoire historique, acte de naissance). Used for AI prompt. | Yes |

Example:

node src/app.js --document input/sample.pdf --pages 1,2,3 --lang pl --target-langs fr,en --docType "mémoire historique"
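The argument table above can be sketched as a small parser. This is only an illustration built on Node's built-in util.parseArgs (Node 18.3 or later); the project's actual parsing code is not shown here and may use a different library. The helper names parseCliArgs and splitList are hypothetical, and the sketch assumes --pages is a plain comma-separated list of positive integers.

```javascript
// Sketch only: the option table mirrors the README; the parsing code itself
// is an assumption, not the project's actual implementation.
const { parseArgs } = require("node:util");

const cliOptions = {
  document: { type: "string", short: "d" },
  pages: { type: "string", short: "p" },
  lang: { type: "string", short: "l" },
  "target-langs": { type: "string", short: "t" },
  example: { type: "string", short: "e" },
  docType: { type: "string", short: "D" },
};

// Split "a,b,c" into trimmed, non-empty items; a missing value gives [].
function splitList(value) {
  if (!value) return [];
  return value.split(",").map((item) => item.trim()).filter(Boolean);
}

// Hypothetical helper: parse argv (without "node" and the script path)
// into a normalized options object.
function parseCliArgs(argv) {
  const { values } = parseArgs({ args: argv, options: cliOptions, strict: true });

  // --document and --docType are the two arguments marked as required.
  for (const required of ["document", "docType"]) {
    if (!values[required]) {
      throw new Error(`Missing required argument --${required}`);
    }
  }

  // Assumption: page numbers are 1-based integers.
  const pages = splitList(values.pages).map((page) => {
    const n = Number(page);
    if (!Number.isInteger(n) || n < 1) {
      throw new Error(`Invalid page number: ${page}`);
    }
    return n;
  });

  return {
    document: values.document,
    docType: values.docType,
    lang: values.lang,
    pages,
    targetLangs: splitList(values["target-langs"]),
    example: values.example,
  };
}
```

In a script this would typically be called as parseCliArgs(process.argv.slice(2)). Unknown flags raise an error because strict mode is enabled.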

Configuration

All main settings are in src/config/config.yaml:

  • lang:
    • source: language code of the document (used for OCR, required)
    • target: array of language codes for AI translation (optional, can be empty)
  • docType: not set in the YAML file; always provide it with the --docType CLI argument.
  • ai: Enable/disable AI, set OpenAI API key, model, etc.
  • google_vision: Enable/disable, set API key path.
  • tesseract: Enable/disable.
  • pdf_to_image: Set DPI for image conversion.
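As a rough illustration of the rules listed above, the following sketch checks an already-parsed configuration object. The top-level section names (lang, pdf_to_image) come from the list; the dpi key name and the validateConfig function are assumptions made for the example and may not match the real config.yaml or the application's own checks.

```javascript
// Sketch only: section names follow the README; the `dpi` key is an
// assumed name and may differ from the real config.yaml.
function validateConfig(config) {
  if (config === null || typeof config !== "object") {
    return ["config must be an object"];
  }

  const errors = [];
  const lang = config.lang || {};

  // lang.source is required because every OCR engine uses it.
  if (typeof lang.source !== "string" || lang.source === "") {
    errors.push("lang.source is required (e.g. 'pl')");
  }

  // lang.target is optional; a missing or empty list means no translation.
  const target = lang.target === undefined || lang.target === null ? [] : lang.target;
  if (!Array.isArray(target)) {
    errors.push("lang.target must be an array of language codes");
  }

  // docType is CLI-only, so flag it if it shows up in the file.
  if ("docType" in config) {
    errors.push("docType belongs on the command line, not in the YAML file");
  }

  // Assumed key: pdf_to_image.dpi, if present, should be a positive number.
  const dpi = config.pdf_to_image && config.pdf_to_image.dpi;
  if (dpi !== undefined && !(typeof dpi === "number" && dpi > 0)) {
    errors.push("pdf_to_image.dpi must be a positive number");
  }

  return errors;
}
```

An empty array means the object passed these checks; each string describes one problem.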

How it works

  • The PDF is converted to images, and each page is processed by all enabled OCR engines (Tesseract, Google Vision).
  • For each page, the AI (OpenAI GPT) improves the text in the source language using all available OCR transcriptions.
  • If target languages are specified, the improved text is then translated by the AI into each target language.
  • Result:
    • If no target language is set, you get only the improved text in the source language.
    • If target languages are set, you get both the improved source text and AI translations for each target language.
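The per-page flow described above could look roughly like the sketch below. It is not the application's code: engines, improve, and translate are hypothetical injected functions standing in for Tesseract, Google Vision, and the OpenAI calls, and running the OCR engines concurrently is a choice made for the example.

```javascript
// Sketch only: `engines`, `improve` and `translate` are hypothetical injected
// functions; the real application may structure this differently.
async function processPage(pageImage, { engines, improve, translate, targetLangs = [] }) {
  // 1. Run every enabled OCR engine on the same page image.
  //    (Concurrency via Promise.all is an assumption of this sketch.)
  const transcriptions = await Promise.all(
    engines.map(async (engine) => ({
      engine: engine.name,
      text: await engine.recognize(pageImage),
    }))
  );

  // 2. Ask the AI for one improved source-language text, given all OCR outputs.
  const improved = await improve(transcriptions);

  // 3. Translate the improved text into each target language, if any.
  const translations = {};
  for (const lang of targetLangs) {
    translations[lang] = await translate(improved, lang);
  }

  // With no target languages, `translations` stays empty and only the
  // improved source text is returned.
  return { improved, translations };
}
```

Passing the OCR and AI steps in as functions keeps the flow easy to exercise with stubs, without calling any external API.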

Scripts

  • npm run setup:tessdata: downloads Tesseract .traineddata files for supported languages.

Dependencies


Note: Make sure you have the required API keys for Google Vision and OpenAI before running the application.
