A simple command-line tool to convert PDF and EPUB files into Markdown format using the Mistral AI OCR API. This tool also extracts embedded images and saves them in a subdirectory relative to the output markdown file.
You can install the package directly from PyPI using pip:
pip install mistral-pdf-to-markdown
If you want to use the pdf2md command from anywhere on your system without activating a specific virtual environment, the recommended way is to use pipx:
- Install pipx (if you don't have it already). Follow the official pipx installation guide. A common method is:
  python3 -m pip install --user pipx
  python3 -m pipx ensurepath
  (Restart your terminal after running ensurepath.)
- Install the package using pipx:
  pipx install mistral-pdf-to-markdown
This installs the package in an isolated environment but makes the pdf2md command globally available.
Alternatively, if you want to install from the source:
- Clone the repository:
  git clone https://github.com/arcangelo7/mistral-pdf-to-markdown.git
  cd mistral-pdf-to-markdown
- Install dependencies using Poetry:
  poetry install
To convert EPUB files, you need to install pandoc. See the official installation guide for your operating system.
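Before converting EPUB files, it can help to confirm that pandoc is reachable on your PATH. The following is a minimal sketch; `has_command` is a hypothetical helper name used for illustration and is not part of pdf2md.

```shell
# Hypothetical helper: succeeds if the named program is found on PATH.
has_command() {
  command -v "$1" >/dev/null 2>&1
}

if has_command pandoc; then
  echo "pandoc found: EPUB conversion should be available"
else
  echo "pandoc not found: install it before converting EPUB files" >&2
fi
```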
- Set your Mistral API key. You can set it as an environment variable:
  export MISTRAL_API_KEY='your_api_key_here'
  Alternatively, you can create a .env file in the project root directory with the following content:
  MISTRAL_API_KEY=your_api_key_here
  You can also pass the API key directly using the --api-key option.
- Run the conversion:
The convert command processes a single PDF or EPUB file.
poetry run pdf2md convert <path/to/your/document.pdf> [options]
Or, if you have activated the virtual environment (poetry shell):
pdf2md convert <path/to/your/document.pdf> [options]
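If you call pdf2md from your own wrapper script and rely on the environment variable, a small guard can catch a missing key before any file is processed. This is only a sketch: `require_api_key` is a hypothetical helper, and it covers the environment-variable route only, not the .env file or the --api-key option.

```shell
# Hypothetical guard for wrapper scripts: fails if MISTRAL_API_KEY is unset or empty.
# pdf2md can also take the key from a .env file or --api-key,
# so this check is only relevant when you depend on the exported variable.
require_api_key() {
  if [ -z "${MISTRAL_API_KEY:-}" ]; then
    echo "MISTRAL_API_KEY is not set" >&2
    return 1
  fi
}

# Example: run the conversion only when the key is present.
# require_api_key && pdf2md convert ./my_report.pdf
```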
Options for Single File Conversion:
- --output or -o: Specify the path for the output Markdown file. If not provided, it defaults to the same name as the input file but with a .md extension (e.g., document.md).
- --api-key: Provide the Mistral API key directly.
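The default output name described above can be sketched in shell as a simple extension swap. `default_output_path` is a hypothetical helper, not part of pdf2md, and the sketch assumes the input has an extension and that the output lands next to the input file.

```shell
# Hypothetical helper mirroring the documented default:
# replace the input file's extension with .md.
# Assumes the input path ends in an extension such as .pdf or .epub.
default_output_path() {
  printf '%s\n' "${1%.*}.md"
}

default_output_path ./my_report.pdf      # prints ./my_report.md
default_output_path ./books/novel.epub   # prints ./books/novel.md
```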
The convert-dir command processes all PDF and EPUB files in a specified directory.
poetry run pdf2md convert-dir <path/to/directory/with/files> [options]
Or, if you have activated the virtual environment (poetry shell):
pdf2md convert-dir <path/to/directory/with/files> [options]
Options for Directory Conversion:
- --output-dir or -o: Specify the directory where output Markdown files will be saved. If not provided, it defaults to the same directory as the input files.
- --api-key: Provide the Mistral API key directly.
- --max-workers or -w: Maximum number of concurrent conversions (default: 2). Increase this value to process more files in parallel.
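To preview which files a directory run might pick up, you can list the PDF and EPUB files yourself before invoking convert-dir. This is a rough sketch: `list_convertible` is a hypothetical helper, and it assumes a non-recursive, case-insensitive match on the file extension, which may not be exactly how pdf2md selects files.

```shell
# Hypothetical preview: list PDF and EPUB files directly inside a directory.
# Assumption: non-recursive and case-insensitive; pdf2md's own matching may differ.
list_convertible() {
  find "$1" -maxdepth 1 -type f \( -iname '*.pdf' -o -iname '*.epub' \) | sort
}

# Example:
# list_convertible ./documents/
```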
Image Handling:
The script will attempt to extract images embedded in the document.
- Images are saved in a subdirectory named <output_filename_stem>_images (e.g., if the output is report.md, images will be in report_images/).
- The generated Markdown file will contain relative links pointing to the images in this subdirectory.
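The naming rule above can be sketched in shell to predict where images will land for a given output path. `images_dir_for` is a hypothetical helper for illustration only, and it assumes the output file ends in .md.

```shell
# Hypothetical helper mirroring the documented naming rule:
# <output_filename_stem>_images, placed next to the output Markdown file.
images_dir_for() {
  dir=$(dirname "$1")
  stem=$(basename "$1" .md)
  printf '%s/%s_images\n' "$dir" "$stem"
}

images_dir_for ./output/report.md   # prints ./output/report_images
```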
Examples:
# Convert a single PDF file (output: ./my_report.md)
poetry run pdf2md convert ./my_report.pdf
# Convert with custom output path
poetry run pdf2md convert ./my_report.pdf -o ./output/report.md
# Convert all files in a directory with 4 concurrent workers
poetry run pdf2md convert-dir ./documents/ -o ./markdown_output/ -w 4
An example output generated from example.pdf (included in the repository) can be found in example.md, with its corresponding images located in the example_images/ directory.
This project is licensed under the ISC License - see the LICENSE file for details.