Comparison: Basic versus Layout Mode #327

JorjMcKie · 2025-11-17T10:11:35Z

JorjMcKie
Nov 17, 2025
Maintainer

When using PyMuPDF with package PyMuPDF-Layout, many new features become available under PyMuPDF4LLM.

Use the following import statements in your scripts to activate layout execution mode. The first statement is mandatory and the sequence of statements is important.

import pymupdf.layout
import pymupdf4llm  # must be after importing pymupdf.layout

If you are using PyMuPDF-Pro, code your imports as follows:

import pymupdf.layout
import pymupdf.pro
import pymupdf4llm

pymupdf.pro.unlock(key)
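
Because the import order matters, it can be worth checking scripts for it automatically. The following is a minimal sketch, not part of PyMuPDF: the function name `activates_layout_mode` is hypothetical, and it only inspects source text with the standard `ast` module, so nothing is actually imported.

```python
import ast
from typing import Optional


def _first_import_line(tree: ast.AST, module: str) -> Optional[int]:
    """Line number of the first 'import module' / 'from module import ...'."""
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name == module for alias in node.names):
                lines.append(node.lineno)
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0 and node.module == module:
                lines.append(node.lineno)
    return min(lines) if lines else None


def activates_layout_mode(source: str) -> bool:
    """True if 'import pymupdf.layout' appears and no pymupdf4llm import precedes it.

    Simplification: both imports on the same line count as wrong order.
    """
    tree = ast.parse(source)
    layout_line = _first_import_line(tree, "pymupdf.layout")
    llm_line = _first_import_line(tree, "pymupdf4llm")
    if layout_line is None:
        return False  # layout mode is never requested
    return llm_line is None or layout_line < llm_line
```

A script that imports `pymupdf4llm` first, or never imports `pymupdf.layout`, yields `False` under this check.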

New Features

New method .to_text() creates a plain-text version of the document. Tables are written using the tabulate package.
New method .to_json() creates a JSON version of the document's metadata, together with the selected pages.
Improved detection of tables
Improved detection of text paragraphs
Improved detection of titles and section headers
Improved detection of list item hierarchies
New: page header and footer detection, making the margins parameter obsolete
Dynamic OCR invocation: detect whether a page needs OCR and invoke Tesseract. If original text is present but unreadable (too many � characters), only OCR the text -- not the full page. For every page that is being OCRed, a message is issued saying whether "full-page" or "text-only" OCR is happening.
Use of tqdm when available, instead of the built-in progress meter. This improves usage in GUI and Jupyter environments.
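
To make the dynamic OCR decision more concrete, here is a rough sketch of the kind of check described above. It is not the package's actual logic: the function name `ocr_mode`, the 10% threshold and the rule "no extractable text means full-page OCR" are all assumptions chosen for illustration.

```python
REPLACEMENT_CHAR = "\ufffd"  # the "�" character mentioned above


def ocr_mode(page_text: str, max_bad_ratio: float = 0.1) -> str:
    """Classify a page as 'none', 'text-only' or 'full-page' OCR.

    max_bad_ratio is a hypothetical threshold, not a documented value.
    """
    visible = [c for c in page_text if not c.isspace()]
    if not visible:
        # Assumption: a page without any extractable text gets full-page OCR.
        return "full-page"
    bad = sum(1 for c in visible if c == REPLACEMENT_CHAR)
    if bad / len(visible) > max_bad_ratio:
        # Text exists but is largely unreadable: OCR only the text.
        return "text-only"
    return "none"
```

A real implementation would also have to tell blank pages apart from scanned ones, which this sketch does not attempt.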

Postponed and Obsolete Features

The following table shows which parameters are available in each of the three methods to_markdown(), to_text() and to_json().

The entries in the Comments column have the following meaning:

ignored: no meaning in layout mode and ignored when used
postponed: waiting to be ported to layout mode

Parameter	markdown	plain text	json	Comments
doc	✔️	✔️	✔️
header	✔️	✔️	ignored	new: replaces `margins`
footer	✔️	✔️	ignored	new: replaces `margins`
detect_bg_color	❌	❌	❌	ignored
dpi	✔️	✔️	✔️
embed_images	✔️	✔️	✔️
extract_words	🔜	🔜	🔜	postponed
filename	✔️	✔️	✔️
fontsize_limit	❌	❌	❌	ignored
force_text	✔️	✔️	✔️
graphics_limit	❌	❌	❌	ignored
hdr_info	❌	❌	❌	ignored
ignore_alpha	❌	❌	❌
ignore_code	✔️	✔️	✔️
ignore_graphics	❌	❌	❌	ignored
ignore_images	❌	❌	❌	ignored
image_format	✔️	✔️	✔️
image_path	✔️	✔️	✔️
image_size_limit	❌	❌	❌	ignored
margins	❌	❌	❌	ignored
page_chunks	🔜	❌	❌	postponed
page_height	🔜	🔜	🔜	postponed
page_separators	🔜	❌	❌	postponed to version 0.2.3
page_width	🔜	🔜	🔜	postponed
pages	✔️	✔️	✔️
show_progress	✔️	✔️	✔️	uses tqdm if available
table_strategy	❌	❌	❌	ignored
use_glyphs	❌	❌	❌	always output �
write_images	✔️	✔️	✔️
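
The table can also be used programmatically, for example to drop parameters that layout mode would not honor anyway. The sketch below transcribes the table into a lookup. The names `STATUS`, `split_kwargs` and `convert` are hypothetical helpers, not part of PyMuPDF4LLM, and every ❌ cell is treated as "ignored", which the table does not state explicitly for `ignore_alpha` and `use_glyphs`.

```python
OK, SOON, IGN = "ok", "postponed", "ignored"
METHODS = ("markdown", "text", "json")

# Status per parameter as (markdown, text, json), transcribed from the table.
STATUS = {}
for name in ("doc dpi embed_images filename force_text ignore_code image_format "
             "image_path pages show_progress write_images").split():
    STATUS[name] = (OK, OK, OK)
for name in ("detect_bg_color fontsize_limit graphics_limit hdr_info ignore_alpha "
             "ignore_graphics ignore_images image_size_limit margins "
             "table_strategy use_glyphs").split():
    STATUS[name] = (IGN, IGN, IGN)
for name in ("extract_words", "page_height", "page_width"):
    STATUS[name] = (SOON, SOON, SOON)
for name in ("header", "footer"):
    STATUS[name] = (OK, OK, IGN)
for name in ("page_chunks", "page_separators"):
    STATUS[name] = (SOON, IGN, IGN)


def split_kwargs(method, kwargs):
    """Return (usable, skipped); skipped maps parameter name -> status."""
    col = METHODS.index(method)
    usable, skipped = {}, {}
    for name, value in kwargs.items():
        status = STATUS.get(name, ("unknown",) * 3)[col]
        if status == OK:
            usable[name] = value
        else:
            skipped[name] = status
    return usable, skipped


def convert(path, method="markdown", **kwargs):
    """Call to_markdown / to_text / to_json with only the usable parameters."""
    import pymupdf.layout  # must be imported before pymupdf4llm
    import pymupdf4llm

    usable, skipped = split_kwargs(method, kwargs)
    for name, status in skipped.items():
        print(f"{name}: {status} in layout mode, not passed on")
    func = getattr(pymupdf4llm, "to_" + method)
    with pymupdf.open(path) as doc:
        return func(doc, **usable)
```

For instance, `split_kwargs("json", {"header": False, "pages": [0]})` keeps `pages` and reports `header` as ignored. Note that `convert` only activates layout mode if `pymupdf4llm` has not already been imported elsewhere in the process.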

Unavailable Features in Layout Mode

In addition to the ignored parameters shown in the table above, some features are unavailable when PyMuPDF-Layout is active.

Class IdentifyHeaders is unavailable in layout mode. Titles and section headers are detected with much higher precision, and detection does not depend on things like font size -- the approach used in basic mode. However, there is no way to retrieve section header levels. Therefore, only two markdown header level tags are used, "#" for titles and "##" for section headers.
Class TocHeaders is unavailable in layout mode. Titles and section headers are exclusively detected by PyMuPDF-Layout.
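
Since only "#" and "##" occur in layout-mode markdown, a flat outline can be recovered from the output with a few lines of code. This is a minimal sketch operating on a markdown string; the function name `outline` is hypothetical, and it assumes headers appear as plain ATX lines outside fenced code blocks.

```python
import re

# Matches exactly one or two leading '#' followed by a space and some text.
HEADER_RE = re.compile(r"^(#{1,2}) +(.*\S)\s*$")


def outline(markdown: str):
    """Return a list of ('title' | 'section', text) tuples in document order."""
    result = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # skip anything inside code fences
            continue
        if in_fence:
            continue
        match = HEADER_RE.match(line)
        if match:
            kind = "title" if len(match.group(1)) == 1 else "section"
            result.append((kind, match.group(2)))
    return result
```

Any inline markup inside a header line (bold markers, for example) is returned unchanged.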

General Comments

While PyMuPDF-Layout is AI-empowered, it is different from most other tools that employ artificial intelligence.

It is not vision-based: its models do not depend on rendered page images.

Its Graph Neural Networks work directly on PDF internals, combining precision with speeds up to 10 times higher than those of vision-based tools.

Here is a short list of the characteristics:

Local execution: everything happens on your machine - no access to the internet or other external resources. Your data are never leaked to the outside.
Resource efficiency: only uses the CPU, not the GPU. There are no extraordinary hardware requirements.
High performance: up to 10 times faster than vision-based tools.
Small footprint: package size is around 15 to 20 MB - not dozens of gigabytes.
Fast installation: a matter of seconds - not half an hour.

JorjMcKie · 2025-11-17T11:28:46Z

JorjMcKie
Nov 17, 2025
Maintainer Author

Here is an example of a Jupyter notebook execution dynamically using package tqdm:
