How to Extract Alternative Text (Alt Text) from Images in PDF? #4764

krish-tech02 · 2025-10-26T16:27:27Z

Description

I'm trying to extract the alternative text (alt text) from images embedded in PDF documents using PyMuPDF. Alt text is typically used for accessibility purposes in tagged PDFs (PDF/UA).

What I've Tried

I've attempted several approaches to extract the alt text, but none of them have worked successfully:

Approach 1: Using `xref_get_key`

import fitz

doc = fitz.open("sample.pdf")
page = doc[0]
images = page.get_images(full=True)

for img in images:
    xref = img[0]
    try:
        alt_text = doc.xref_get_key(xref, "Alt")
        print(f"Alt text: {alt_text}")
    except Exception as e:
        print(f"Error: {e}")

Approach 2: Checking ActualText

import fitz

doc = fitz.open("sample.pdf")
page = doc[0]
images = page.get_images(full=True)

for img in images:
    xref = img[0]
    try:
        # Check Alt key
        alt_key = doc.xref_get_key(xref, "Alt")
        if alt_key[0] == "string":
            print(f"Alt: {alt_key[1]}")
        
        # Check ActualText key
        actual_key = doc.xref_get_key(xref, "ActualText")
        if actual_key[0] == "string":
            print(f"ActualText: {actual_key[1]}")
    except Exception as e:
        print(f"Error: {e}")

Approach 3: Checking Structure Tree

import fitz

doc = fitz.open("sample.pdf")

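# Check if PDF has structure tree.
# pdf_catalog() returns the xref number (an int) of the catalog object,
# so its keys have to be looked up with xref_get_key().
catalog = doc.pdf_catalog()
has_struct_tree = doc.xref_get_key(catalog, "StructTreeRoot")[0] != "null"
has_mark_info = doc.xref_get_key(catalog, "MarkInfo")[0] != "null"
print(f"Has StructTreeRoot: {has_struct_tree}")
print(f"Has MarkInfo: {has_mark_info}")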

Results

None of the above approaches successfully extracted the alt text from my PDF, even though:

The PDF is tagged and accessible
The images definitely have alt text assigned

The xref_get_key method either returns None or throws exceptions when trying to access the Alt or ActualText keys.

Questions

Is there a supported way to extract alt text from images in PyMuPDF? None of my attempts have worked.
Does PyMuPDF currently support accessing alt text or the structure tree for tagged PDFs?
If this feature doesn't exist yet, are there plans to add support for accessibility metadata?
Is there any workaround or undocumented method to access this information?

Environment

PyMuPDF Version: 1.24.5
Python Version: Python 3.12.10

Sample PDF

I'm attaching a sample PDF file that contains images with alt text. The PDF is created with accessibility features (PDF/UA compliant).

Breast Care After Birth 10–02-2025 FR - Copy.pdf

Expected Behavior

I expect to be able to extract the alternative text associated with images in the PDF, similar to how it appears in Adobe Acrobat's accessibility checker or other PDF readers that support accessibility features.

Any guidance on accessing these properties through PyMuPDF would be greatly appreciated!

Thank you for maintaining this excellent library!
@JorjMcKie

JorjMcKie · 2025-10-26T17:58:12Z

Maintainer

We are planning to publish a new version soon that will support optimized layout analysis. The images in your example file would then appear as something like this:

BTW, this PDF has no StructTreeRoot, so nothing in the PDF itself identifies that text as "alt-text". It seems you did not export it from Word with the right options.

Nonetheless our layout analyzer has detected that text as "caption".
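
For reference, in a tagged PDF the alt text is normally stored as an `/Alt` entry on a structure element (for example a `/Figure`) reachable from the catalog's `/StructTreeRoot`, not on the image XObject that `get_images()` returns, which would explain why lookups on the image xref come back empty. Below is a minimal sketch of how one might check this with PyMuPDF's low-level xref calls. The helper names `has_structure_tree`, `find_alt_texts`, and `alt_texts_from_file` are hypothetical, and scanning every xref for `/Alt` is a brute-force assumption rather than a proper walk of the structure tree; it also does not map an `/Alt` entry back to a specific image.

```python
def has_structure_tree(doc):
    """Return True if the PDF catalog has a /StructTreeRoot entry."""
    catalog_xref = doc.pdf_catalog()  # xref number (int) of the catalog
    kind, _ = doc.xref_get_key(catalog_xref, "StructTreeRoot")
    return kind != "null"


def find_alt_texts(doc):
    """Return (xref, structure type, alt text) for every object with /Alt.

    Assumption: brute-force scan over all xrefs instead of walking the
    structure tree from /StructTreeRoot downwards.
    """
    results = []
    for xref in range(1, doc.xref_length()):
        try:
            kind, value = doc.xref_get_key(xref, "Alt")
        except Exception:
            # Skip objects that cannot be read as dictionaries.
            continue
        if kind != "string":
            continue
        # /S holds the structure type, e.g. "/Figure" (or "null" if absent).
        _, struct_type = doc.xref_get_key(xref, "S")
        results.append((xref, struct_type, value))
    return results


def alt_texts_from_file(path):
    """Open a PDF and return its /Alt entries, or [] if it is untagged."""
    import fitz  # PyMuPDF; imported lazily so the helpers above only need a document object

    with fitz.open(path) as doc:
        if not has_structure_tree(doc):
            return []
        return find_alt_texts(doc)
```

Calling `alt_texts_from_file("sample.pdf")` on a file without a `StructTreeRoot`, as described above for the sample, would be expected to return an empty list.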

krish-tech02 · 2025-10-27T04:46:31Z

Author

@JorjMcKie, I am looking to fetch the alt text shown in the screenshot below (this is from Microsoft Word):

JorjMcKie · 2025-10-27T09:19:17Z

Maintainer

When exporting Word to PDF: which options do you choose?

1 reply

krish-tech02 Oct 29, 2025
Author

@JorjMcKie Sorry for the late reply. I am just using the Save as PDF option from a Word document.
