Missing or Possibly Incorrect Parsing of PDF #1345

@Keegan-Vaz

Grobid version

grobid/grobid:0.8.2-full - Docker - Deep Learning Model

Operating System and architecture (arm64, amd64, x86, etc.)

No response

What is your Java version

No response

Log and information

No response

Further information

Hello team,

We’ve noticed that GROBID is not parsing email addresses from some PDFs, even though the emails are clearly visible in the document. I’ve attached one such example where the email is right on the first page, but it isn’t captured in the extracted XML.

Additionally, for one of the PDFs, the author names were parsed as Chinese characters, even though they appear in English (Latin script) in the original file. The corresponding PDF and the generated XML are both attached for reference.

Please let me know if you need any additional details or sample files to investigate this further.

Thanks!

file.pdf
tei.xml
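
For anyone reproducing the missing-email case, here is a minimal sketch for checking whether a TEI output contains any `<email>` elements in its header. It assumes the usual TEI namespace and that emails, when extracted, appear as `<email>` elements somewhere under `<teiHeader>`. The function name and the inline sample document are hypothetical and are not taken from the attached files.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def header_emails(root: ET.Element) -> list[str]:
    """Return the text of every <email> element found under <teiHeader>."""
    header = root.find("tei:teiHeader", TEI_NS)
    if header is None:
        return []
    return [
        e.text.strip()
        for e in header.iterfind(".//tei:email", TEI_NS)
        if e.text and e.text.strip()
    ]

# Hypothetical sample, only to illustrate the assumed structure.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <persName><forename type="first">Jane</forename><surname>Doe</surname></persName>
      <email>jane.doe@example.org</email>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>
"""

sample_emails = header_emails(ET.fromstring(SAMPLE_TEI))
print(sample_emails)
```

To run the same check on a real output file, load it with `ET.parse("tei.xml").getroot()` and pass the result to `header_emails`. An empty list would confirm that no email made it into the header.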

The files below are the example in which the authors were parsed as Chinese characters:
parsed_with_chinese.xml
English_PDF.pdf
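
To spot this second problem across many outputs, one option is to flag header author names that contain CJK characters. The sketch below assumes author names sit in `<persName>` elements under `<author>` inside `<teiHeader>`. The helper names and the inline sample are hypothetical, and the CJK test relies on Unicode character names, which is a heuristic rather than a full script detector.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Standard TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def header_author_names(root: ET.Element) -> list[str]:
    """Collect author names from <teiHeader>, joining the name parts with spaces."""
    header = root.find("tei:teiHeader", TEI_NS)
    if header is None:
        return []
    names = []
    for pers in header.iterfind(".//tei:author/tei:persName", TEI_NS):
        parts = [t.strip() for t in pers.itertext() if t.strip()]
        if parts:
            names.append(" ".join(parts))
    return names

def has_cjk(text: str) -> bool:
    """Heuristic: true if any character's Unicode name mentions CJK."""
    return any("CJK" in unicodedata.name(ch, "") for ch in text)

# Hypothetical sample, only to illustrate the assumed structure.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author><persName><forename type="first">王</forename><surname>伟</surname></persName></author>
    <author><persName><forename type="first">Jane</forename><surname>Doe</surname></persName></author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>
"""

all_names = header_author_names(ET.fromstring(SAMPLE_TEI))
flagged = [name for name in all_names if has_cjk(name)]
print(flagged)
```

Restricting the search to `<teiHeader>` keeps authors of cited references out of the result, since those also carry `<author>` elements elsewhere in the document.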
