Missing or Possibly Incorrect Parsing of PDF #1345

### Grobid version

`grobid/grobid:0.8.2-full` (Docker image, deep learning models)

### Operating System and architecture (arm64, amd64, x86, etc.)

_No response_

### What is your Java version

_No response_

### Log and information

_No response_

### Further information

Hello team,

We've noticed that GROBID is not extracting email addresses from some PDFs, even though the addresses are clearly visible in the document. In the first attached example, the email address appears on the first page but is missing from the extracted TEI XML.

Additionally, for one of the PDFs, the author's name was parsed as Chinese characters, even though it appears in English (Latin script) in the original file. The corresponding PDF and the generated XML are both attached for reference.

Please let me know if you need any additional details or sample files to investigate this further.

Thanks!

[file.pdf](https://github.com/user-attachments/files/23092493/file.pdf)
[tei.xml](https://github.com/user-attachments/files/23092492/tei.xml)

In the example below, the author names were parsed as Chinese characters:
[parsed_with_chinese.xml](https://github.com/user-attachments/files/23092546/parsed_with_chinese.xml)
[English_PDF.pdf](https://github.com/user-attachments/files/23092547/English_PDF.pdf)
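
To help reproduce both symptoms, here is a minimal sketch of how the TEI output can be checked for missing emails and non-Latin author names. It assumes the output uses the standard TEI namespace with `<author>`, `<persName>`, and `<email>` elements; the helper names `header_emails` and `author_names` and the inline `SAMPLE` document are made up for illustration and are not taken from the attached files.

```python
import xml.etree.ElementTree as ET

# Assumption: the TEI output uses the standard TEI namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample, NOT the attached tei.xml: one author with an email,
# one author whose name came out in non-Latin characters.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
<author><persName><forename type="first">Jane</forename><surname>Doe</surname></persName><email>jane.doe@example.org</email></author>
<author><persName><surname>王</surname></persName></author>
</analytic></biblStruct></sourceDesc></fileDesc></teiHeader></TEI>"""


def header_emails(tei_xml: str) -> list[str]:
    """Return the text of every non-empty <email> element in the document."""
    root = ET.fromstring(tei_xml)
    return [
        el.text.strip()
        for el in root.iter(TEI + "email")
        if el.text and el.text.strip()
    ]


def author_names(tei_xml: str) -> list[str]:
    """Return one space-joined name per <author> that has a <persName>."""
    root = ET.fromstring(tei_xml)
    names = []
    for author in root.iter(TEI + "author"):
        pers = author.find(TEI + "persName")
        if pers is None:
            continue
        parts = [t.strip() for t in pers.itertext() if t.strip()]
        names.append(" ".join(parts))
    return names


if __name__ == "__main__":
    print("emails:", header_emails(SAMPLE))
    # Names containing non-ASCII characters are candidates for the second symptom.
    print("non-ASCII names:", [n for n in author_names(SAMPLE) if not n.isascii()])
```

Running the same two helpers on the attached XML files (read with `open(path, encoding="utf-8").read()`) should show an empty email list for the first file and non-ASCII author names for the second, if the output matches what we observed.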
