Bug: PDF Parser Skipping Entire Page During Extraction and bibliographic info error #1326

@POTUSAITEJA

Grobid version

0.8.2

Operating System and architecture (arm64, amd64, x86, etc.)

Windows (Docker)

What is your Java version

No response

Log and information

While parsing a multi-page PDF, the parser unexpectedly skips a complete page (e.g., page 1 of 3). The extracted content jumps directly to the next available page and ignores the skipped one entirely. No error or warning is raised; the failure is silent. I am seeing this with the following PDF: Effect of Normalizing Temperature on Microstructure, Texture and Magnetic Properties of Non-Oriented Silicon Steel (DOI 10.3390/met15020217). In the parsed content below, all of page 2 of the PDF is ignored, and the content of page 3 is appended directly to that of page 1.

<div xmlns="http://www.tei-c.org/ns/1.0">
<head n="1." xml:id="_CjjNgAu">Introduction</head>
<p xml:id="_RGzqaS3">
Non-oriented silicon steel, as an important soft magnetic material with an annual production of tens of millions of tons, is widely used in the cores of large motors and generators due to its excellent electromagnetic properties and extremely high commercial value. It is extensively applied in the fields of electricity, electronics, and the military industry
<ref type="bibr" coords="1,207.66,721.39,10.72,9.58" target="#b0">[1]</ref>
. The rapid development of new energy vehicles has witnessed the gradual replacement of traditional fuels by low-carbon, green, and clean energy. A drive motor is a device that converts electrical energy into kinetic energy; its key components are the stator and rotor cores that are made of non-oriented silicon steels
<ref type="bibr" coords="1,433.10,763.18,10.90,9.58" target="#b1">[2,</ref>
<ref type="bibr" coords="1,444.01,763.18,7.27,9.58" target="#b2">3]</ref>
. In order to ensure that for the optimization of heat treatment processes and the improvement in the magnetic properties of non-oriented silicon steels.
</p>
</div>
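
To check for this kind of gap automatically, one rough option is to collect the page numbers that appear in the `coords` attributes of the TEI body and report any page in between that never shows up. The sketch below assumes the TEI was produced with coordinates enabled, and that each `coords` value is a `;`-separated list of boxes starting with the page number, as in the output above. The function names are my own, not part of GROBID, and a page with no coordinates is only a hint (it could also be a page without any annotated element).

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def pages_with_coords(tei_xml):
    """Return the set of page numbers found in `coords` attributes of the TEI body."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI_NS}body")
    # Fall back to the whole document if no <body> element is present.
    scope = body if body is not None else root
    pages = set()
    for elem in scope.iter():
        coords = elem.get("coords")
        if not coords:
            continue
        # Each box looks like "page,x,y,w,h"; several boxes are joined with ';'.
        for box in coords.split(";"):
            page = box.split(",")[0].strip()
            if page.isdigit():
                pages.add(int(page))
    return pages


def missing_pages(pages):
    """Return pages between the first and last seen page that have no coordinates."""
    if not pages:
        return []
    return [p for p in range(min(pages), max(pages) + 1) if p not in pages]
```

Calling `missing_pages(pages_with_coords(tei_text))` on each output file gives a list of candidate pages to inspect by hand.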

Further information

There are also plenty of errors in the detailed bibliographic information produced by Grobid. For example, in the same PDF, look at reference 21. This is the output given by Grobid:

<biblStruct coords="12,43.19,756.30,516.08,8.63;12,57.23,768.95,219.70,8.74" xml:id="b20">
<analytic>
<title level="a" type="main" coords="12,202.17,756.30,357.10,8.63;12,57.23,769.07,54.89,8.63" xml:id="_WqUvK6r">Influence Mechanisms of Cold Rolling Reduction Rate on Microstructure, Texture and Magnetic Properties of Non-Oriented Silicon Steel</title>
<author>
<persName>
<forename type="first">Feihu</forename>
<surname>Guo</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">Yuhao</forename>
<surname>Niu</surname>
</persName>
<idno type="ORCID">0009-0004-0456-5452</idno>
</author>
<author>
<persName>
<forename type="first">Bing</forename>
<surname>Fu</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">Jialong</forename>
<surname>Qiao</surname>
</persName>
<idno type="ORCID">0000-0003-3917-0982</idno>
</author>
<author>
<persName>
<forename type="first">Shengtao</forename>
<surname>Qiu</surname>
</persName>
</author>
<idno type="DOI">10.3390/cryst14100853</idno>
</analytic>
<monogr>
<title level="j" xml:id="_6mpxhhd" coords="12,118.27,768.95,86.40,8.55">Crystals</title>
<title level="j" type="abbrev">Crystals</title>
<idno type="ISSNe">2073-4352</idno>
<imprint>
<biblScope unit="volume">14</biblScope>
<biblScope unit="issue">10</biblScope>
<biblScope unit="page">853</biblScope>
<date type="published" when="2024-09-29">2024</date>
<publisher>MDPI AG</publisher>
</imprint>
</monogr>
<note type="raw_reference">Guo, F.H.; Shi, P.Z.; Li, Z.C.; Qiu, S.T. Influence mechanism of solidification structure of silicon steel castings on hot rolling texture of silicon steel. J. Anhui Univ. Technol. 2024, 41, 441-449.</note>
</biblStruct>

The paper titled "Influence Mechanisms of Cold Rolling Reduction Rate on Microstructure, Texture and Magnetic Properties of Non-Oriented Silicon Steel" is not cited anywhere in the source document, and the raw reference (the raw_reference note above) describes a different paper. Nevertheless, GROBID included it as reference 21 in the output produced with consolidate_citations enabled.

This is misleading, and we’ve observed multiple instances of similar false positives. Is there a known fix or recommended way to improve citation accuracy in such cases?
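
As a stopgap on my side, the mismatch can at least be flagged after the fact by comparing each consolidated title with its own raw reference string. The sketch below is only a heuristic of mine, not a GROBID feature: it assumes the TEI was generated with include_raw_citations=True, and the helper names and the 0.8 threshold are arbitrary choices. It computes the share of title words that also occur in the raw reference and reports entries where that share is low.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def _tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_token_overlap(title, raw_reference):
    """Fraction of distinct title tokens that also occur in the raw reference."""
    title_tokens = set(_tokens(title))
    if not title_tokens:
        return 1.0  # nothing to compare, so do not flag
    raw_tokens = set(_tokens(raw_reference))
    return len(title_tokens & raw_tokens) / len(title_tokens)


def suspicious_references(tei_xml, threshold=0.8):
    """List (xml:id, overlap) for references whose title barely matches the raw string."""
    root = ET.fromstring(tei_xml)
    flagged = []
    for bibl in root.iter(f"{TEI_NS}biblStruct"):
        title = bibl.find(f".//{TEI_NS}title[@type='main']")
        raw = bibl.find(f".//{TEI_NS}note[@type='raw_reference']")
        if title is None or raw is None:
            continue  # cannot compare without both fields
        score = title_token_overlap("".join(title.itertext()), "".join(raw.itertext()))
        if score < threshold:
            flagged.append((bibl.get(XML_ID), round(score, 2)))
    return flagged
```

For the entry shown above, fewer than half of the title words appear in the raw reference, so it would be reported under `b20`. This only detects the problem; it does not explain why the consolidation picked the wrong record.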

This is the code I am running:

import time

from grobid_client.grobid_client import GrobidClient

path_pdfs = "data"            # input directory containing the PDFs
path_TEI_xml = "Paragraph/"   # output directory for the TEI XML files

client = GrobidClient(config_path="config.json")
start_time = time.time()
client.process(
    service="processFulltextDocument",
    input_path=path_pdfs,
    output=path_TEI_xml,
    n=10,                           # number of concurrent requests
    generateIDs=True,
    consolidate_header=True,
    consolidate_citations=True,     # the false-positive references appear with this enabled
    include_raw_citations=True,     # keeps the raw_reference note in each biblStruct
    include_raw_affiliations=True,
    tei_coordinates=True,
    segment_sentences=False,
    force=True,                     # reprocess PDFs even if an output file already exists
    verbose=True,
)
runtime = round(time.time() - start_time, 3)
print(f"runtime: {runtime} seconds")

And this is my config file:

{
    "grobid_server": "http://gpuserv:8070/",
    "batch_size": 1,
    "sleep_time": 10,
    "timeout": 1000,
    "coordinates": [ "persName", "figure", "ref", "biblStruct", "formula", "s", "note", "title"]
}
