Tool to extract and convert PDF document data into XML metadata that can be restructured using XSL templates.
Use this tool to extract unstructured text data from standard IEEE formatted conference papers (PDF files) that can be filtered and exported to different formats, such as BITS or WordPress XML data. Metadata extraction from PDF documents is stored in intermediate JSON and XML files that can be exported to defined schemas using XSL templates. Extraction requires index files that map raw text scraped from the PDF files to known metadata.
In addition to the PDF documents to be extracted, three different input index files are required to use this tool. Example index files can be found in the input folder.
paths.json: Provides paths to the PDF articles, the output paths, as well asbase.jsonandarticles.csv. Use the template provided in the source root.base.json: Base Metadata (JSON) to provide general information about the conference proceeding and sessions.articles.csv: Articles Metadata (CSV) to provide an index of articles, authors and affiliations in the proceeding.
Using the Tika library, the extractor module (extractor.py) extracts text from the PDF documents and maps them
using the index data to any useful data extracted from the PDF files and writes the data to intermediate JSON files.
All input and output paths are defined in the paths.json index file.
The following folders are generated for saved output. The output paths are defined in paths.json configuration file.
- Extraction
articles: JSON metadata files generated for each PDF article.xml: XML metadata files generated for each PDF article.logs: Error logs for issues found during extraction.patches: Overwrite the generated metadata by making updated copies of JSON metadata files here.txt: The raw text generated from the PDF extraction.
python main.py [FILEPATH paths.json] -extract
Raw extracted data can be edited and updated by running the update tool. Edit the raw text directly and run the updater to regenerate the JSON metadata files. Alternatively, edit the
python main.py [FILEPATH paths.json] -update
Data in intermediate metadata files generated through extraction can be exported to XML using the builder module (build.py). This module has three defined schema options.
The following schemas are available, however this package can be extended to include other schemas using an XSL stylesheet and a valid XSD schema definition for validation.
- [BITS - Book Interchange Tag Set: JATS Extension] (https://jats.nlm.nih.gov/extensions/bits/) Generates XML metadata for the ACM proceeding Digital Library using BITS schema.. The intent of the BITS is to provide a common format in which publishers and archives can exchange book content, including book parts such as chapters. The Suite provides a set of XML schema modules that define elements and attributes for describing the textual and graphical content of books and book components as well as a package for book part interchange.
- [DataCite - Metadata for DOIs] (https://schema.datacite.org/) Generates DataCite metadata XML files for each article. The DataCite Metadata Schema is a list of core metadata properties chosen for an accurate and consistent identification of a resource for citation and retrieval purposes, along with recommended use instructions.
- [WordPress - Importer XML Schema] (https://wordpress.org/support/article/importing-content/) Generates WordPress import XML from extracted data. Using the WordPress Import tool, you can import content into your site using this schema option.
python main.py <path to paths.json> -build -bits
python main.py <path to paths.json> -build -datacite
python main.py <path to paths.json> -build -wordpress