[WIP] glutton_fatcat consolidation support #508
Draft
bnewbold wants to merge 1775 commits into grobidOrg:master
update lexicon
Use numerical mapping when OCR is not activated.
Improvement in evaluation framework
Avoid duplicated body part in the abstract
Improved dehyphenisation
fatcat_ident, wikidata_qid. also pass arxiv_id around in more places for consistency.
This changes link order to:
- arxiv.org: always available/reliable
- web link in URL: could be better than DOI-based match (e.g., if website)
- fatcat.wiki: should be a superset of other OA links, and more reliable/stable
- unpaywall OA link: better than DOI, though links are not as stable over time
- doi.org: fallback
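The priority order described in this commit message can be sketched roughly as follows. This is only an illustration in Python, not the GROBID (Java) implementation: the function name `ordered_links`, its parameters, and the exact URL patterns are assumptions made for the example.

```python
from typing import List, Optional


def ordered_links(
    arxiv_id: Optional[str] = None,
    web_url: Optional[str] = None,
    fatcat_ident: Optional[str] = None,
    oa_link: Optional[str] = None,
    doi: Optional[str] = None,
) -> List[str]:
    """Return the available links, most preferred first.

    Hypothetical helper: names and URL templates are assumptions,
    chosen only to mirror the order listed in the commit message.
    """
    candidates = [
        # 1. arxiv.org: always available/reliable
        f"https://arxiv.org/abs/{arxiv_id}" if arxiv_id else None,
        # 2. web link found in the citation itself
        web_url or None,
        # 3. fatcat.wiki landing page for the matched release
        f"https://fatcat.wiki/release/{fatcat_ident}" if fatcat_ident else None,
        # 4. open-access link as reported by unpaywall
        oa_link or None,
        # 5. doi.org: fallback
        f"https://doi.org/{doi}" if doi else None,
    ]
    # Drop the sources that are missing, keeping the priority order.
    return [link for link in candidates if link]


if __name__ == "__main__":
    # Placeholder identifiers, not real records.
    print(ordered_links(arxiv_id="0000.00000", doi="10.0000/example"))
```

With this shape, the preferred link is simply the first element of the returned list, and an empty list means no link could be built.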
Unless in code review we decide to actually rename this variable for legibility.
This commit *should not* be merged into master!
This PR probably needs a bit of work, but I'm posting it here for an early review of whether I have taken the correct approach.
These patches add a new consolidation option: `glutton_fatcat`. This calls a patched version of biblio-glutton which returns metadata in the fatcat 'release' schema (instead of the Crossref API schema). The motivation is to support additional works which may not have Crossref DOIs (e.g., some JALC or Datacite DOIs not in the Crossref API bulk corpus, or things like arXiv papers with no DOIs at all). The motivation to upstream these changes here is to avoid having to maintain a separate patchset, and to also include some small improvements.
In addition to the consolidation option, these patches include better support for the `rawName` attribute, some extra setters/getters for identifiers (and support for Wikidata QIDs), and a change in how URLs are output.
This entire code path can be publicly tested at: https://grobid.qa.fatcat.wiki
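As a rough illustration of what consuming a 'release'-style record could look like, the sketch below pulls the identifiers mentioned above out of a dictionary. It assumes the record carries a top-level `ident` and an `ext_ids` object with `doi`, `arxiv`, and `wikidata_qid` keys, as in the public fatcat release schema. The exact payload returned by the patched biblio-glutton is not shown in this PR, and `extract_identifiers` is a hypothetical helper written in Python, not GROBID code.

```python
from typing import Any, Dict, Optional


def extract_identifiers(release: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Collect the identifiers of interest from a release-style record.

    Assumed layout: {"ident": ..., "ext_ids": {"doi": ..., "arxiv": ...,
    "wikidata_qid": ...}}. Missing fields come back as None, which covers
    works that have no DOI at all.
    """
    ext_ids = release.get("ext_ids") or {}
    return {
        "fatcat_ident": release.get("ident"),
        "doi": ext_ids.get("doi"),
        "arxiv_id": ext_ids.get("arxiv"),
        "wikidata_qid": ext_ids.get("wikidata_qid"),
    }


if __name__ == "__main__":
    # Placeholder record for an arXiv-only work; the values are made up.
    sample = {"ident": "example-ident", "ext_ids": {"arxiv": "0000.00000"}}
    print(extract_identifiers(sample))
```

Returning `None` for absent identifiers, rather than raising an error, reflects the stated motivation: a work without a Crossref DOI should still be usable for consolidation.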
The `glutton_fatcat` biblio-glutton code can be browsed at https://github.com/bnewbold/biblio-glutton, on the branch 'fatcat'. See the long comment thread in that repo: kermitt2/biblio-glutton#33. My changes to biblio-glutton will probably be harder to merge upstream, but I'm happy to try. I don't think the API (between GROBID/biblio-glutton) would need to change significantly, so these GROBID-side patches should work fine.