USE 364 - mitlibwebsite targeted fulltext extraction #268
Why these changes are being introduced:
It was decided that the full-text getting extracted from mitlibwebsite full HTML was too broad. We were collecting header and footer data that was not unique to the record/URL at hand.
How this addresses that need:
After some analysis by DiscoEng, some URL + element selector patterns were identified to target meaningful container elements. This has dramatically reduced the amount of full-text while increasing the quality at the same time.
Side effects of this change:
* mitlibwebsite TIMDEX records have higher quality fulltext field values
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-364
(True, {"class": "content-main"}),  # True = wildcard element
(True, {"class": "main-content"}),  # True = wildcard element
This was new syntax to me in BeautifulSoup4 (BS4): you can use True to wildcard match any element.
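For readers who have not seen this before, here is a minimal sketch of the wildcard behavior. The HTML snippet is made up for illustration and is not taken from the mitlibwebsite; only the `content-main` class name comes from the diff above.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the container is a <section>, but it could be any tag.
html = """
<html><body>
  <header>Site header</header>
  <section class="content-main"><p>Page-specific text.</p></section>
  <footer>Site footer</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Passing True as the tag name matches any element; the attrs dict then
# narrows the match to elements carrying the given class.
container = soup.find(True, {"class": "content-main"})

print(container.name)                  # section
print(container.get_text(strip=True))  # Page-specific text.
```

Because the tag name is not fixed, the same selector would still match if the container were a `div` or `main` instead of a `section`.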
ehanson8
left a comment
Looks good and a sensible change, since clearly there is a lot of non-useful content in the unrefined full text
Very helpfully formatted fixture!
Using the full-text from the entire page will include far too much content that
is not unique or relevant to the page at hand, including repeating header and
footer data. Our approach may evolve over time, but this method aims to extract
only meaningful full-text from each record based on some simple rules and specific
container elements to look for.
Great context
Purpose and background context
This PR updates how we extract full-text for the mitlibwebsite. After some analysis from DiscoEng, some logic was identified for HTML selectors we could use to grab container elements that contained text relevant to the website, excluding content that repeats for all pages like headers and footers.
NOTE: much of the file churn was updated dependencies and updated linting. This is encapsulated in a single commit. The meaningful changes can be found in this commit.
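As a rough illustration of the URL + selector idea described in this PR, the sketch below tries an ordered list of rules and returns text from the first container found. It is not the code in this PR: the `RULES` table, the URL patterns, and the `extract_fulltext` name are all hypothetical; only the two class names appear in the diff.

```python
import re
from typing import Optional

from bs4 import BeautifulSoup

# Hypothetical rules: (URL regex, ordered (tag name, attrs) selectors).
# The URL patterns are invented for this example.
RULES = [
    (re.compile(r"/news/"), [(True, {"class": "content-main"})]),
    (re.compile(r".*"), [(True, {"class": "main-content"}), ("main", {})]),
]


def extract_fulltext(url: str, html: str) -> Optional[str]:
    """Return text from the first matching container, or None if none match."""
    soup = BeautifulSoup(html, "html.parser")
    for pattern, selectors in RULES:
        # Skip rules whose URL pattern does not apply to this record.
        if not pattern.search(url):
            continue
        # Within an applicable rule, the first selector that finds an
        # element wins; otherwise fall through to the next rule.
        for name, attrs in selectors:
            container = soup.find(name, attrs)
            if container is not None:
                return container.get_text(separator=" ", strip=True)
    return None
```

Returning `None` when no container matches, rather than falling back to the whole page, is an assumption of this sketch; it keeps header and footer text out of the `fulltext` field at the cost of some records having no full-text at all.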
How can a reviewer manually see the effects of these changes?
Please see this USE-365 Jira ticket comment that links to a spreadsheet analyzing the results of the fulltext field after these changes were implemented.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: reduction of repeating and unhelpful text in the mitlibwebsite will improve search relevancy and reduce noise in the USE interface.
What are the relevant tickets?
Code review