USE 364 - mitlibwebsite targeted fulltext extraction #268

ghukill · 2026-01-29T15:36:31Z

Purpose and background context

This PR updates how we extract full-text for the mitlibwebsite source. After some analysis by DiscoEng, we identified HTML selectors that target the container elements holding text relevant to each page, excluding content that repeats on every page, such as headers and footers.

NOTE: much of the file churn comes from updated dependencies and updated linting, which are encapsulated in a single commit. The meaningful changes can be found in this commit.

How can a reviewer manually see the effects of these changes?

Please see this USE-365 Jira ticket comment that links to a spreadsheet analyzing the results of the fulltext field after these changes were implemented.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: reduction of repeating and unhelpful text in the mitlibwebsite will improve search relevancy and reduce noise in the USE interface.

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-364

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

It was decided that the full-text getting extracted from mitlibwebsite full HTML was too broad. We were collecting header and footer data that was not unique to the record/URL at hand.

How this addresses that need:

After some analysis by DiscoEng, some URL + element selector patterns were identified to target meaningful container elements. This has dramatically reduced the amount of full-text while increasing the quality at the same time.

Side effects of this change:
* mitlibwebsite TIMDEX records have higher quality fulltext field values

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-364
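For readers unfamiliar with the approach, here is a minimal sketch of how "URL + element selector patterns" could drive extraction. It is not the code from this PR: the `RULES` table, the `example.org` URL pattern, the `guide-body` id, and the `extract_fulltext` helper are all hypothetical names chosen for illustration. Only the `content-main` class comes from the diff.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical rules, tried in order: (URL regex, element name, attrs).
# The URL pattern and the "guide-body" id are assumptions for this sketch.
RULES = [
    (re.compile(r"^https://example\.org/guides/"), "div", {"id": "guide-body"}),
    (re.compile(r".*"), True, {"class": "content-main"}),  # True = any element
]


def extract_fulltext(url, html):
    """Return text from the first matching container, or None if none match."""
    soup = BeautifulSoup(html, "html.parser")
    for pattern, name, attrs in RULES:
        if not pattern.search(url):
            continue  # rule does not apply to this URL
        container = soup.find(name, attrs)
        if container is not None:
            # Join the container's text nodes with single spaces.
            return container.get_text(separator=" ", strip=True)
    return None


PAGE = (
    "<header>Menu</header>"
    '<div id="guide-body"><h1>Guide</h1><p>Body text.</p></div>'
    "<footer>Footer</footer>"
)
```

With this sketch, `extract_fulltext("https://example.org/guides/abc", PAGE)` returns only the container text, leaving out the header and footer. A URL that matches no rule with a present container yields `None`, so the caller can decide whether to fall back or skip the field.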

ghukill · 2026-01-29T16:11:35Z

transmogrifier/sources/json/mitlibwebsite.py

+            (True, {"class": "content-main"}),  # True = wildcard element
+            (True, {"class": "main-content"}),  # True = wildcard element


This was new syntax to me in BeautifulSoup4 (BS4): you can use True to wildcard match any element.
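A small standalone illustration of the wildcard behavior, for anyone else who has not seen it. The HTML snippet is made up for this example. The class names match the ones in the diff above.

```python
from bs4 import BeautifulSoup

# Made-up markup: the same kind of class on two different element types.
html = """
<div class="content-main">in a div</div>
<section class="main-content">in a section</section>
"""
soup = BeautifulSoup(html, "html.parser")

# Passing True as the name matches any tag, so only the attrs filter applies.
any_content_main = soup.find(True, {"class": "content-main"})
any_main_content = soup.find(True, {"class": "main-content"})

# A specific tag name narrows the match; no <div> has class "main-content".
div_main_content = soup.find("div", {"class": "main-content"})
```

Here `any_content_main` is the `<div>`, `any_main_content` is the `<section>`, and `div_main_content` is `None`. That is why the wildcard is handy when the class is stable across pages but the element type is not.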

ehanson8

Looks good, and a sensible change, since there is clearly a lot of non-useful content in the unrefined full text.

ehanson8 · 2026-01-30T19:14:31Z

tests/fixtures/mitlibwebsite/fulltext_libguides.html

Very helpfully formatted fixture!

ehanson8 · 2026-01-30T19:16:47Z

transmogrifier/sources/json/mitlibwebsite.py

+        Using the full-text from the entire page will include far too much content that
+        is not unique or relevant to the page at hand, including repeating header and
+        footer data.  Our approach may evolve over time, but this method aims to extract
+        only meaningful full-text from each record based on some simple rules and specific
+        container elements to look for.


Great context

ghukill added 2 commits January 29, 2026 10:07

Updated dependencies and black linting updates

010ddea

ghukill marked this pull request as ready for review January 29, 2026 16:11

ghukill requested a review from a team January 29, 2026 16:11

Add CODEOWNERS file

dd47d72

ehanson8 approved these changes Jan 30, 2026

ghukill merged commit 1addea6 into main Jan 30, 2026
3 checks passed
