Handle USFM and Text corpora separately in pre-processing by pmachapman · Pull Request #894 · sillsdev/serval

pmachapman · 2026-03-26T04:21:28Z

Fixes #893.

I will write an e2e test on Monday, but I thought I should gather feedback on this PR.

This PR fixes a regression from #882 where text and scripture corpora were being combined, resulting in a crash when comparing a MultiKeyRef to a ScriptureKeyRef. I tried initially just updating those structs to compare to each other, but although that fixed the crash, the text corpora rows disappeared as they weren't scripture. And so I settled on iterating over them separately.

@Enkidu93 Please let me know if I have (or have not) gone about the right way of fixing this bug.

This change is

codecov-commenter · 2026-03-26T04:23:40Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.98%. Comparing base (c57e306) to head (ba7dca8).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
+ Coverage   67.94%   67.98%   +0.03%     
==========================================
  Files         386      386              
  Lines       21205    21224      +19     
  Branches     2736     2740       +4     
==========================================
+ Hits        14408    14429      +21     
+ Misses       5812     5810       -2     
  Partials      985      985

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Enkidu93

OK! Yeah, we should add a mixed (text+Paratext) test wherever we already have Paratext/text tests and no such mixed test - including extending the Nmt_Paratext() test.

I think this is the right way to solve it. The other option, like you say, would be to make the refs comparable, but you'd have to adjust the scripture filtering to preserve rows that are either non-scripture-refs or scripture refs that refer to verses - i.e., something like

        return parallelTextRow.Ref is not ScriptureRef sr || sr.IsVerse;

and filter all training corpora with that, not just the scripture corpora. I think that could work as well. In some ways, I think that would be cleaner, but I'm sure Damien has an opinion.

@Enkidu93 reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on ddaspit).

ddaspit

I'm not sure what the broader implications would be of making different kinds of refs comparable, so I'm okay with this solution.

@ddaspit reviewed 1 file and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):

                sourceInferencingCorpus,
                targetInferencingCorpus,
                targetCorpus!,

I don't think targetCorpus can be null, unless I'm missing something.

pmachapman

@pmachapman made 1 comment.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on ddaspit and Enkidu93).

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think targetCorpus can be null, unless I'm missing something.

Done. Thank you! (This was a hangover from an earlier attempt I made to fix the bug.)

ddaspit

@ddaspit reviewed 1 file and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

pmachapman · 2026-03-30T01:30:34Z

@ddaspit @Enkidu93 I have added an e2e test that will test USFM with additional text files. I tested it on main, and it failed with the error from Serval QA (as it should without this PR's changes).

ddaspit

@ddaspit reviewed 4 files and all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

Enkidu93

Great, thanks! Just one comment.

@Enkidu93 reviewed 5 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).

src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):

    [TestCase(false)]
    [TestCase(true)]
    public async Task Nmt_Paratext(bool withAdditionalFiles)

How much time is it adding to double this test? Could we just always add the additional files? Do you see a benefit to testing both separately in an E2E test?

pmachapman

@pmachapman made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93).

src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

How much time is it adding to double this test? Could we just always add the additional files? Do you see a benefit to testing both separately in an E2E test?

It adds about 2-3 minutes to the e2e test run time.

The main benefit is that the two tests are both different user scenarios/configurations. That was my main reason for writing a separate test, rather than replacing the existing test.

Enkidu93

@Enkidu93 made 1 comment and resolved 1 discussion.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

It adds about 2-3 minutes to the e2e test run time.

The main benefit is that the two tests are both different user scenarios/configurations. That was my main reason for writing a separate test, rather than replacing the existing test.

OK, that makes sense. I do wonder if there are any errors that can be exposed by the first test that wouldn't also be exposed by the second 🤔. But I'm OK with leaving both 👍.

pmachapman requested review from Enkidu93 and ddaspit March 26, 2026 04:21

pmachapman marked this pull request as ready for review March 26, 2026 04:26

Enkidu93 approved these changes Mar 26, 2026

View reviewed changes

ddaspit requested changes Mar 26, 2026

View reviewed changes

Handle USFM and Text corpora separately in pre-processing

cab499a

pmachapman force-pushed the separate-usfm-text-corpora branch from cbaea8d to cab499a Compare March 26, 2026 19:13

pmachapman commented Mar 26, 2026

View reviewed changes

ddaspit approved these changes Mar 27, 2026

View reviewed changes

Added e2e test for additional text files

ba7dca8

ddaspit approved these changes Mar 30, 2026

View reviewed changes

Enkidu93 requested changes Apr 1, 2026

View reviewed changes

pmachapman commented Apr 1, 2026

View reviewed changes

Enkidu93 approved these changes Apr 1, 2026

View reviewed changes

pmachapman merged commit 7cf7238 into main Apr 1, 2026
2 checks passed

pmachapman deleted the separate-usfm-text-corpora branch April 1, 2026 18:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Handle USFM and Text corpora separately in pre-processing#894

Handle USFM and Text corpora separately in pre-processing#894
pmachapman merged 2 commits intomainfrom
separate-usfm-text-corpora

pmachapman commented Mar 26, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Mar 26, 2026 •

edited

Loading

Uh oh!

Enkidu93 left a comment •

edited

Loading

Uh oh!

ddaspit left a comment

Uh oh!

pmachapman left a comment

Uh oh!

ddaspit left a comment

Uh oh!

pmachapman commented Mar 30, 2026

Uh oh!

ddaspit left a comment

Uh oh!

Enkidu93 left a comment

Uh oh!

pmachapman left a comment

Uh oh!

Enkidu93 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

pmachapman commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Enkidu93 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman commented Mar 30, 2026

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pmachapman commented Mar 26, 2026 •

edited

Loading

codecov-commenter commented Mar 26, 2026 •

edited

Loading

Enkidu93 left a comment •

edited

Loading