Skip to content

Handle USFM and Text corpora separately in pre-processing#894

Merged
pmachapman merged 2 commits intomainfrom
separate-usfm-text-corpora
Apr 1, 2026
Merged

Handle USFM and Text corpora separately in pre-processing#894
pmachapman merged 2 commits intomainfrom
separate-usfm-text-corpora

Conversation

@pmachapman
Copy link
Copy Markdown
Collaborator

@pmachapman pmachapman commented Mar 26, 2026

Fixes #893.

I will write an e2e test on Monday, but I thought I should gather feedback on this PR.

This PR fixes a regression from #882 where text and scripture corpora were being combined, resulting in a crash when comparing a MultiKeyRef to a ScriptureKeyRef. I tried initially just updating those structs to compare to each other, but although that fixed the crash, the text corpora rows disappeared as they weren't scripture. And so I settled on iterating over them separately.

@Enkidu93 Please let me know if I have (or have not) gone about the right way of fixing this bug.


This change is Reviewable

@pmachapman pmachapman requested review from Enkidu93 and ddaspit March 26, 2026 04:21
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.98%. Comparing base (c57e306) to head (ba7dca8).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
+ Coverage   67.94%   67.98%   +0.03%     
==========================================
  Files         386      386              
  Lines       21205    21224      +19     
  Branches     2736     2740       +4     
==========================================
+ Hits        14408    14429      +21     
+ Misses       5812     5810       -2     
  Partials      985      985              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@pmachapman pmachapman marked this pull request as ready for review March 26, 2026 04:26
Copy link
Copy Markdown
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! Yeah, we should add a mixed (text+Paratext) test wherever we already have Paratext/text tests and no such mixed test - including extending the Nmt_Paratext() test.

I think this is the right way to solve it. The other option, like you say, would be to make the refs comparable, but you'd have to adjust the scripture filtering to preserve rows that are either non-scripture-refs or scripture refs that refer to verses - i.e., something like

        return parallelTextRow.Ref is not ScriptureRef sr || sr.IsVerse;

and filter all training corpora with that, not just the scripture corpora. I think that could work as well. In some ways, I think that would be cleaner, but I'm sure Damien has an opinion.

@Enkidu93 reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on ddaspit).

Copy link
Copy Markdown
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the broader implications would be of making different kinds of refs comparable, so I'm okay with this solution.

@ddaspit reviewed 1 file and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):

                sourceInferencingCorpus,
                targetInferencingCorpus,
                targetCorpus!,

I don't think targetCorpus can be null, unless I'm missing something.

@pmachapman pmachapman force-pushed the separate-usfm-text-corpora branch from cbaea8d to cab499a Compare March 26, 2026 19:13
Copy link
Copy Markdown
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmachapman made 1 comment.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on ddaspit and Enkidu93).


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think targetCorpus can be null, unless I'm missing something.

Done. Thank you! (This was a hangover from an earlier attempt I made to fix the bug.)

Copy link
Copy Markdown
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@ddaspit reviewed 1 file and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

@pmachapman
Copy link
Copy Markdown
Collaborator Author

@ddaspit @Enkidu93 I have added an e2e test that will test USFM with additional text files. I tested it on main, and it failed with the error from Serval QA (as it should without this PR's changes).

Copy link
Copy Markdown
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit reviewed 4 files and all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

Copy link
Copy Markdown
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! Just one comment.

@Enkidu93 reviewed 5 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).


src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):

    [TestCase(false)]
    [TestCase(true)]
    public async Task Nmt_Paratext(bool withAdditionalFiles)

How much time is it adding to double this test? Could we just always add the additional files? Do you see a benefit to testing both separately in an E2E test?

Copy link
Copy Markdown
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmachapman made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93).


src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

How much time is it adding to double this test? Could we just always add the additional files? Do you see a benefit to testing both separately in an E2E test?

It adds about 2-3 minutes to the e2e test run time.

The main benefit is that the two tests are both different user scenarios/configurations. That was my main reason for writing a separate test, rather than replacing the existing test.

Copy link
Copy Markdown
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 made 1 comment and resolved 1 discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on pmachapman).


src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

It adds about 2-3 minutes to the e2e test run time.

The main benefit is that the two tests are both different user scenarios/configurations. That was my main reason for writing a separate test, rather than replacing the existing test.

OK, that makes sense. I do wonder if there are any errors that can be exposed by the first test that wouldn't also be exposed by the second 🤔. But I'm OK with leaving both 👍.

@pmachapman pmachapman merged commit 7cf7238 into main Apr 1, 2026
2 checks passed
@pmachapman pmachapman deleted the separate-usfm-text-corpora branch April 1, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crash when additional training data is used

4 participants