Handle USFM and Text corpora separately in pre-processing#894
Handle USFM and Text corpora separately in pre-processing#894pmachapman merged 2 commits intomainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #894 +/- ##
==========================================
+ Coverage 67.94% 67.98% +0.03%
==========================================
Files 386 386
Lines 21205 21224 +19
Branches 2736 2740 +4
==========================================
+ Hits 14408 14429 +21
+ Misses 5812 5810 -2
Partials 985 985 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
OK! Yeah, we should add a mixed (text+Paratext) test wherever we already have Paratext/text tests and no such mixed test - including extending the Nmt_Paratext() test.
I think this is the right way to solve it. The other option, like you say, would be to make the refs comparable, but you'd have to adjust the scripture filtering to preserve rows that are either non-scripture-refs or scripture refs that refer to verses - i.e., something like
return parallelTextRow.Ref is not ScriptureRef sr || sr.IsVerse;
and filter all training corpora with that, not just the scripture corpora. I think that could work as well. In some ways, I think that would be cleaner, but I'm sure Damien has an opinion.
@Enkidu93 reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on ddaspit).
ddaspit
left a comment
There was a problem hiding this comment.
I'm not sure what the broader implications would be of making different kinds of refs comparable, so I'm okay with this solution.
@ddaspit reviewed 1 file and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):
sourceInferencingCorpus, targetInferencingCorpus, targetCorpus!,
I don't think targetCorpus can be null, unless I'm missing something.
cbaea8d to
cab499a
Compare
pmachapman
left a comment
There was a problem hiding this comment.
@pmachapman made 1 comment.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on ddaspit and Enkidu93).
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I don't think
targetCorpuscan be null, unless I'm missing something.
Done. Thank you! (This was a hangover from an earlier attempt I made to fix the bug.)
ddaspit
left a comment
There was a problem hiding this comment.
@ddaspit reviewed 1 file and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on pmachapman).
ddaspit
left a comment
There was a problem hiding this comment.
@ddaspit reviewed 4 files and all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on pmachapman).
Enkidu93
left a comment
There was a problem hiding this comment.
Great, thanks! Just one comment.
@Enkidu93 reviewed 5 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).
src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):
[TestCase(false)] [TestCase(true)] public async Task Nmt_Paratext(bool withAdditionalFiles)
How much time is it adding to double this test? Could we just always add the additional files? Do you see a benefit to testing both separately in an E2E test?
pmachapman
left a comment
There was a problem hiding this comment.
@pmachapman made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93).
src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
How much time is it adding to double this test? Could we just always add the additional files? Do you see a benefit to testing both separately in an E2E test?
It adds about 2-3 minutes to the e2e test run time.
The main benefit is that the two tests are both different user scenarios/configurations. That was my main reason for writing a separate test, rather than replacing the existing test.
Enkidu93
left a comment
There was a problem hiding this comment.
@Enkidu93 made 1 comment and resolved 1 discussion.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on pmachapman).
src/Serval/test/Serval.E2ETests/ServalApiTests.cs line 219 at r3 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
It adds about 2-3 minutes to the e2e test run time.
The main benefit is that the two tests are both different user scenarios/configurations. That was my main reason for writing a separate test, rather than replacing the existing test.
OK, that makes sense. I do wonder if there are any errors that can be exposed by the first test that wouldn't also be exposed by the second 🤔. But I'm OK with leaving both 👍.
Fixes #893.
I will write an e2e test on Monday, but I thought I should gather feedback on this PR.
This PR fixes a regression from #882 where text and scripture corpora were being combined, resulting in a crash when comparing a MultiKeyRef to a ScriptureKeyRef. I tried initially just updating those structs to compare to each other, but although that fixed the crash, the text corpora rows disappeared as they weren't scripture. And so I settled on iterating over them separately.
@Enkidu93 Please let me know if I have (or have not) gone about the right way of fixing this bug.
This change is