Background
TrainerRunner.java currently contains a hardcoded if-else chain that maps CLI model name strings (e.g., segmentation-dh-law-footnotes) to Java trainer classes and Flavor enum values. Every time a new document-type flavor is added, this file must be modified. Similarly, the Flavor enum in GrobidModels.java must be extended.
Analysis
Hard technical reasons (code changes remain necessary)
The only genuine code-level constraint is SAX parser selection: trainer classes (SegmentationTrainer, HeaderTrainer, FulltextTrainer) select a TEI XML parser based on the Flavor enum, because different document types have structurally different TEI annotations. Adding a genuinely new document structure still requires a new parser class.
What does NOT need to be hardcoded
- Model/dataset path resolution already works with arbitrary strings: GrobidModels.modelFor(String) creates GrobidModel objects for any string, deriving paths from getFolderName(). No enum entry is needed.
- The if-else in TrainerRunner: The base model name (e.g., segmentation, header) is what determines the trainer class, and this is a small, stable set. The flavor part is what keeps growing unnecessarily.
- Parser selection in trainers: For new flavors that reuse an existing SAX parser (or fall back to the default), no code changes should be needed at all.
Aim
The aim of this issue is to allow flavor names to be arbitrary strings that identify the models or datasets to be used, independent of the codebase. Concretely:
- A user creates a new corpus directory, e.g. resources/dataset/segmentation/my-domain/corpus
- They train on it immediately by passing segmentation/my-domain as the model argument, without any Java code changes
- Only when a new document type requires a new SAX parser does any code need updating (and only in the relevant trainer class, not in TrainerRunner)
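The mapping from model argument to corpus directory described above can be sketched as follows. This is only an illustration: CorpusPathSketch and corpusDir are hypothetical names, and the resources/dataset root is taken from the example path in this issue rather than from the actual GrobidModels code, which derives its paths through getFolderName().

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of GROBID: shows how a model argument could map to a corpus folder. */
final class CorpusPathSketch {

    /** Builds resources/dataset/{modelArgument}/corpus, treating '/' as a folder separator. */
    static Path corpusDir(String modelArgument) {
        // Assumed dataset root, copied from the example in this issue.
        Path dir = Paths.get("resources", "dataset");
        for (String segment : modelArgument.split("/")) {
            dir = dir.resolve(segment);
        }
        return dir.resolve("corpus");
    }

    public static void main(String[] args) {
        // Prints the corpus folder for the example flavor, using the platform's path separator.
        System.out.println(corpusDir("segmentation/my-domain"));
    }
}
```

Because the flavor is just a path suffix, a new corpus folder is picked up without any enum entry.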
Proposed Changes
1. New CLI argument format for TrainerRunner
Support {baseModel}/{flavorLabel} where the first /-delimited segment identifies the trainer class and the remainder is the flavor path (used directly as the model folder suffix):
| CLI argument | Result |
|---|---|
| segmentation | SegmentationTrainer(): unchanged |
| segmentation/article/light | SegmentationTrainer("article/light"): same as current segmentation-light |
| segmentation/my-domain | SegmentationTrainer("my-domain"): new, no code changes needed |
Backward compatibility: Keep all existing hardcoded cases. Add a final else that parses {baseModel}/{flavorLabel} dynamically, so existing names such as segmentation-light keep resolving exactly as before.
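A minimal sketch of the dynamic parsing step, assuming the argument is split at the first slash only. ModelArgument and its fields are hypothetical names introduced for illustration; the real TrainerRunner may structure this differently.

```java
import java.util.Objects;

/** Hypothetical helper: splits a CLI model argument into base model and flavor label. */
final class ModelArgument {
    final String baseModel;
    final String flavorLabel; // null when the argument carries no flavor

    private ModelArgument(String baseModel, String flavorLabel) {
        this.baseModel = baseModel;
        this.flavorLabel = flavorLabel;
    }

    /** Splits at the first '/', so "segmentation/article/light" keeps "article/light" intact. */
    static ModelArgument parse(String arg) {
        Objects.requireNonNull(arg, "model argument");
        int slash = arg.indexOf('/');
        if (slash < 0) {
            return new ModelArgument(arg, null);
        }
        String base = arg.substring(0, slash);
        String flavor = arg.substring(slash + 1);
        if (base.isEmpty() || flavor.isEmpty()) {
            throw new IllegalArgumentException(
                "Expected {baseModel}/{flavorLabel}, got: " + arg);
        }
        return new ModelArgument(base, flavor);
    }

    public static void main(String[] args) {
        ModelArgument m = parse("segmentation/article/light");
        System.out.println(m.baseModel + " -> " + m.flavorLabel); // segmentation -> article/light
    }
}
```

In the final else branch, baseModel would select the trainer class and flavorLabel would be passed to the new String constructor. An unrecognized baseModel should still be rejected with a clear error.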
2. String-based constructor in trainer classes
Add a String flavorLabel constructor to SegmentationTrainer, HeaderTrainer, and FulltextTrainer:
    public SegmentationTrainer(String flavorLabel) {
        super(GrobidModels.modelFor("segmentation/" + flavorLabel));
        // Flavor.fromLabel returns null for unknown flavors → default parser is used
        this.flavor = Flavor.fromLabel(flavorLabel);
    }
3. Flavor enum unchanged
The Flavor enum in GrobidModels.java stays as the registry for named flavors with custom parsers, but is no longer the only way to express a flavor. Unknown flavor strings fall back to the default parser.
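The fallback behaviour can be illustrated with a self-contained sketch. DemoFlavor stands in for the real Flavor enum, and both parser names are hypothetical placeholders. Only the article/light label and the null-for-unknown convention come from this issue.

```java
/** Illustration only: an unknown flavor label resolves to null and therefore to the default parser. */
final class FlavorFallbackSketch {

    /** Stand-in for the real Flavor enum; only the "article/light" label is taken from this issue. */
    enum DemoFlavor {
        ARTICLE_LIGHT("article/light");

        final String label;

        DemoFlavor(String label) {
            this.label = label;
        }

        /** Returns the registered flavor, or null when the label is unknown. */
        static DemoFlavor fromLabel(String label) {
            for (DemoFlavor f : values()) {
                if (f.label.equals(label)) {
                    return f;
                }
            }
            return null;
        }
    }

    /** Picks a parser name; a null flavor selects the default parser. Both names are hypothetical. */
    static String parserFor(DemoFlavor flavor) {
        if (flavor == null) {
            return "DefaultSaxParser";
        }
        switch (flavor) {
            case ARTICLE_LIGHT:
                return "ArticleLightSaxParser";
            default:
                return "DefaultSaxParser";
        }
    }

    public static void main(String[] args) {
        // "my-domain" is not registered, so the default parser is chosen.
        System.out.println(parserFor(DemoFlavor.fromLabel("my-domain"))); // DefaultSaxParser
    }
}
```

Each trainer must treat a null flavor as "use the default parser" rather than dereferencing it, so the existing parser-selection code would need a null check wherever it switches on the flavor.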
Files to modify
grobid-trainer/src/main/java/org/grobid/trainer/TrainerRunner.java
grobid-trainer/src/main/java/org/grobid/trainer/SegmentationTrainer.java
grobid-trainer/src/main/java/org/grobid/trainer/HeaderTrainer.java
grobid-trainer/src/main/java/org/grobid/trainer/FulltextTrainer.java
Verification
- Build: ./gradlew :grobid-trainer:shadowJar --no-daemon
- Existing flavor still works: ... 0 segmentation-light -gH grobid-home
- New format equivalent: ... 0 segmentation/article/light -gH grobid-home
- Arbitrary new flavor (with corpus dir present): ... 0 segmentation/my-domain -gH grobid-home resolves corpus at resources/dataset/segmentation/my-domain/corpus