Allow arbitrary flavor strings for trainer CLI without code changes #1388

@cboulanger

Background

TrainerRunner.java currently contains a hardcoded if-else chain that maps CLI model name strings (e.g., segmentation-dh-law-footnotes) to Java trainer classes and Flavor enum values. Every time a new document-type flavor is added, this file must be modified. Similarly, the Flavor enum in GrobidModels.java must be extended.

Analysis

Hard technical reasons (code changes remain necessary)

The only genuine code-level constraint is SAX parser selection: trainer classes (SegmentationTrainer, HeaderTrainer, FulltextTrainer) select a TEI XML parser based on the Flavor enum, because different document types have structurally different TEI annotations. Adding a genuinely new document structure still requires a new parser class.

What does NOT need to be hardcoded

  • Model/dataset path resolution already works with arbitrary strings: GrobidModels.modelFor(String) creates GrobidModel objects for any string, deriving paths from getFolderName(). No enum entry is needed.
  • The if-else in TrainerRunner: The base model name (e.g., segmentation, header) is what determines the trainer class — this is a small, stable set. The flavor part is what keeps growing unnecessarily.
  • Parser selection in trainers: For new flavors that reuse an existing SAX parser (or fall back to the default), no code changes should be needed at all.

Aim

The aim of this issue is to allow flavor names to be arbitrary strings that identify the models or datasets to be used, independent of the codebase. Concretely:

  • A user creates a new corpus directory, e.g. resources/dataset/segmentation/my-domain/corpus
  • They train on it immediately by passing segmentation/my-domain as the model argument — without any Java code changes
  • Only when a new document type requires a new SAX parser does any code need updating (and only in the relevant trainer class, not in TrainerRunner)
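The corpus lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the class `CorpusPathSketch` and its `corpusDir` method are hypothetical names, and the real resolution happens inside GrobidModels.modelFor(String) and getFolderName(), which are not reproduced here. The sketch only shows that a slash-separated model argument is enough to derive the corpus directory, with no enum lookup.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch, not GROBID code: derives a corpus directory from a model argument. */
final class CorpusPathSketch {

    /**
     * Appends each '/'-separated segment of the model argument to the dataset root,
     * then the fixed "corpus" folder. Empty segments are skipped.
     */
    static Path corpusDir(Path datasetRoot, String modelArg) {
        Path dir = datasetRoot;
        for (String segment : modelArg.split("/")) {
            if (!segment.isEmpty()) {
                dir = dir.resolve(segment);
            }
        }
        return dir.resolve("corpus");
    }

    public static void main(String[] args) {
        // "resources/dataset" mirrors the example path in this issue.
        Path root = Paths.get("resources", "dataset");
        System.out.println(corpusDir(root, "segmentation/my-domain"));
    }
}
```

A flavor-less argument such as segmentation would resolve to the base model's own corpus folder under the same scheme.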

Proposed Changes

1. New CLI argument format for TrainerRunner

Support {baseModel}/{flavorLabel} where the first /-delimited segment identifies the trainer class and the remainder is the flavor path (used directly as the model folder suffix):

CLI argument                 Result
segmentation                 SegmentationTrainer(), unchanged
segmentation/article/light   SegmentationTrainer("article/light"), same as current segmentation-light
segmentation/my-domain       SegmentationTrainer("my-domain"), new, no code changes needed

Backward compatibility: Keep all existing hardcoded cases. Add a final else branch that parses {baseModel}/{flavorLabel} dynamically, splitting the argument at the first / only.
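A minimal sketch of what that final else branch could do with the argument, assuming the split happens at the first slash only. The class `TrainerArgSketch`, its `ParsedArg` holder, and the set of base model names are hypothetical and chosen for illustration; the actual TrainerRunner code may be structured differently.

```java
import java.util.Set;

/** Hypothetical sketch, not GROBID code: splits a trainer CLI argument into base model and flavor. */
final class TrainerArgSketch {

    /** Base model names assumed for this illustration. */
    static final Set<String> KNOWN_BASE_MODELS = Set.of("segmentation", "header", "fulltext");

    /** Result holder; flavorLabel is null when the argument has no flavor part. */
    static final class ParsedArg {
        final String baseModel;
        final String flavorLabel;

        ParsedArg(String baseModel, String flavorLabel) {
            this.baseModel = baseModel;
            this.flavorLabel = flavorLabel;
        }
    }

    /** Splits at the first '/' only, so the flavor may itself contain slashes. */
    static ParsedArg parse(String arg) {
        int slash = arg.indexOf('/');
        String base = slash < 0 ? arg : arg.substring(0, slash);
        String flavor = slash < 0 ? null : arg.substring(slash + 1);
        if (!KNOWN_BASE_MODELS.contains(base)) {
            throw new IllegalArgumentException("Unknown base model: " + base);
        }
        if (flavor != null && flavor.isEmpty()) {
            throw new IllegalArgumentException("Empty flavor label in: " + arg);
        }
        return new ParsedArg(base, flavor);
    }

    public static void main(String[] args) {
        String[] examples = {"segmentation", "segmentation/article/light", "segmentation/my-domain"};
        for (String example : examples) {
            ParsedArg parsed = parse(example);
            System.out.println(example + " -> base=" + parsed.baseModel + ", flavor=" + parsed.flavorLabel);
        }
    }
}
```

Because the legacy hyphenated names (such as segmentation-light) are matched by the existing hardcoded cases first, this parser would only ever see arguments in the new format.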

2. String-based constructor in trainer classes

Add a String flavorLabel constructor to SegmentationTrainer, HeaderTrainer, and FulltextTrainer:

public SegmentationTrainer(String flavorLabel) {
    super(GrobidModels.modelFor("segmentation/" + flavorLabel));
    // Flavor.fromLabel returns null for unknown flavors → default parser is used
    this.flavor = Flavor.fromLabel(flavorLabel);
}

3. Flavor enum unchanged

The Flavor enum in GrobidModels.java stays as the registry for named flavors with custom parsers, but is no longer the only way to express a flavor. Unknown flavor strings fall back to the default parser.
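The fallback can be illustrated with a self-contained sketch. Everything here is a stand-in: the nested `Flavor` enum has a single made-up entry, and the parser names are placeholder strings, not the real GROBID parser classes. The point is only that a null result from fromLabel selects the default parser rather than raising an error.

```java
/** Hypothetical sketch, not GROBID code: unknown flavor labels fall back to a default parser. */
final class FlavorFallbackSketch {

    /** Stand-in for the real Flavor enum; the single entry is illustrative only. */
    enum Flavor {
        ARTICLE_LIGHT("article/light");

        private final String label;

        Flavor(String label) {
            this.label = label;
        }

        /** Returns the matching flavor, or null when the label is not registered. */
        static Flavor fromLabel(String text) {
            for (Flavor flavor : values()) {
                if (flavor.label.equals(text)) {
                    return flavor;
                }
            }
            return null;
        }
    }

    /** Chooses a parser name; placeholder strings stand in for parser classes. */
    static String parserFor(String flavorLabel) {
        Flavor flavor = Flavor.fromLabel(flavorLabel);
        if (flavor == Flavor.ARTICLE_LIGHT) {
            return "LightSaxParser";
        }
        // Null (unregistered label) and any flavor without a custom parser end up here.
        return "DefaultSaxParser";
    }

    public static void main(String[] args) {
        System.out.println(parserFor("article/light"));
        System.out.println(parserFor("my-domain"));
    }
}
```

Under this scheme, registering a new enum entry is needed only when a flavor requires its own parser.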

Files to modify

  • grobid-trainer/src/main/java/org/grobid/trainer/TrainerRunner.java
  • grobid-trainer/src/main/java/org/grobid/trainer/SegmentationTrainer.java
  • grobid-trainer/src/main/java/org/grobid/trainer/HeaderTrainer.java
  • grobid-trainer/src/main/java/org/grobid/trainer/FulltextTrainer.java

Verification

  1. Build: ./gradlew :grobid-trainer:shadowJar --no-daemon
  2. Existing flavor still works: ... 0 segmentation-light -gH grobid-home
  3. New format equivalent: ... 0 segmentation/article/light -gH grobid-home
  4. Arbitrary new flavor (with corpus dir present): ... 0 segmentation/my-domain -gH grobid-home, which should resolve the corpus at resources/dataset/segmentation/my-domain/corpus
