Draft
58 commits
f35c280
add interface to ONNX models
lfoppiano Jan 2, 2026
d5a1faa
force wapiti for tests using models interaction
lfoppiano Jan 2, 2026
062378a
fix: test needs to force wapiti
lfoppiano Jan 2, 2026
05df8c5
fix: tests that need to use wapiti
lfoppiano Jan 2, 2026
2354913
fix: update wapiti force into service too
lfoppiano Jan 2, 2026
4cf6428
feat: migrate preload-embeddings to store float32 instead of pickle
lfoppiano Jan 2, 2026
a378fdb
feat: update dockerfile
lfoppiano Jan 2, 2026
6740cd7
feat: add more tests
lfoppiano Jan 3, 2026
1b3997d
fix: Add required --add-opens JVM arguments for LMDB compatibility on…
lfoppiano Jan 3, 2026
58aa749
fix: move chunking of large sequences in the onnx engine
lfoppiano Jan 6, 2026
9999aa6
refactor: update onnx models, remove unused delft models
lfoppiano Jan 6, 2026
77acb04
fix: defensive programming against longer keys
lfoppiano Jan 6, 2026
eb56e53
feat: Update LMDB dependency to 0.9.2
lfoppiano Jan 6, 2026
d89355d
fix: enhance LMDB database error handling by catching specific except…
lfoppiano Jan 6, 2026
dd11f93
fix: update onnx runtime latest version
lfoppiano Jan 7, 2026
8fe9e90
feat: add integration tests
lfoppiano Jan 7, 2026
0ad2596
chore: Add JVM argument for LMDB `sun.nio.ch` access in tests.
lfoppiano Jan 7, 2026
baee295
feat: Refactor annotation process by consolidating feature extraction…
lfoppiano Jan 7, 2026
7430273
feat: Add validation for word embedding format to ensure raw float32 …
lfoppiano Jan 7, 2026
3ffe5a3
update evaluation metrics
lfoppiano Jan 7, 2026
95053fe
update header model
lfoppiano Jan 7, 2026
38aaa74
update header model without COI and AC
lfoppiano Jan 7, 2026
d78ee6e
add BidLSTM_ChainCRF_FEATURES models
lfoppiano Jan 7, 2026
e0287cd
Add GPU-ready library for linux
lfoppiano Jan 8, 2026
ad269e7
add classification models with ONNX
lfoppiano Jan 8, 2026
774e9fa
fix: concurrency
lfoppiano Jan 8, 2026
ed4e9c4
fix: move licence classifier debug information in DEBUG
lfoppiano Jan 8, 2026
f864a15
feat: update ONNX Dockerfile and add CI for building
lfoppiano Jan 8, 2026
ae4dd99
feat: build automatically onnx image on this branch
lfoppiano Jan 8, 2026
f239aa6
fix: CI build
lfoppiano Jan 8, 2026
42912dd
fix: remove onnx models from the crf only image
lfoppiano Jan 8, 2026
a72c7c7
fix: lmdb path
lfoppiano Jan 8, 2026
18d6fc2
fix: make the CI build work
lfoppiano Jan 8, 2026
311e804
fix: JAVA OPS
lfoppiano Jan 9, 2026
884001d
fix: get the model name from the right configuration block
lfoppiano Jan 9, 2026
e07d5f8
fix: classification models configuration
lfoppiano Jan 9, 2026
ab4c088
feat(models): update header models
lfoppiano Jan 26, 2026
20929b1
feat(performances): Add sequential multithread onnx
lfoppiano Jan 10, 2026
9aef18f
fix(concurrency): tune the onnx concurrency
lfoppiano Jan 12, 2026
5c9cba7
fix: use modular approach in inference
lfoppiano Jan 13, 2026
e428bbc
feat: optimize inference
lfoppiano Jan 13, 2026
e0e5735
fix: use the CPU-only library
lfoppiano Jan 13, 2026
85be10a
feat: re-export onnx models with dynamic batch processing
lfoppiano Jan 13, 2026
8b0c676
fix: set ONNX session threads to 1 to prevent CPU oversubscription
lfoppiano Jan 27, 2026
5c1731e
fix: improve word embedding performances, add cache
lfoppiano Jan 27, 2026
e0c22e6
refactor: rename onnx classes, uniform classification and sequence la…
lfoppiano Jan 27, 2026
7441ed0
chore: enable debug logging for word embeddings and optimize cache st…
lfoppiano Jan 28, 2026
900d690
Merge pull request #1359 from grobidOrg/feature/onnx-models-perfs
lfoppiano Jan 28, 2026
34889d8
fix: update model missing the dynamic parameter detection
lfoppiano Jan 31, 2026
1d4afe6
fix: enhance feature extraction in OnnxSequenceLabellingModel
lfoppiano Feb 1, 2026
274556b
feat: add warning when the sequence is truncated (and not chunked)
lfoppiano Feb 1, 2026
363fe7c
feat: update the header model with one trained for longer
lfoppiano Feb 1, 2026
8b70916
chore: automatic build for this branch
lfoppiano Feb 1, 2026
772c5e1
fix: adjust feature extraction logic to align with 1-based indices
lfoppiano Feb 3, 2026
932f688
fix: update header model
lfoppiano Feb 4, 2026
1dd3798
feat: add some unit tests
lfoppiano Feb 4, 2026
7ae9dd8
fix: read returnChars from configuration and correctly set it
lfoppiano Feb 4, 2026
7481641
fix: update vocab.json
lfoppiano Feb 4, 2026
79 changes: 79 additions & 0 deletions .github/workflows/ci-build-manual-onnx.yml
@@ -0,0 +1,79 @@
name: Build and push an ONNX docker image

# This workflow builds the lightweight ONNX/Wapiti-only Docker image
# (no Python/DeLFT/TensorFlow dependencies)

on:
  push:
    branches:
      - feature/onnx-models
  workflow_dispatch:
    inputs:
      custom_tag:
        type: string
        description: Docker image tag
        required: true
        default: "latest-onnx"

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v5
        with:
          fetch-tags: true
          fetch-depth: 0
      - name: Set up JDK 21
        uses: actions/setup-java@v5
        with:
          java-version: '21'
          distribution: 'temurin'
          cache: 'gradle'
      - name: Build with Gradle
        run: ./gradlew build -x test

  docker-build-onnx:
    needs: [ build ]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
      - uses: actions/checkout@v5
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v5
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME_LFOPPIANO }}
          password: ${{ secrets.DOCKERHUB_TOKEN_LFOPPIANO }}
          image: lfoppiano/grobid
          registry: docker.io
          pushImage: true
          tags: latest-onnx, ${{ github.event.inputs.custom_tag || github.sha }}
          dockerfile: Dockerfile.onnx
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
      - name: Docker Image Summary
        run: |
          echo "## 🐳 Docker Image Uploaded Successfully" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "**Image Details:**" >> $GITHUB_STEP_SUMMARY
          echo "- **Registry:** docker.io" >> $GITHUB_STEP_SUMMARY
          echo "- **Image:** lfoppiano/grobid" >> $GITHUB_STEP_SUMMARY
          echo "- **Type:** ONNX/Wapiti only (lightweight, no Python/DeLFT)" >> $GITHUB_STEP_SUMMARY
          echo "- **Tags:**" >> $GITHUB_STEP_SUMMARY
          echo "  - \`latest-onnx\`" >> $GITHUB_STEP_SUMMARY
          echo "  - \`${{ github.event.inputs.custom_tag || github.sha }}\`" >> $GITHUB_STEP_SUMMARY
          echo "- **Digest:** \`${{ steps.docker_build.outputs.digest }}\`" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "**Features:**" >> $GITHUB_STEP_SUMMARY
          echo "- ONNX Runtime for deep learning models (CPU only)" >> $GITHUB_STEP_SUMMARY
          echo "- Wapiti CRF for traditional models" >> $GITHUB_STEP_SUMMARY
          echo "- No Python, TensorFlow, or DeLFT dependencies" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "**Usage:**" >> $GITHUB_STEP_SUMMARY
          echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
          echo "docker pull lfoppiano/grobid:latest-onnx" >> $GITHUB_STEP_SUMMARY
          echo "docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:latest-onnx" >> $GITHUB_STEP_SUMMARY
          echo "\`\`\`" >> $GITHUB_STEP_SUMMARY
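Review note: the `tags` value in the "Build and push" step combines a fixed `latest-onnx` tag with `github.event.inputs.custom_tag || github.sha`. The sketch below mimics that fallback in Python so the resulting tag list is easy to reason about. It assumes that a missing or empty `custom_tag` (as on a `push` event) is treated as falsy by `||`; `resolve_tags` is a hypothetical helper for illustration and is not part of the workflow.

```python
from typing import List, Optional


def resolve_tags(custom_tag: Optional[str], sha: str) -> List[str]:
    """Mimic `latest-onnx, ${{ github.event.inputs.custom_tag || github.sha }}`.

    Assumption: an absent or empty input falls through to the commit SHA.
    """
    second = custom_tag if custom_tag else sha
    # The workflow does not deduplicate, so neither does this sketch.
    return ["latest-onnx", second]


if __name__ == "__main__":
    # push event: no workflow_dispatch input is available
    print(resolve_tags(None, "7481641"))
    # manual dispatch with the default input value
    print(resolve_tags("latest-onnx", "7481641"))
```

Under these assumptions, a manual dispatch that keeps the default input yields `latest-onnx` twice, so a SHA-tagged image would only be produced by push events or by passing an explicit tag.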
2 changes: 2 additions & 0 deletions .gitignore
@@ -89,3 +89,5 @@ Dockerfile.dataseer
Dockerfile.software
Dockerfile.datastet
.run

.kotlin
5 changes: 3 additions & 2 deletions Dockerfile.crf
@@ -49,8 +49,9 @@ RUN rm -rf grobid-home/lib/lin-32
RUN rm -rf grobid-home/lib/win-*
RUN rm -rf grobid-home/lib/mac-64

-# cleaning Delft models
-RUN rm -rf grobid-home/models/*-BidLSTM_CRF*
+# cleaning Delft and ONNX models
+RUN rm -rf grobid-home/models/*-BidLSTM*
+RUN rm -rf grobid-home/models/*.onnx

ENV GROBID_SERVICE_OPTS="-Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep"
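Review note: this hunk of `Dockerfile.crf` replaces the cleanup glob `*-BidLSTM_CRF*` with the broader `*-BidLSTM*`. The sketch below uses Python's `fnmatch` module, whose `*` wildcard behaves like the shell's for names without slashes, to show which entries each pattern would catch. The directory names are hypothetical, chosen only to resemble the architecture names mentioned in the commit list.

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "*-BidLSTM_CRF*"
NEW_PATTERN = "*-BidLSTM*"

# Hypothetical model directory names, for illustration only
MODEL_DIRS = [
    "header-BidLSTM_CRF_FEATURES",
    "header-BidLSTM_ChainCRF_FEATURES",
    "citation-BidLSTM_CRF",
]


def matched(pattern, names):
    """Return the names that the glob pattern would select."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    print("old:", matched(OLD_PATTERN, MODEL_DIRS))
    print("new:", matched(NEW_PATTERN, MODEL_DIRS))
```

With these example names, the old pattern misses the `BidLSTM_ChainCRF` entry while the new one selects all three, which is consistent with the intent of keeping the CRF-only image free of deep learning models.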

117 changes: 117 additions & 0 deletions Dockerfile.onnx
@@ -0,0 +1,117 @@
## Docker GROBID image using ONNX models and/or Wapiti CRF models
## This is a lightweight image without Python/TensorFlow/DeLFT/JEP dependencies
## Uses ONNX Runtime (CPU only)

## See https://grobid.readthedocs.io/en/latest/Grobid-docker/

## usage example with version 0.8.0:
## docker build -t grobid/grobid:0.8.0-onnx --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.onnx .

## run:
## docker run -t --rm --init -p 8070:8070 -p 8071:8071 grobid/grobid:0.8.0-onnx

# -------------------
# build builder image
# -------------------

FROM eclipse-temurin:21-jdk AS builder

USER root

RUN apt-get update && \
apt-get -y upgrade && \
apt-get -y --no-install-recommends install unzip git python3 python3-pip

WORKDIR /opt/grobid-source

# gradle
COPY gradle/ ./gradle/
COPY gradlew ./
COPY gradle.properties ./
COPY build.gradle ./
COPY settings.gradle ./

# git
COPY .git/ ./.git

# source
COPY grobid-home/ ./grobid-home/
COPY grobid-core/ ./grobid-core/
COPY grobid-service/ ./grobid-service/
COPY grobid-trainer/ ./grobid-trainer/

# cleaning unused native libraries before packaging
RUN rm -rf grobid-home/pdf2xml
RUN rm -rf grobid-home/pdfalto/lin-32
RUN rm -rf grobid-home/pdfalto/mac-64
RUN rm -rf grobid-home/pdfalto/mac_arm-64
RUN rm -rf grobid-home/pdfalto/win-*
RUN rm -rf grobid-home/lib/lin-32
RUN rm -rf grobid-home/lib/win-*
RUN rm -rf grobid-home/lib/mac-64
RUN rm -rf grobid-home/lib/lin-64/jep

# Use ONNX configuration (no DeLFT models)
RUN rm grobid-home/config/grobid.yaml && \
mv grobid-home/config/grobid-onnx.yaml grobid-home/config/grobid.yaml

RUN ./gradlew clean assemble --no-daemon --info --stacktrace

# Preload embeddings in raw float32 format for ONNX inference
# Using standalone script that doesn't require DeLFT
RUN pip3 install --no-cache-dir --break-system-packages lmdb requests
COPY grobid-home/scripts/preload_embeddings_standalone.py .
COPY grobid-home/config/resources-registry.json .
RUN python3 preload_embeddings_standalone.py --registry ./resources-registry.json

WORKDIR /opt/grobid
RUN unzip -o /opt/grobid-source/grobid-service/build/distributions/grobid-service-*.zip && \
mv grobid-service* grobid-service
RUN unzip -o /opt/grobid-source/grobid-home/build/distributions/grobid-home-*.zip && \
chmod -R 755 /opt/grobid/grobid-home/pdfalto

# Move preloaded embeddings to final location
RUN mkdir -p /opt/grobid/data/db && \
mv /opt/grobid-source/data/db/* /opt/grobid/data/db/

# -------------------
# build runtime image
# -------------------

FROM eclipse-temurin:21-jre

# setting locale
ENV LANG=C.UTF-8

# Install minimal runtime dependencies
RUN apt-get update && \
apt-get -y --no-install-recommends install \
libxml2 libfontconfig \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /opt/grobid

COPY --from=builder /opt/grobid .

# Add Tini
ENV TINI_VERSION=v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "-s", "--"]

WORKDIR /opt/grobid

ENV JAVA_OPTS="-Xmx4g --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED"


CMD ["./grobid-service/bin/grobid-service"]

ARG GROBID_VERSION

LABEL \
authors="The contributors" \
org.label-schema.name="GROBID" \
org.label-schema.description="Image with GROBID service (ONNX/Wapiti only, no DeLFT)" \
org.label-schema.url="https://github.com/kermitt2/grobid" \
org.label-schema.version=${GROBID_VERSION}
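Review note: the builder stage of `Dockerfile.onnx` preloads embeddings "in raw float32 format", and the commit list mentions moving away from pickle and validating the stored format. The diff does not show `preload_embeddings_standalone.py`, so the following is only a sketch of what a raw float32 value could look like: a vector packed as consecutive 4-byte floats, with a length check on read. The little-endian byte order and the function names `encode_vector` and `decode_vector` are assumptions made for illustration, not the script's actual layout or API.

```python
import struct

FLOAT32_SIZE = 4  # bytes per float32 value


def encode_vector(values):
    """Pack a vector as consecutive float32 values (little-endian is assumed)."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_vector(raw, dim):
    """Unpack a raw float32 buffer, rejecting buffers of the wrong size."""
    expected = dim * FLOAT32_SIZE
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes for dim={dim}, got {len(raw)}")
    return list(struct.unpack(f"<{dim}f", raw))


if __name__ == "__main__":
    vector = [0.5, -1.25, 3.0]  # values exactly representable in float32
    raw = encode_vector(vector)
    print(len(raw), decode_vector(raw, len(vector)))
```

Compared with a pickled object, a fixed-width buffer like this can be read from Java without a Python runtime, and a size mismatch is detectable from the byte length alone.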