grobidOrg · lfoppiano · Jan 19, 2026 · Mar 3, 2026 · Mar 4, 2026 · Mar 4, 2026
diff --git a/doc/Grobid-service.md b/doc/Grobid-service.md
@@ -176,6 +176,7 @@ Extract the header of the input PDF document, normalize it and convert it into a
 |           |                       |                   | `includeRawCopyrights`   | optional         | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result).                                                               |
 |           |                       |                   | `start`                  | optional         | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF)                                                             |
 |           |                       |                   | `end`                    | optional         | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `2`, end with the last page of the PDF)                                                                        |
+|           |                       |                   | `typedAreas`             | optional         | JSON array specifying areas with coordinates and types for specialized processing (see [Typed Areas](#typed-areas) below)                                                                                      |
 
 
 Use `Accept: application/x-bibtex` to retrieve BibTeX format instead of XML TEI. 
@@ -229,6 +230,7 @@ Convert the complete input document into TEI XML format (header, body and biblio
 |           |                       |                      | `start`                  | optional        | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF)                                                                               |
 |           |                       |                      | `end`                    | optional        | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF)                                                                                         |
 |           |                       |                      | `flavor`                 | optional        | Indicate which flavor to apply for structuring the document. Useful when the default structuring cannot be applied to a specific document (e.g. the body is empty. More technical details and available flavor names in the [dedicated page](Grobid-specialized-processes.md). |
+|           |                       |                      | `typedAreas`             | optional        | JSON array specifying areas with coordinates and types for specialized processing (see [Typed Areas](#typed-areas) below)                                                                                                 |
 
 Response status codes:
 
@@ -291,6 +293,7 @@ Extract and convert all the bibliographical references present in the input docu
 | POST, PUT | `multipart/form-data` | `application/xml`  | `input`                | required      | PDF file to be processed |
 |           |                       |                    | `consolidateCitations` | optional      | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). |
 |           |                       |                    | `includeRawCitations`  | optional      | `includeRawCitations` is a boolean value, `0` (default. do not include raw reference string in the result) or `1` (include raw reference string in the result). |
+|           |                       |                    | `typedAreas`           | optional      | JSON array specifying areas with coordinates and types for specialized processing (see [Typed Areas](#typed-areas) below) |
 
 Use `Accept: application/x-bibtex` to retrieve BibTeX instead of TEI.
 
@@ -318,6 +321,135 @@ It is possible to include the original raw reference string in the parsed result
 curl -v --form input=@./thefile.pdf --form includeRawCitations=1 localhost:8070/api/processReferences
 ```
 
+## Typed Areas
+
+The typed areas feature lets you mark regions of a PDF page for specialized processing. Instead of relying solely on automatic detection, you can pre-identify areas that contain figures or tables, or that should be ignored. This gives you direct control over how those regions are handled and can improve accuracy where automatic detection is unreliable.
+
+### Supported Area Types
+
+- **`figure`**: Areas containing figures/diagrams that will be processed with the specialized figure model
+- **`table`**: Areas containing tables that will be processed with the specialized table model
+- **`ignore`**: Areas that should be completely excluded from all processing
+
+### JSON Format
+
+The `typedAreas` parameter expects a JSON array with the following structure:
+
+```json
+[
+  {
+    "page": 1,
+    "x": 100.0,
+    "y": 200.0,
+    "width": 300.0,
+    "height": 150.0,
+    "type": "figure"
+  },
+  {
+    "page": 1,
+    "x": 450.0,
+    "y": 200.0,
+    "width": 250.0,
+    "height": 200.0,
+    "type": "table"
+  },
+  {
+    "page": 1,
+    "x": 50.0,
+    "y": 500.0,
+    "width": 500.0,
+    "height": 100.0,
+    "type": "ignore"
+  }
+]
+```
+
+### Parameters
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `page` | integer | Yes | Page number (1-based, following PDF convention) |
+| `x` | number | Yes | X-coordinate of the upper-left corner of the area |
+| `y` | number | Yes | Y-coordinate of the upper-left corner of the area |
+| `width` | number | Yes | Width of the area |
+| `height` | number | Yes | Height of the area |
+| `type` | string | Yes | Area type: `"figure"`, `"table"`, or `"ignore"` |
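Because areas with missing fields or unknown types are skipped on the server with only a warning, it can help to check the array on the client before sending it. The following is a minimal sketch based only on the table above; `validate_typed_areas` is a hypothetical helper, not part of GROBID or any client library.

```python
ALLOWED_TYPES = {"figure", "table", "ignore"}
NUMERIC_FIELDS = ("x", "y", "width", "height")


def validate_typed_areas(areas):
    """Return a list of problems found in `areas`; an empty list means valid."""
    problems = []
    for i, area in enumerate(areas):
        page = area.get("page")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(page, int) or isinstance(page, bool) or page < 1:
            problems.append(f"area {i}: 'page' must be an integer >= 1")
        for field in NUMERIC_FIELDS:
            value = area.get(field)
            if not isinstance(value, (int, float)) or isinstance(value, bool):
                problems.append(f"area {i}: '{field}' must be a number")
        if area.get("type") not in ALLOWED_TYPES:
            problems.append(
                f"area {i}: 'type' must be one of {sorted(ALLOWED_TYPES)}"
            )
    return problems


areas = [
    {"page": 1, "x": 100, "y": 200, "width": 300, "height": 150, "type": "figure"},
    # Deliberately broken: page 0, no height, unsupported type.
    {"page": 0, "x": 450, "y": 200, "width": 250, "type": "chart"},
]
for problem in validate_typed_areas(areas):
    print(problem)
```

Returning a list of messages rather than raising on the first error makes it possible to report every problem in one pass.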
+
+### Coordinate System
+
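+Coordinates are measured from the upper-left corner of the page. Note that this differs from native PDF user space, whose origin is the lower-left corner:
+
+- **Origin**: Upper-left corner of the page
+- **Units**: Points (1/72 inch)
+- **Page numbering**: 1-based (first page is page 1)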
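Some PDF tools report bounding boxes with a lower-left origin, and boxes drawn on a rendered page image are in pixels rather than points. The sketch below shows the arithmetic for both conversions, assuming the y-axis of typed areas grows downward from the top of the page. The helper names `to_upper_left` and `pixels_to_points` are hypothetical, and the page height must come from whatever tool you use to read the PDF.

```python
def to_upper_left(x, y_bottom, width, height, page_height):
    """Convert a box given with a lower-left origin to an upper-left origin.

    (x, y_bottom) is the lower-left corner of the box measured from the
    bottom of the page; the returned (x, y) is the upper-left corner of the
    same box measured from the top of the page.
    """
    return {
        "x": x,
        "y": page_height - (y_bottom + height),
        "width": width,
        "height": height,
    }


def pixels_to_points(value, dpi):
    """Convert a length measured on a page image rendered at `dpi` to points."""
    return value * 72.0 / dpi


# US Letter page (612 x 792 pt); box whose lower-left corner is at (100, 442).
box = to_upper_left(100, 442, 300, 150, page_height=792)
print(box)                             # {'x': 100, 'y': 200, 'width': 300, 'height': 150}
print(pixels_to_points(300, dpi=150))  # 144.0
```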
+
+### Processing Behavior
+
+**Figure areas**:
+
+- Tokens inside a figure area are removed from the main text flow
+- They are passed directly to the specialized FigureParser model, bypassing the segmentation model
+- The result is integrated into the TEI output as a structured figure element
+
+**Table areas**:
+
+- Tokens inside a table area are removed from the main text flow
+- They are passed directly to the specialized TableParser model, bypassing the segmentation model
+- The result is integrated into the TEI output as a structured table element
+
+**Ignore areas**:
+
+- Tokens inside an ignore area are discarded
+- No further processing is performed on these regions
+- Useful for excluding headers, footers, watermarks, or other unwanted content
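As an illustration of the header and footer use case, the sketch below generates one `ignore` band at the top and one at the bottom of every page. It assumes all pages share the same size and that a fixed margin height is adequate; `margin_ignore_areas` and its default margins are hypothetical choices, not values recommended by GROBID.

```python
import json


def margin_ignore_areas(num_pages, page_width, page_height, top=50.0, bottom=50.0):
    """Build ignore areas covering a top and a bottom band on each page."""
    areas = []
    for page in range(1, num_pages + 1):  # pages are 1-based
        # Top band: starts at the upper-left corner of the page.
        areas.append({"page": page, "x": 0.0, "y": 0.0,
                      "width": page_width, "height": top, "type": "ignore"})
        # Bottom band: its upper edge sits `bottom` points above the page end.
        areas.append({"page": page, "x": 0.0, "y": page_height - bottom,
                      "width": page_width, "height": bottom, "type": "ignore"})
    return areas


# Two US Letter pages (612 x 792 pt).
areas = margin_ignore_areas(2, 612.0, 792.0)
typed_areas_param = json.dumps(areas)  # value for the `typedAreas` form field
print(len(areas))
```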
+
+### Usage Examples
+
+**cURL example with typed areas:**
+```bash
+curl -v -H "Accept: application/xml" \
+  --form input=@./document.pdf \
+  --form typedAreas='[
+    {"page": 1, "x": 100, "y": 200, "width": 300, "height": 150, "type": "figure"},
+    {"page": 1, "x": 450, "y": 200, "width": 250, "height": 200, "type": "table"}
+  ]' \
+  localhost:8070/api/processFulltextDocument
+```
+
+**Python example:**
+```python
+import requests
+import json
+
+typed_areas = [
+    {"page": 1, "x": 100, "y": 200, "width": 300, "height": 150, "type": "figure"},
+    {"page": 1, "x": 450, "y": 200, "width": 250, "height": 200, "type": "table"}
+]
+
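+# The PDF goes in the multipart `input` field; typedAreas is sent as a JSON string
+with open('document.pdf', 'rb') as f:
+    files = {'input': f}
+    data = {'typedAreas': json.dumps(typed_areas)}
+    response = requests.post(
+        'http://localhost:8070/api/processFulltextDocument',
+        files=files,
+        data=data,
+        headers={'Accept': 'application/xml'}
+    )
+
+# Fail loudly on HTTP errors (e.g. 400 for malformed JSON), then show the TEI
+response.raise_for_status()
+print(response.text)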
+```
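One way to check that the typed areas had an effect is to count the figure and table elements in the returned TEI. The sketch below assumes that figures appear as `<figure>` elements and tables as `<figure type="table">` in the TEI namespace; treat that as an assumption to verify against your own output. `count_figures_and_tables` is a hypothetical helper, and the sample document is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def count_figures_and_tables(tei_xml):
    """Return (figure_count, table_count) for a TEI document string or bytes."""
    root = ET.fromstring(tei_xml)
    figures = tables = 0
    for fig in root.iterfind(".//tei:figure", TEI_NS):
        # Assumption: tables are encoded as <figure type="table">.
        if fig.get("type") == "table":
            tables += 1
        else:
            figures += 1
    return figures, tables


# Hand-written sample, not real service output.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<figure xml:id="fig_0"><head>Figure 1</head></figure>
<figure type="table" xml:id="tab_0"><head>Table 1</head></figure>
</body></text></TEI>"""
print(count_figures_and_tables(sample))  # (1, 1)
```

With the request above, the same function could be called on `response.content`.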
+
+### Benefits
+
+1. **Improved Accuracy**: Pre-identified figures and tables bypass the segmentation model, reducing detection errors
+2. **Better Quality**: Applying the specialized models directly to known area types can produce higher-quality results
+3. **Performance**: Known areas skip model applications they do not need, which can make processing more efficient
+4. **Control**: Precise control over which regions are processed and how
+5. **Integration**: Seamlessly integrated into existing TEI output structure
+
+### Error Handling
+
+- Invalid JSON results in an HTTP 400 error
+- Areas with an invalid type are skipped and logged as a warning
+- Coordinates outside the page boundaries are clamped to the valid range
+- Areas missing a required field are skipped and logged as a warning
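Since clamping happens silently, a client may want to know in advance which areas would be altered. The sketch below shows one plausible reading of "clamped to valid ranges": each edge of the rectangle is limited to the page box. It is an illustration only, not GROBID's actual implementation, and `clamp_area` is a hypothetical helper.

```python
def clamp_area(area, page_width, page_height):
    """Return a copy of `area` with all four edges limited to the page box."""
    left = min(max(area["x"], 0.0), page_width)
    top = min(max(area["y"], 0.0), page_height)
    right = min(max(area["x"] + area["width"], 0.0), page_width)
    bottom = min(max(area["y"] + area["height"], 0.0), page_height)
    # dict(area, ...) copies the input and overrides the geometry fields.
    return dict(area, x=left, y=top, width=right - left, height=bottom - top)


# Area that sticks out past the right and bottom edges of a 612 x 792 pt page.
area = {"page": 1, "x": 500, "y": 700, "width": 200, "height": 200, "type": "figure"}
clamped = clamp_area(area, 612.0, 792.0)
if clamped != area:
    print("area would be clamped to", clamped)
```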
+
 ### Raw text to TEI conversion services
 
 #### /api/processDate