Docling is the document parsing library that Paper2Poster uses to extract structured information from PDF papers. Beyond plain text, it extracts tables, figures, layout information, and document structure.
The DocumentConverter is initialized in parse_raw.py with specific pipeline options:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

IMAGE_RESOLUTION_SCALE = 5.0

pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE  # 5.0
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

Key Configuration:
- images_scale = 5.0: Renders images at 360 DPI (72 DPI × 5) for better quality
- generate_page_images = True: Creates full page images for visual reference
- generate_picture_images = True: Extracts individual figures/charts from the document
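To make the scale arithmetic concrete, here is a minimal sketch of how a scale factor maps to an effective DPI and to pixel dimensions. It assumes the PDF convention of 72 points per inch. The helper names and the US Letter page size are illustrative and are not part of parse_raw.py.

```python
BASE_DPI = 72  # PDF user space: 72 points per inch


def effective_dpi(scale: float) -> float:
    """Effective resolution when a page is rendered at `scale` times 72 DPI."""
    return BASE_DPI * scale


def scaled_page_pixels(width_pt: float, height_pt: float, scale: float) -> tuple[int, int]:
    """Pixel size of a rendered page whose size is given in PDF points."""
    return round(width_pt * scale), round(height_pt * scale)


if __name__ == "__main__":
    # US Letter is 612 x 792 points (8.5 x 11 inches); hypothetical example page
    print(effective_dpi(5.0))
    print(scaled_page_pixels(612, 792, 5.0))
```

Pixel dimensions grow linearly with the scale, so the pixel count (and roughly the memory per page image) grows with its square. That is worth keeping in mind when choosing a value as high as 5.0.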
When doc_converter.convert(raw_source) is called, Docling performs multiple analyses:
Layout analysis:
- Uses an RT-DETR model to detect layout components
- Identifies titles, section headers, text, tables, figures, captions, formulas, list items, and page headers/footers
- Creates a bounding box for each detected element

Text extraction:
- Extracts text while preserving reading order
- Maintains hierarchical structure (sections, subsections)
- Handles multi-column layouts

Table detection:
- Detects tables using the layout model
- Uses the TableFormer model for structure analysis
- Extracts table cells, rows, columns, and their relationships
- Preserves table captions

Figure detection:
- Identifies figures, charts, and diagrams
- Extracts images with their captions
- Maintains figure-caption associations
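As a rough illustration of why bounding boxes matter for reading order, the sketch below orders detected elements on a simple two-column page. This is a naive heuristic written for this document, not Docling's actual reading-order logic. The `Element` class, its fields, and the top-left coordinate origin are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Element:
    label: str   # e.g. "section_header", "text", "table" (hypothetical labels)
    x0: float    # left edge
    y0: float    # top edge (origin assumed at the top-left corner)
    x1: float    # right edge
    y1: float    # bottom edge


def two_column_reading_order(elements: list[Element], page_width: float) -> list[Element]:
    """Order elements left column first, then right column, each top to bottom."""

    def sort_key(e: Element) -> tuple[int, float]:
        center_x = (e.x0 + e.x1) / 2
        column = 0 if center_x < page_width / 2 else 1
        return (column, e.y0)

    return sorted(elements, key=sort_key)
```

A real layout pipeline also has to handle elements that span both columns (titles, wide figures), which this sketch ignores.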
raw_markdown = raw_result.document.export_to_markdown()

- Clean, structured text representation
- Preserves document hierarchy
- Used as input for the LLM to understand paper content
- Comments are stripped: markdown_clean_pattern.sub("", raw_markdown)
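The definition of markdown_clean_pattern is not shown here. A minimal sketch, assuming it matches HTML comments (for example `<!-- image -->` placeholders that can appear in exported Markdown), could look like this. The regular expression and the `clean_markdown` helper are assumptions for illustration.

```python
import re

# Assumed pattern: matches HTML comments, including ones that span several lines
markdown_clean_pattern = re.compile(r"<!--[\s\S]*?-->")


def clean_markdown(raw_markdown: str) -> str:
    """Remove HTML comments from exported Markdown, leaving other text untouched."""
    return markdown_clean_pattern.sub("", raw_markdown)
```

The non-greedy `*?` matters here: a greedy match would delete everything between the first `<!--` and the last `-->` in the document.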
for page_no, page in conv_res.document.pages.items():
    page_image_filename = output_dir / f"{doc_filename}-{page_no}.png"
    # Open the target file in binary mode so that fp is defined for the save call
    with page_image_filename.open("wb") as fp:
        page.image.pil_image.save(fp, format="PNG")

- Full page renderings at high resolution
- Used for visual reference and debugging
- Stored for potential manual review
for element, _level in conv_res.document.iterate_items():
    if isinstance(element, TableItem):
        table_counter += 1
        # fp is a writable binary file handle for this table's PNG
        element.get_image(conv_res.document).save(fp, "PNG")

Each table is processed to extract:
- Caption: table.caption_text(conv_res.document)
- Visual representation: High-resolution PNG image
- Dimensions: Width and height, for layout planning
- Aspect ratio: For optimal poster placement
- Table structure: Cells, rows, columns (not directly used in the current implementation)
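The dimension and aspect-ratio fields above can be derived from the saved image size alone. The following is a minimal sketch; the function name and the dictionary keys are hypothetical and do not reflect the exact record format used by Paper2Poster.

```python
def image_layout_info(width_px: int, height_px: int, caption: str) -> dict:
    """Collect the size-related fields a layout planner might need for one visual."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("width and height must be positive")
    return {
        "caption": caption,
        "width": width_px,
        "height": height_px,
        # Aspect ratio as width / height: > 1 is landscape, < 1 is portrait
        "aspect": width_px / height_px,
    }
```

Storing the ratio as width over height makes it easy to compute a display height from an allotted width (`height = width / aspect`) during layout.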
# Inside the same iterate_items() loop as above
    if isinstance(element, PictureItem):
        picture_counter += 1
        element.get_image(conv_res.document).save(fp, "PNG")

Each figure/image is processed to extract:
- Caption: Associated descriptive text
- High-res image: Extracted at 5× scale
- Metadata: Dimensions, aspect ratio, file size
- Position: Original location in document
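The table and figure loops share one pattern: walk every item, dispatch on its type, and keep a separate counter per type. The sketch below reproduces that pattern with stand-in classes so it runs without Docling. `StubTable`, `StubPicture`, and the output filename scheme are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StubTable:      # stand-in for a table item
    caption: str


@dataclass
class StubPicture:    # stand-in for a picture item
    caption: str


def plan_output_files(items, output_dir: Path, doc_filename: str) -> list[Path]:
    """Assign one PNG path per table or picture, numbering each type separately."""
    table_counter = 0
    picture_counter = 0
    paths = []
    for element in items:
        if isinstance(element, StubTable):
            table_counter += 1
            paths.append(output_dir / f"{doc_filename}-table-{table_counter}.png")
        elif isinstance(element, StubPicture):
            picture_counter += 1
            paths.append(output_dir / f"{doc_filename}-picture-{picture_counter}.png")
        # Any other element type (text, headers, ...) is skipped
    return paths
```

Separate counters keep the numbering stable per type, so the third figure is always `picture-3` regardless of how many tables precede it.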
- Initial Parsing (parse_raw.py):
  - Docling converts PDF → structured document
  - Markdown exported for text content
  - Falls back to Marker if Docling fails (text < 500 chars)
- Content Generation:
  - LLM uses markdown text to create poster sections
  - Maintains paper structure and key information
- Figure/Table Filtering (filter_image_table):
  - Analyzes extracted figures and tables
  - Calculates min/max display sizes based on scale ratios
  - LLM selects relevant visuals for poster
- Layout Planning:
  - Uses figure/table dimensions for space allocation
  - Considers aspect ratios for optimal placement
  - Integrates captions with visual elements
- Final Poster Assembly:
  - Places selected figures/tables in designated areas
  - Maintains visual quality through high-resolution extraction
  - Preserves caption-visual associations
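The fallback rule in the Initial Parsing step can be sketched as a length check on the extracted text. This is a simplified illustration under stated assumptions: the two parsers are passed in as plain callables, and the names `parse_with_fallback` and `MIN_TEXT_CHARS` are hypothetical rather than taken from parse_raw.py.

```python
from typing import Callable

MIN_TEXT_CHARS = 500  # threshold mentioned in the Initial Parsing step


def parse_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
) -> tuple[str, str]:
    """Return (parser_used, text), switching parsers when the primary text is too short."""
    text = primary()
    if len(text) < MIN_TEXT_CHARS:
        # Very short output is treated as a failed parse
        return "fallback", fallback()
    return "primary", text
```

Passing the parsers as callables means the fallback is only invoked when needed, which avoids running a second, potentially slow parse on every document.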
# Embedded images in markdown
conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)
# External image references
conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)
# HTML output
conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED)

These outputs are saved for:
- Debugging and verification
- Alternative processing pipelines
- Manual review and editing
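The practical difference between embedded and referenced images is whether the image bytes live inside the Markdown file or next to it. The sketch below shows the general idea with a base64 data URI versus a file path. It is an illustration only; the strings Docling actually writes may differ, and `markdown_image` and its parameters are hypothetical.

```python
import base64


def markdown_image(png_bytes: bytes, mode: str, path: str = "figure.png") -> str:
    """Build a Markdown image tag that either embeds the bytes or points to a file."""
    if mode == "embedded":
        # Self-contained: the image travels inside the Markdown as a data URI
        encoded = base64.b64encode(png_bytes).decode("ascii")
        return f"![Image](data:image/png;base64,{encoded})"
    if mode == "referenced":
        # Lightweight: the Markdown only stores a path to an external file
        return f"![Image]({path})"
    raise ValueError(f"unknown mode: {mode!r}")
```

Embedded output is easier to share as a single file, while referenced output keeps the Markdown small and lets the images be edited or replaced independently.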

Docling preserves the section hierarchy, which enables context-aware content extraction and maintains a logical flow for the poster narrative. Image extraction at 5× scale keeps visuals sharp, caption preservation keeps each visual's context, and the fallback to Marker makes parsing more robust.

Advantages:
- Accuracy: State-of-the-art models for layout and table understanding
- Completeness: Extracts all document components (text, tables, figures)
- Structure Preservation: Maintains document hierarchy and relationships
- Visual Quality: High-resolution extraction for poster-quality images
- Robustness: Handles complex layouts and multi-column formats

Limitations:
- Complex Formulas: Currently extracted as images, not LaTeX
- Handwritten Content: May not be accurately extracted
- Scanned PDFs: Relies on OCR quality (can be configured)
- Very Large Tables: May need manual adjustment for poster format

The Docling integration gives Paper2Poster structured access to a paper's text, tables, and figures, which lets it turn complex academic papers into well-structured posters while keeping the key content and visual elements.