Replies: 5 comments 10 replies
-
| A typical "Discussions" item, so let me convert this first. | 
-
| There is no (reliable / foolproof) way to identify headers or footers: it's all just text on the page. 
 So as soon as you are able to provide rules that allow filtering text in this way, removal as such is easy-peasy in PyMuPDF when using redaction annotations. | 
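A minimal sketch of the redaction approach mentioned above, assuming the simplest possible rule: headers and footers sit in fixed-height bands at the top and bottom of every page. The band heights (`top`, `bottom`) and the helper names `margin_bands` and `redact_bands` are hypothetical and only for illustration; `add_redact_annot` and `apply_redactions` are PyMuPDF's redaction calls. Real documents will likely need a smarter rule than fixed bands.

```python
def margin_bands(width, height, top=50, bottom=50):
    """Return (header_band, footer_band) as (x0, y0, x1, y1) tuples.

    PyMuPDF page coordinates start at the top-left corner, so the header
    band begins at y=0 and the footer band ends at y=height.
    The default band heights are assumed values, not recommendations.
    """
    header = (0, 0, width, top)
    footer = (0, height - bottom, width, height)
    return header, footer


def redact_bands(src_path, dst_path, top=50, bottom=50):
    """Remove everything inside the header/footer bands of each page."""
    import fitz  # PyMuPDF; imported here so margin_bands works without it

    doc = fitz.open(src_path)
    for page in doc:
        bands = margin_bands(page.rect.width, page.rect.height, top, bottom)
        for band in bands:
            page.add_redact_annot(fitz.Rect(band))  # mark the area
        page.apply_redactions()  # physically remove the marked content
    doc.save(dst_path)
    doc.close()
```

After `redact_bands("in.pdf", "out.pdf")` (file names are placeholders), `page.get_text()` on the saved file should no longer return text that was inside the bands.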
-
| Since different PDFs can have headers and footers at different positions and with different margins, hardcoded rule-based algorithms are not recommended. Here are two example results of categorizing blocks into headers/footers and body text, without any improvements: Note that there are only two clusters, because I take the cluster with the most points as the "body text" cluster and merge all the others into a single "headers/footers" cluster. Here are my scripts; you will need to fill in your real-world values in some places. 
 from collections import Counter
from sklearn.cluster import DBSCAN
import numpy as np
class PDFTextBlockCategorizer:
    def __init__(self, blocks):
        self.blocks = blocks
    def run(self):
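        # One feature vector per block: bounding box plus text length
        X = np.array(
            [(x0, y0, x1, y1, len(text)) for x0, y0, x1, y1, text in self.blocks]
        )
        dbscan = DBSCAN()
        dbscan.fit(X)
        labels = dbscan.labels_
        # np.unique also counts the noise label -1, if present
        self.n_clusters = len(np.unique(labels))
        label_counter = Counter(labels)
        # The largest group is treated as body text (0), everything else
        # is merged into a single headers/footers group (1)
        most_common_label = label_counter.most_common(1)[0][0]
        labels = [0 if label == most_common_label else 1 for label in labels]
        self.labels = labels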
        print(f"{self.n_clusters} clusters for {len(self.blocks)} blocks")
 and
 import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import fitz
from pathlib import Path
from itertools import islice
from utils.categorizer import PDFTextBlockCategorizer
class PDFExtractor:
    pdf_root = Path("...")  # fill in your PDF directory; must be a Path for "/" to work
    def __init__(self):
        pdf_filename = "***.pdf"  # fill in your PDF file name
        self.pdf_fullpath = self.pdf_root / pdf_filename
        self.pdf_doc = fitz.open(self.pdf_fullpath)
    def calc_rect_center(self, rect, reverse_y=False):
        if reverse_y:
            x0, y0, x1, y1 = rect[0], -rect[1], rect[2], -rect[3]
        else:
            x0, y0, x1, y1 = rect
        x_center = (x0 + x1) / 2
        y_center = (y0 + y1) / 2
        return (x_center, y_center)
    def extract_all_text_blocks(self):
        # * https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractBLOCKS
        rect_centers = []
        rects = []
        visual_label_texts = []
        categorize_vectors = []
        for page_idx, page in islice(enumerate(self.pdf_doc), len(self.pdf_doc)):
            blocks = page.get_text("blocks")
            page_cnt = page_idx + 1
            print(f"=== Start Page {page_cnt}: {len(blocks)} blocks ===")
            block_cnt = 0
            for block in blocks:
                block_rect = block[:4]  # (x0,y0,x1,y1)
                x0, y0, x1, y1 = block_rect
                rects.append(block_rect)
                block_text = block[4]
                block_num = block[5]
                # block_cnt += 1
                block_cnt = block_num + 1
                rect_center = self.calc_rect_center(block_rect, reverse_y=True)
                rect_centers.append(rect_center)
                # visual_label_text = f"{block_text.split()[-1]}({page_cnt}.{block_cnt})"
                visual_label_text = f"({page_cnt}.{block_cnt})"
                visual_label_texts.append(visual_label_text)
                block_type = "text" if block[6] == 0 else "image"
                print(f"Block: {page_cnt}.{block_cnt}")
                print(f"<{block_type}> {rect_center} - {block_rect}")
                print(block_text)
                categorize_vectors.append((*block_rect, block_text))
            print(f"=== End Page {page_cnt}: {len(blocks)} blocks ===\n")
        categorizer = PDFTextBlockCategorizer(categorize_vectors)
        categorizer.run()
        fig, ax = plt.subplots()
        colors = ["b", "r", "g", "c", "m", "y", "k"]
        for i, rect_center in enumerate(rect_centers):
            label_idx = categorizer.labels[i]
            color = colors[label_idx]
            x0, y0, x1, y1 = rects[i]
            rect = Rectangle((x0, -y0), x1 - x0, -y1 + y0, fill=False, edgecolor=color)
            ax.add_patch(rect)
            x, y = rect_center
            plt.scatter(x, y, color=color)
            plt.annotate(visual_label_texts[i], rect_center)
        plt.show()
    def run(self):
        self.extract_all_text_blocks()
if __name__ == "__main__":
    pdf_extractor = PDFExtractor()
    pdf_extractor.run()
 Run with:
 You should get results similar to mine. | 
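The scripts above only plot the two groups. As a possible next step, here is a small sketch of how the 0/1 labels could be used to keep just the body text. The helper name `body_text` and the sample blocks are made up for illustration; in practice the inputs would be the `categorize_vectors` list and `categorizer.labels` from the scripts above.

```python
def body_text(blocks, labels):
    """Join the text of all blocks labelled 0 ("body text").

    blocks: (x0, y0, x1, y1, text) tuples, as passed to the categorizer.
    labels: one label per block, 0 = body text, 1 = header/footer.
    """
    kept = [
        text.strip()
        for (*_, text), label in zip(blocks, labels)
        if label == 0
    ]
    return "\n".join(kept)


# Made-up sample data: one header, two body blocks, one footer
sample_blocks = [
    (72, 20, 300, 32, "Example Report\n"),
    (72, 100, 520, 180, "First body paragraph.\n"),
    (72, 200, 520, 280, "Second body paragraph.\n"),
    (280, 800, 320, 812, "1\n"),
]
sample_labels = [1, 0, 0, 1]

cleaned = body_text(sample_blocks, sample_labels)
print(cleaned)
```

Note that image blocks are also present in the `"blocks"` output, so filtering on the block type first may be worthwhile.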
-
| I've implemented an effective solution to remove headers/footers. You can check it out here: https://medium.com/@hussainshahbazkhawaja/paper-implementation-header-and-footer-extraction-by-page-association-3a499b2552ae | 
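For readers who want the general idea without following the link: the sketch below compares the first and last few lines of each page with the lines at the same position on neighbouring pages and flags those that are nearly identical once digits are masked. This is only an illustration of that idea, not the linked implementation; the function names, the number of lines checked (`n`) and the similarity threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Mask digit runs so that 'Page 3' and 'Page 12' compare as equal."""
    return re.sub(r"\d+", "@", line.strip())


def flag_headers_footers(pages, n=2, threshold=0.9):
    """Return a set of (page_index, line_index) pairs that look repeated.

    pages: list of pages, each a list of text lines in reading order.
    Offsets 0..n-1 address lines from the top, -n..-1 from the bottom.
    """
    flagged = set()
    offsets = list(range(n)) + list(range(-n, 0))
    for p, lines in enumerate(pages):
        for k in offsets:
            if not -len(lines) <= k < len(lines):
                continue  # page too short for this offset
            current = normalize(lines[k])
            if not current:
                continue  # ignore blank lines
            for q in (p - 1, p + 1):  # previous and next page
                if not 0 <= q < len(pages):
                    continue
                other = pages[q]
                if not -len(other) <= k < len(other):
                    continue
                ratio = SequenceMatcher(None, current, normalize(other[k])).ratio()
                if ratio >= threshold:
                    flagged.add((p, k % len(lines)))
                    break
    return flagged


def strip_flagged(pages, flagged):
    """Return the pages without the flagged lines."""
    return [
        [line for i, line in enumerate(lines) if (p, i) not in flagged]
        for p, lines in enumerate(pages)
    ]


# Made-up sample: a repeated title line and a numbered footer
sample_pages = [
    ["Quarterly Report", "Introduction", "Sales grew in spring.", "More detail follows.", "Page 1 of 3"],
    ["Quarterly Report", "Methods", "We surveyed many stores.", "Results were mixed.", "Page 2 of 3"],
    ["Quarterly Report", "Conclusion", "Growth should continue.", "Thanks for reading.", "Page 3 of 3"],
]
stripped = strip_flagged(sample_pages, flag_headers_footers(sample_pages))
```

With PyMuPDF, the per-page line lists could be built with `page.get_text().splitlines()`.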
-
| I am also working on a solution to define the text box boundaries excluding headers and footers: https://github.com/mirix/retrieval-augmented-generation/blob/main/test_hdbscan_fitz.py Tested on the following document: https://links.imagerelay.com/cdn/2958/ql/general-terms-and-conditions-sqbe-en A few hints are given for a more robust solution. It uses HDBSCAN but actually DBSCAN is probably better. | 


-
🤔 Is your feature request related to a problem? Please describe.
Most AI models are not trained on PDF data since parsing it is difficult. I'm working on a PDF parsing project that removes tables, charts, headers, etc., so that the output of extraction libraries like PyMuPDF improves significantly.
I solved table removal; I would love to solve header removal now.
💡 Describe the solution you'd like
Can we remove headers/footers on PDFs so the output of page.get_text() is cleaner?