Description
Recall that PDFPlumber extracts superscript text as part of the cell text, producing values like "Assignment 11".
tables = pdf.pages[1].extract_tables()
for table in tables:
    for row in table:
        print(row)
Applying a filter to the page programmatically gets rid of the superscript text.
def filter(obj):
    # Keep characters whose size is at least 7.0 (normal text).
    if obj["object_type"] == "char" and obj["size"] >= 7.0:
        return True
    # Keep every object that is not a character.
    elif obj["object_type"] != "char":
        return True
    # Smaller characters (the superscripts) are dropped.
    return False
When the object is a character, normal text has a size of about 7.200000099000022, whereas superscript text is only 6.000000082499923. Characters can therefore be excluded by their size.
Dictionary data structure of a "char":
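As a minimal sketch of how these dictionaries can be inspected, the snippet below tallies character sizes so that a threshold such as 7.0 can be chosen. The helper name `size_histogram` and the sample dictionaries are hypothetical; the samples carry only the `object_type`, `text`, and `size` keys, while real "char" dictionaries hold more.

```python
from collections import Counter


def size_histogram(chars, ndigits=1):
    """Count characters per font size, rounded to `ndigits` decimals."""
    return Counter(round(c["size"], ndigits) for c in chars)


# Hypothetical sample using the two sizes quoted in this issue.
sample_chars = [
    {"object_type": "char", "text": "A", "size": 7.200000099000022},
    {"object_type": "char", "text": "s", "size": 7.200000099000022},
    {"object_type": "char", "text": "1", "size": 6.000000082499923},
]

# Print each distinct size with the number of characters that use it.
for size, count in sorted(size_histogram(sample_chars).items()):
    print(size, count)
```

With a real document, the same helper could presumably be called as `size_histogram(pdf.pages[1].chars)`; the rare, smaller size is then a candidate for the superscripts.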
Now:
# print table
def filter(obj):
    # Keep normal-size characters and every non-character object.
    if obj["object_type"] == "char" and obj["size"] >= 7.0:
        return True
    elif obj["object_type"] != "char":
        return True
    return False
tables = pdf.pages[1].filter(filter).extract_tables()
for table in tables:
    for row in table:
        print(row)
With the filter applied, the extracted table no longer contains the superscript text.
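To illustrate the effect of the size rule without a PDF at hand, here is a small sketch that applies the same condition to a hand-built list of characters. The cell contents are an assumption: it supposes that "Assignment 11" is "Assignment 1" followed by a superscript "1". The name `keep_normal_size` is hypothetical.

```python
def keep_normal_size(obj, min_size=7.0):
    """Same rule as the filter above: drop chars smaller than min_size."""
    return obj["object_type"] != "char" or obj["size"] >= min_size


# Sizes quoted in this issue for normal and superscript characters.
NORMAL, SUPER = 7.200000099000022, 6.000000082499923

# Hypothetical cell: normal-size text followed by one superscript character.
cell_chars = [
    {"object_type": "char", "text": t, "size": NORMAL} for t in "Assignment 1"
]
cell_chars.append({"object_type": "char", "text": "1", "size": SUPER})

before = "".join(c["text"] for c in cell_chars)
after = "".join(c["text"] for c in cell_chars if keep_normal_size(c))

print(repr(before))  # 'Assignment 11'
print(repr(after))   # 'Assignment 1'
```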
This is not the only way to implement the filter; a condition on y0 and y1 might work too.
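One possible reading of the y0/y1 idea is to filter on character height, `y1 - y0`, instead of `size`. The sketch below assumes that the height tracks the font size in the document at hand, which has not been verified here; the name `make_height_filter`, the 7.0 threshold, and the sample coordinates are hypothetical.

```python
def make_height_filter(min_height):
    """Return a predicate that keeps non-char objects and tall-enough chars."""

    def keep(obj):
        if obj["object_type"] != "char":
            return True
        # Height of the character's bounding box.
        return (obj["y1"] - obj["y0"]) >= min_height

    return keep


# Hypothetical objects: a normal char, a shorter superscript, and a line.
sample_objects = [
    {"object_type": "char", "text": "1", "y0": 500.0, "y1": 507.2},
    {"object_type": "char", "text": "1", "y0": 503.0, "y1": 509.0},
    {"object_type": "line", "y0": 498.0, "y1": 498.0},
]

keep = make_height_filter(7.0)
kept = [o for o in sample_objects if keep(o)]
print(len(kept))  # the superscript is the only object dropped
```

If the assumption holds, the predicate could be passed to the page in the same way as the size-based one, for example `pdf.pages[1].filter(make_height_filter(7.0)).extract_tables()`.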