Rows from table in PDF return as one unbroken string (no individual cells) #28

@dmeekerpcg

With the original tabula-java, the cells of this table are split out correctly. With tabula-sharp this does not work: each row comes back as a single unbroken line, with no individual cells. Could this be because the table is non-uniform (different rows have different column counts)?

Example table (I cannot attach the PDF because it contains personal information):
[Image: TableExample]

I am using the latest version of PdfPig, but that did not help. My code is below. I may be getting the syntax wrong; I am just trying to iterate through each row and its cells.

    using (PdfDocument document = PdfDocument.Open(path, new ParsingOptions() { ClipPaths = false }))
    {
        ObjectExtractor oe = new ObjectExtractor(document);
        PageArea page = oe.Extract(Page);

        // detect candidate table zones
        SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
        var regions = detector.Detect(page);

        IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
        List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
        var table = tables[0];
        var rows = table.Rows;

        string result = "";
        string test = rows[0][0].GetText(); // <---- testing first cell
        Run.PrintLog("Test: " + test);

        foreach (var r in rows)
        {
            foreach (RectangularTextContainer txt in r)
            {
                result += txt.GetText() + "|"; // <---- for each cell (?)
            }
            result += System.Environment.NewLine;
        }
        Run.PrintLog("Tab result: " + result);
    }
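
To check whether each row really contains only one cell, the dump could also print the cell count per row. The following is only a minimal sketch: `RowDump` and `Format` are hypothetical names and are not part of tabula-sharp. The helper works on plain strings, so it assumes the cell text has already been read with `GetText()` as in the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration; not part of tabula-sharp or PdfPig.
public static class RowDump
{
    // Formats every row as "[cellCount] cell|cell|..." on its own line.
    public static string Format(IEnumerable<IEnumerable<string>> rows)
    {
        var sb = new StringBuilder();
        foreach (var row in rows)
        {
            // Materialise the row once so it can be counted and joined.
            var cells = row.ToList();
            sb.Append('[').Append(cells.Count).Append("] ");
            sb.AppendLine(string.Join("|", cells));
        }
        return sb.ToString();
    }
}
```

With the `rows` variable from the code above, this might be called as `Run.PrintLog(RowDump.Format(rows.Select(r => r.Select(c => c.GetText()))));`. If every line starts with `[1]`, the extraction returned a single cell per row, rather than the loop joining the cells incorrectly.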
