
doswal/isi-table-understanding


Table Understanding

Environment Setup

Set up the environment by referring to ISI's Table Understanding

Once that is done, a few packages need to be upgraded or downgraded for the model to run. One approach is to run pip check to find clashing or incompatible package versions. In particular, the following packages need to be installed at the stated versions (for Mac M1 / ARM64):

  • pydantic==1.7.4
  • numpy==1.19.0
  • thinc==8.0.7
  • blis==0.7.11
  • spacy==3.3.1
  • scikit-learn==0.23.2 (install with conda rather than pip; this version matches the pickle files of the pre-trained models)
  • pslpython==2.3.2
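
As a complement to pip check, the pinned versions above can be compared against what is actually installed. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper names installed_version and find_mismatches are illustrative and are not part of the repository.

```python
from importlib import metadata

# Pinned versions copied from the list above.
PINNED = {
    "pydantic": "1.7.4",
    "numpy": "1.19.0",
    "thinc": "8.0.7",
    "blis": "0.7.11",
    "spacy": "3.3.1",
    "scikit-learn": "0.23.2",
    "pslpython": "2.3.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package that does not match its pinned version.
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the stated version; it does not replace pip check, which also inspects dependency conflicts between packages.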

PSL module, ce_model:

  • Download the PSL module from this file, and ce_models from here
  • Place the PSL module, ce_model, and the input data files under the correct directories, as shown in the snapshot below (note: /tmp/ is generated at runtime)

[Images: two snapshots of the expected directory layout]

Infersent module:

  • Import or copy-paste the InferSent module from here into the tabular_cell_type_classification>src>helpers.py file

JSON/Txt File Extraction/Data Series per Block Printing:

Refer to the attached Jupyter Notebook. The notebook helps in:

  • collecting information such as latitude/longitude, reading URL files, and generating the required .xlsx files
  • reading the blocks

Running table-understanding pipelines

  • To run a pipeline: make run PIPELINE=<pipeline name>
  • Pipelines are defined in the ./pipelines package
  • Pipelines will use ./tmp as working directory to download model/data or generate outputs
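
When several tables need to be processed, the make invocation above can be wrapped in a small script. This is a minimal sketch, assuming make is on the PATH and the script is run against the repository root; build_command and run_pipeline are hypothetical helper names, and "my_pipeline" in the usage note is a placeholder, not a pipeline known to exist in ./pipelines.

```python
import subprocess


def build_command(pipeline_name):
    """Build the argument list for `make run PIPELINE=<pipeline name>`."""
    if not pipeline_name or any(ch.isspace() for ch in pipeline_name):
        raise ValueError("pipeline name must be a non-empty string without whitespace")
    return ["make", "run", f"PIPELINE={pipeline_name}"]


def run_pipeline(pipeline_name, repo_root="."):
    """Invoke make from the repository root and return its exit code."""
    completed = subprocess.run(build_command(pipeline_name), cwd=repo_root)
    return completed.returncode
```

For example, run_pipeline("my_pipeline") would execute make run PIPELINE=my_pipeline and return 0 on success.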

Inputs:

The inputs are generated by the Jupyter notebooks. However, a few tab-delimited data files need manual processing: extra whitespace in the Excel sheets has to be removed by hand, so manually delete the last empty column.

[Possible reason for failure: pandas reads trailing whitespace] [Possible fix: use a regex script to load the data]
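
The regex-based loading suggested above could look roughly like the following. It is a minimal sketch using only the standard library, not the script used in this repository; the function name clean_tab_delimited is hypothetical. It strips trailing whitespace (including trailing tabs) from each line before parsing, so that no empty last column is produced.

```python
import csv
import re

# Matches any run of whitespace at the end of a line, including trailing tabs.
TRAILING_WS = re.compile(r"\s+$")


def clean_tab_delimited(text):
    """Parse tab-delimited text into rows, dropping trailing whitespace."""
    # Strip trailing whitespace per line so no empty final column is created.
    lines = [TRAILING_WS.sub("", line) for line in text.splitlines()]
    rows = [
        [cell.strip() for cell in row]  # also trim padding inside each cell
        for row in csv.reader(lines, delimiter="\t")
        if row  # skip blank lines
    ]
    # Pad short rows so every row has the same number of columns.
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]
```

The resulting list of rows can then be handed to pandas, for example with the first row as column names, before writing the .xlsx file. Note that a column that is empty in every row at the right edge is removed by this approach, which is the intended effect here.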

Output:

The pipeline takes the file paths of the Excel files and outputs a colorized version of each file, along with a .yaml file.

The .yaml file contains data about the row_start (start, end) and column (start, end). The outputs are generally stored in the temporary folder "tmp". For that reason, the current outputs are also attached in the root>output folder.

Test & Results:

Log: 09/24/2024

Tested on the Alabama table: an accurate list of blocks, i.e. variables and observation data, is obtained.

For the MD98-2177 isotopes Khider11 table, the table understanding pipeline runs successfully. The variable annotations for the header are correct; however, the alignment annotations and edges for observation data do not work correctly.

Log: 09/25/2024

Tests on Khider2011, Bhattacharya2022, glaubke, and blue2019 ran successfully.

The module correctly understood the headers for all tables. Attributes and some observation data do not come out as expected.
