Table Understanding

Environment Setup

Set up the environment by referring to ISI's Table Understanding

Once that is done, a few packages need to be upgraded or downgraded for the model to run. One approach is to run pip check to find clashing, incompatible package versions. In particular, the following packages need to be installed at the stated versions [for Mac M1 / ARM64]:

pydantic==1.7.4
numpy==1.19.0
thinc==8.0.7
blis==0.7.11
spacy==3.3.1
scikit-learn==0.23.2 [use conda install instead of pip] [Reason: to match the pickle files of the pre-trained models]
pslpython==2.3.2
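
As a complement to pip check, the pins above can be compared against what is actually installed. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the names PINNED, installed_version and find_mismatches are illustrative and not part of the repository.

```python
from importlib import metadata

# Pins copied from the list above.
PINNED = {
    "pydantic": "1.7.4",
    "numpy": "1.19.0",
    "thinc": "8.0.7",
    "blis": "0.7.11",
    "spacy": "3.3.1",
    "scikit-learn": "0.23.2",
    "pslpython": "2.3.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

Running the script inside the activated environment prints one line per package that still needs to be upgraded or downgraded, and prints nothing when every pin is satisfied.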

PSL module, ce_model:

Download the PSL module from this file, and ce_models from here.
Place the PSL module, ce_model, and the input data files under the correct directories, as in the snapshot below. (Note: /tmp/ is generated at runtime.)

Infersent module:

Import or copy-paste the InferSent module from here into the tabular_cell_type_classification>src>helpers.py file.

JSON/Txt File Extraction/Data Series per Block Printing:

Refer to the attached Jupyter Notebook.

The notebook helps with:

collecting information such as latitude/longitude, reading URL fields, and generating the required .xlsx files
reading the blocks

Running table-understanding pipelines

To run a pipeline: make run PIPELINE=<pipeline name>
Pipelines are defined in the ./pipelines package
Pipelines will use ./tmp as working directory to download model/data or generate outputs
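
When several pipelines have to be run in sequence, the make invocation above can be wrapped in a small script. This is only a sketch: build_command and run_pipeline are hypothetical helper names, and the pipeline name must be one that actually exists in the ./pipelines package.

```python
import subprocess


def build_command(pipeline):
    """Build the argument list for `make run PIPELINE=<pipeline name>`."""
    if not pipeline or any(ch.isspace() for ch in pipeline):
        raise ValueError("pipeline name must be non-empty and contain no whitespace")
    return ["make", "run", f"PIPELINE={pipeline}"]


def run_pipeline(pipeline, repo_dir="."):
    """Run one pipeline from the repository root; raises if make exits non-zero."""
    return subprocess.run(build_command(pipeline), cwd=repo_dir, check=True)
```

Calling run_pipeline with a pipeline name and the path to the repository checkout is equivalent to typing the make command in that directory.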

Inputs:

The inputs are generated by the Jupyter notebooks. However, a few tab-delimited data files need manual processing: extra whitespace in the Excel sheets has to be removed by hand, so manually delete the last empty column.

[Possible reason for failure: pandas reads the trailing whitespace] [Possible fix: use a regex script to load the data]
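
The manual cleanup described above could be automated along the following lines. This is a sketch rather than the script used in this repository: it relies on the standard csv module instead of a regex, and the function name read_tab_delimited is illustrative. It strips whitespace from every cell, skips blank lines, and drops trailing columns that are empty in every row.

```python
import csv
import io


def read_tab_delimited(text):
    """Parse tab-delimited text and drop all-empty trailing columns."""
    # Strip whitespace inside each cell and skip rows with no content at all.
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(io.StringIO(text), delimiter="\t")
        if any(cell.strip() for cell in row)
    ]
    # Pad ragged rows so every row has the same number of columns.
    width = max((len(row) for row in rows), default=0)
    rows = [row + [""] * (width - len(row)) for row in rows]
    # Remove trailing columns that are empty in every row.
    while width and all(row[width - 1] == "" for row in rows):
        width -= 1
    return [row[:width] for row in rows]
```

A trailing column is removed only when it is empty in every row, so a genuinely missing value in the last data column of a single row is kept as an empty string.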

Output:

The pipeline takes the file paths of the Excel files and outputs a colorized version of each, along with a .yaml file.

This file contains data about the row_start (start, end) and column (start, end). The outputs are generally stored in the temporary folder "tmp", so the current outputs are provided in the root>output folder.

Test & Results:

Log: 09/24/2024

Tested on the Alabama table: an accurate list of blocks, i.e. variables and observation data, is obtained.

For the MD98-2177 isotopes Khider11 table, the table understanding pipeline runs successfully. The variable annotations for the header are correct; however, the alignment annotations and edges for the observation data do not work correctly.

Log: 09/25/2024

Tests on Khider2011, Bhattacharya2022, glaubke, and blue2019 were successful.

The module correctly understood the headers for all tables. Attributes and some observation data do not come out as expected.

