Set up the environment by following ISI's Table Understanding instructions.
Once that is done, a few packages need to be upgraded or downgraded for the model to run. One approach is to run pip check to find clashing or incompatible package versions. Specifically, the following packages need to be installed at the stated versions [for mac_m1 or ARM64]:
- pydantic==1.7.4
- numpy==1.19.0
- thinc==8.0.7
- blis==0.7.11
- spacy==3.3.1
- scikit-learn==0.23.2 [use conda install instead of pip] [Reason: to match the pickle files of the pre-trained models]
- pslpython==2.3.2
- Download the PSL module from this file, and ce_models from here
- Place the PSL module, ce_model, and the input data files under the correct directories, as in the snapshot below {Note: /tmp/ is generated at runtime}
- Import or copy the infersent module from here into the tabular_cell_type_classification>src>helpers.py file
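As a complement to pip check, the pinned versions listed above can be compared against what is actually installed. The following is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); the function name find_mismatches and the PINNED dictionary are illustrative and not part of the Table Understanding code.

```python
from importlib import metadata

# Pinned versions copied from the package list above.
PINNED = {
    "pydantic": "1.7.4",
    "numpy": "1.19.0",
    "thinc": "8.0.7",
    "blis": "0.7.11",
    "spacy": "3.3.1",
    "scikit-learn": "0.23.2",
    "pslpython": "2.3.2",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed or None)} for every mismatch.

    get_version is injectable so the check can be exercised without
    touching the real environment (hypothetical helper, for illustration).
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Running the script inside the activated environment prints one line per package that is missing or at a different version; no output means every pin matches.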
Refer to the attached Jupyter Notebook.
- The notebook helps with:
  - collecting information such as latitude/longitude, reading URL files, and generating the required .xlsx files
  - reading the blocks
- To run a pipeline: make run PIPELINE=<pipeline name>
  - Pipelines are defined in the ./pipelines package
  - Pipelines will use ./tmp as the working directory to download model/data or generate outputs
The inputs are generated by the Jupyter notebooks. However, a few tab-delimited data files need manual processing: extra whitespace in the Excel sheets has to be removed by hand, so manually delete the last empty column.
[Possible reason for failure: pandas reads trailing whitespace] [Possible fix: use a regex script to load the data]
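The regex-based loading suggested above could look roughly like the sketch below. It uses only the standard library rather than pandas, strips trailing spaces and tabs from every line (which removes the empty last column), and pads short rows back to the header width. The function name read_tab_delimited is hypothetical, and the assumption that the first row is the header may not hold for every input file.

```python
import re

# Matches any run of spaces/tabs at the end of a line.
TRAILING_WS = re.compile(r"[ \t]+$")


def read_tab_delimited(text):
    """Parse tab-delimited text into a list of rows (lists of strings).

    Trailing whitespace is removed from each line, so a trailing empty
    column disappears. Blank lines are skipped. Rows shorter than the
    first row (assumed to be the header) are padded with empty strings,
    because stripping also removes legitimately empty trailing cells.
    """
    rows = []
    for line in text.splitlines():
        line = TRAILING_WS.sub("", line)
        if not line:
            continue
        rows.append([cell.strip() for cell in line.split("\t")])
    if not rows:
        return rows
    width = len(rows[0])
    return [row + [""] * (width - len(row)) for row in rows]
```

A file can be loaded by passing its contents, for example read_tab_delimited(open(path, encoding="utf-8").read()), and the cleaned rows can then be written out to .xlsx with whatever tool the notebook already uses.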
The pipeline takes the file paths of the Excel files and outputs a colorized version of each, along with a .yaml file.
The .yaml file contains data about the row_start (start, end) and column (start, end). Outputs are generally stored in the temporary folder "tmp", so the current outputs are attached in the root>output folder.
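Since the tmp folder is temporary, the generated files have to be copied somewhere permanent. The sketch below shows one way to do that; it assumes the outputs of interest are the .xlsx and .yaml files and that file names are unique across tmp subfolders (a later file with the same name overwrites an earlier one). The function name collect_outputs and the default suffixes are illustrative, not part of the pipeline.

```python
import shutil
from pathlib import Path


def collect_outputs(tmp_dir, output_dir, suffixes=(".xlsx", ".yaml")):
    """Copy files with the given suffixes from tmp_dir (recursively) into output_dir.

    Returns the list of copied target paths. Files are flattened into
    output_dir, so identical file names overwrite each other.
    """
    tmp_dir, output_dir = Path(tmp_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(tmp_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            target = output_dir / path.name
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied
```

For example, collect_outputs("tmp", "output") run from the repository root would gather the colorized spreadsheets and their .yaml files into the output folder while leaving downloaded models and other intermediate files in tmp.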
Log: 09/24/2024
Tested on the Alabama table; an accurate list of blocks, i.e. variables and observation data, is obtained.
For the MD98-2177 isotopes Khider11, the table understanding pipeline runs successfully. The variable annotations for the header are correct; however, the alignment annotations and edges for observation data do not work correctly.
Log: 09/25/2024
Tests on Khider2011, Bhattacharya2022, glaubke, blue2019 successful.
The module correctly understood the headers for all tables. Attributes and some observation data do not come out as expected.

