This repo contains the code and experimental results for the D2T Datasets Content Type Profiling.
Note: This repository complements the submitted paper. It will be deleted after the conference's anonymity period is over.
- Train a multi-label Content Type classifier, with and without Active Learning (AL):
  - Without AL:

    ```sh
    CUDA_VISIBLE_DEVICES=0 python3 src/al_main.py -dataset mlb -a_class -e_class
    ```

  - With AL:

    ```sh
    CUDA_VISIBLE_DEVICES=0 python3 src/al_main.py -dataset mlb -a_class -e_class -do_al -qs qbc -tk 25
    ```

- Plot Content Type distribution graphs for the different datasets:

  ```sh
  CUDA_VISIBLE_DEVICES=0 python3 src/plot_res.py -dataset mlb -a_class -e_class -type gold_ns
  ```

- Evaluate the performance of NLG systems' output texts on different metrics:

  ```sh
  sh run_eval.sh mlb acc 0
  ```
- Label Content Type classifier data, and accuracy errors in NLG systems' output texts:
  - Use Label-Studio to label the data; its config is saved in `labdata` (Docker needed):

    ```sh
    docker run -it -p 8080:8080 -v `pwd`/labdata:/label-studio/data heartexlabs/label-studio:latest
    ```
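To illustrate what "multi-label" means here (a sentence can carry several content types at once), below is a minimal one-vs-rest sketch in pure Python. The label names and toy sentences are made up for illustration; the repo's actual classifier is BERT-based (see `src/bert_utils.py`), not a perceptron.

```python
# One-vs-rest sketch of multi-label classification: one binary
# perceptron per content-type label. Toy data; labels are hypothetical.
from collections import defaultdict

LABELS = ["within-game", "out-of-game"]  # hypothetical content types

train = [
    ("he scored 30 points tonight", {"within-game"}),
    ("the team traded him last summer", {"out-of-game"}),
    ("he scored 30 points after the summer trade", {"within-game", "out-of-game"}),
]

def featurize(text):
    # Bag-of-words features; real systems use contextual embeddings.
    return set(text.split())

# Independent binary weights per label (the one-vs-rest decomposition).
weights = {lab: defaultdict(float) for lab in LABELS}
for _ in range(10):                       # a few perceptron epochs
    for text, gold in train:
        feats = featurize(text)
        for lab in LABELS:
            score = sum(weights[lab][f] for f in feats)
            if (score > 0) != (lab in gold):   # update on mistakes only
                delta = 1.0 if lab in gold else -1.0
                for f in feats:
                    weights[lab][f] += delta

def predict(text):
    feats = featurize(text)
    return {lab for lab in LABELS
            if sum(weights[lab][f] for f in feats) > 0}

print(predict("he scored 30 points tonight"))  # -> {'within-game'}
```

Each label gets its own independent decision, so a sentence may receive zero, one, or several content types.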
The repository is organised as follows:

- `sportsett/`: everything used for the SportSett data experiments
  - `sportsett/data/`: data/annotations for building the Content Type classifier
  - `sportsett/data/initial/`: data for training the generation systems
  - `sportsett/eval/`: data/annotations from the human evaluation of system-generated summaries
- `mlb/`: everything used for the MLB data experiments
- `sumtime/`: everything used for the SumTime data experiments
- `obituary/`: everything used for the obituary data experiments
- `labdata/`: stores the Docker data for the labelling app (database and settings)
- `eval/`: code for calculating evaluation results
- `src/`: the source code
  - `al_utils.py`: functions for active learning
  - `clf_utils.py`: functions for the classifier
  - `bert_utils.py`: a plain BERT classifier (fine-tuned on this data)
  - `merge_annotated.py`: merges the annotated JSON file with the already-annotated samples in the `train.tsv` file
  - `al_main.py`: main code for the classifier and active learning
  - `abs_sent.py`: functions for sentence abstraction (using PoS/NER tags)
  - `plot_res.py`: code for plotting cross-dataset graphs
  - `rw_plots.py`: code for plotting graphs specific to RotoWire and SportSett
- `run_first.sh`: script to run first; it creates the `top_{k}_unlabelled.txt` file
- `run_active_learning.sh`: script to run after `run_first.sh` has been executed once and the `top_{k}_unlabelled.txt` file has been created
- `plots.sh`: script to plot performance change as the data changes
Download the trained models from GDrive and save them in the respective dataset's folder.
The annotation and active learning loop:

1. Annotate some data and create the `train.tsv`/`valid.tsv` files in the `{dataset_name}/data/tsvs` folder.
2. Run the following to create the `top_{k}_unlabelled.txt` file:

   ```sh
   python3 src/al_main.py -qs qbc -tk 25 -dataset mlb -do_al -a_class
   ```

3. Take the `top_{k}_unlabelled.txt` file from the `{dataset_name}/data/txts` folder and annotate it.
4. Save the annotations in JSON format in the `{dataset_name}/data/jsons` folder, with the name `annotations.json`.
5. Run the following to merge the new annotations with the existing ones in the `{dataset_name}/data/tsvs/train.tsv` file:

   ```sh
   python3 src/merge_annotated.py -dataset mlb -not_first_run
   ```

6. Run `src/al_main.py` again to retrain the models on the extended data and create a new `top_{k}_unlabelled.txt` file:

   ```sh
   python3 src/al_main.py -qs qbc -tk 25 -dataset mlb -do_al -a_class
   ```

7. Repeat steps 3 to 6 as needed.
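The `-qs qbc` flag selects query-by-committee sampling and `-tk 25` the number of samples to surface. As a rough illustration of the idea (the function names and the toy committee below are ours, not the repo's API; see `src/al_utils.py` for the actual implementation), a committee's disagreement on each unlabelled sample can be scored with vote entropy, and the most contested samples queried first:

```python
# Sketch of query-by-committee sample selection via vote entropy.
# Names and data are illustrative, not the repo's API.
import math

def vote_entropy(votes, committee_size):
    """Disagreement over one sample; `votes` maps label -> number of
    committee members that predicted that label."""
    ent = 0.0
    for count in votes.values():
        p = count / committee_size
        if p > 0:
            ent -= p * math.log(p)
    return ent

def top_k_unlabelled(pool_votes, committee_size, k):
    """Rank unlabelled samples by committee disagreement, most contested first."""
    ranked = sorted(pool_votes.items(),
                    key=lambda kv: vote_entropy(kv[1], committee_size),
                    reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]

# Toy pool: 3 committee members voting on 3 unlabelled sentences.
pool = {
    "s1": {"A": 3},                  # full agreement  -> entropy 0
    "s2": {"A": 2, "B": 1},          # some disagreement
    "s3": {"A": 1, "B": 1, "C": 1},  # maximal disagreement
}
print(top_k_unlabelled(pool, committee_size=3, k=2))  # -> ['s3', 's2']
```

Samples the committee agrees on contribute little new information, so the top-k most contested ones go into `top_{k}_unlabelled.txt` for annotation.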
 
In terms of what files to run, and in what order:

1. `sh run_first.sh`
2. Label the samples from the unlabelled pool using the Label-Studio app (specifically, label the samples in the `data/txts/top_{k}_unlabelled.txt` file, where `k` is the `TOP_K` in `src/main.py`)
3. `sh run_active_learning.sh`
4. Repeat 2 & 3 until you have labelled all the samples or reached the desired performance
 
- Make sure to run `pip install -r requirements.txt` before running the scripts.
1. Run `run_first.sh`. This will first train models on the test data, and then rank the samples from the unlabelled pool by uncertainty.
   - This will create models in `models/` and features in `ftrs/`.
   - A new file, `top_{k}_unlabelled.txt`, will be created in `data/txts`, containing the top {k} samples from the unlabelled pool.
2. Label the samples from the unlabelled pool (`data/txts/top_{k}_unlabelled.txt`).
   - Save the annotated JSON file as `data/json/annotated.json`.
3. Merge the newly annotated and the existing annotated samples, and repeat the process from 1-3.
   - This can be done with `run_active_learning.sh`.
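The merge step can be sketched as follows. This is a stdlib-only illustration: the column names (`text`, `labels`), the `|`-joined label format, and the shape of the exported JSON are assumptions for the sketch, not the repo's schema; `src/merge_annotated.py` holds the actual logic.

```python
# Sketch: fold newly annotated samples into the existing train.tsv content.
# Column names and JSON shape are illustrative assumptions.
import csv
import io
import json

def merge_annotations(train_tsv_text, annotated_json_text):
    rows = list(csv.DictReader(io.StringIO(train_tsv_text), delimiter="\t"))
    seen = {row["text"] for row in rows}              # skip already-labelled sentences
    for task in json.loads(annotated_json_text):      # one task per annotated sentence
        text, labels = task["text"], task["labels"]
        if text not in seen:
            rows.append({"text": text, "labels": "|".join(labels)})
            seen.add(text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["text", "labels"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Toy usage:
train = "text\tlabels\nold sentence\tA\n"
ann = json.dumps([{"text": "new sentence", "labels": ["A", "B"]}])
print(merge_annotations(train, ann))
```

Here duplicates are detected by sentence text; the real script additionally distinguishes the first run from later ones via the `-not_first_run` flag.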
 
We use Label-Studio for labelling the messages. For that, you need Docker to be installed.

1. Install Docker and start the engine.
2. Run the following command to start the app:

   ```sh
   docker run -it -p 8080:8080 -v `pwd`/labdata:/label-studio/data heartexlabs/label-studio:latest
   ```

3. Go to `http://localhost:8080` and log in with the following credentials:
   - Email: `nlg.ct`
   - Password: `nlg.ct12345`
4. If no data is present, you will need to upload the data.
   - Follow instructions 5-7 if no file has been uploaded yet.
   - If a file is already uploaded and you need to upload the data again, follow instructions 8-9.
5. Upload the unlabelled data (the `./data/txts/top_{k}_unlabelled.txt` file) by following these steps:
   1. Click the Go to import button.
   2. Either click Upload Files, or drag and drop the file into the Drop Area.
   3. Select the List of tasks option for the Treat CSV/TSV as question.
   4. Click the Import button in the top-right corner.
6. Now you can start labelling the data.
   - Click either the Label All Tasks button or any of the rows.
   - You will see the sentence to be labelled, along with the possible labels.
   - Select the labels (more than one can be selected) and click the Submit button.
7. After labelling, click the Export button. Select the JSON option and click Export.
   - This will download the file to your local machine (at your preferred download location).
   - Save the file as `data/jsons/annotated.json`. Make sure to remove any existing file from the `data/jsons` location.
8. If some data is already uploaded, you will need to delete the existing data and upload the new data. To delete the existing data, follow these steps:
   1. Click the box in front of ID in the top-left corner. This will select all the rows.
   2. Click the {k} Tasks button above ID, then click the Delete tasks button in the drop-down menu that appears.
9. After deleting the data, follow the instructions from 6-7 to start labelling the data.
 


