This repo contains the code and experimental results for the D2T Datasets Content Type Profiling.
Note: This repository complements the submitted paper. It will be deleted after the conference's anonymity period is over.
- Train a multi-label Content Type classifier, with and without Active Learning (AL):
  - Without AL:

    ```sh
    CUDA_VISIBLE_DEVICES=0 python3 src/al_main.py -dataset mlb -a_class -e_class
    ```

  - With AL:

    ```sh
    CUDA_VISIBLE_DEVICES=0 python3 src/al_main.py -dataset mlb -a_class -e_class -do_al -qs qbc -tk 25
    ```

- Plot Content Type distribution graphs for the different datasets:

  ```sh
  CUDA_VISIBLE_DEVICES=0 python3 src/plot_res.py -dataset mlb -a_class -e_class -type gold_ns
  ```

- Evaluate the performance of NLG systems' output texts on different metrics:

  ```sh
  sh run_eval.sh mlb acc 0
  ```
- Label Content Type classifier data, and accuracy errors in NLG systems' output texts:
  - Use Label-Studio to label the data; its config is saved in `labdata` (Docker needed):

    ```sh
    docker run -it -p 8080:8080 -v `pwd`/labdata:/label-studio/data heartexlabs/label-studio:latest
    ```
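To illustrate what "multi-label" means here (a sentence can carry several content types at once), below is a minimal one-vs-rest sketch in pure Python. The label names and toy sentences are made up for illustration; the repo's actual classifier is BERT-based (see `src/bert_utils.py`), not a perceptron.

```python
# One-vs-rest sketch of multi-label classification: one binary
# perceptron per content-type label. Toy data; labels are hypothetical.
from collections import defaultdict

LABELS = ["within-game", "out-of-game"]  # hypothetical content types

train = [
    ("he scored 30 points tonight", {"within-game"}),
    ("the team traded him last summer", {"out-of-game"}),
    ("he scored 30 points after the summer trade", {"within-game", "out-of-game"}),
]

def featurize(text):
    # Bag-of-words features; real systems use contextual embeddings.
    return set(text.split())

# Independent binary weights per label (the one-vs-rest decomposition).
weights = {lab: defaultdict(float) for lab in LABELS}
for _ in range(10):                       # a few perceptron epochs
    for text, gold in train:
        feats = featurize(text)
        for lab in LABELS:
            score = sum(weights[lab][f] for f in feats)
            if (score > 0) != (lab in gold):   # update on mistakes only
                delta = 1.0 if lab in gold else -1.0
                for f in feats:
                    weights[lab][f] += delta

def predict(text):
    feats = featurize(text)
    return {lab for lab in LABELS
            if sum(weights[lab][f] for f in feats) > 0}

print(predict("he scored 30 points tonight"))  # -> {'within-game'}
```

Each label gets its own independent decision, so a sentence may receive zero, one, or several content types.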
The repository is organised as follows:

- `sportsett/`: everything used for the SportSett data experiments
  - `sportsett/data/`: data/annotations for building the Content Type classifier
  - `sportsett/data/initial/`: data for training the generation systems
  - `sportsett/eval/`: data/annotations from the human evaluation of system-generated summaries
- `mlb/`: everything used for the MLB data experiments
- `sumtime/`: everything used for the SumTime data experiments
- `obituary/`: everything used for the obituary data experiments
- `labdata/`: stores the Docker data for the labelling app (database and settings)
- `eval/`: code for calculating evaluation results
- `src/`: the source code
  - `al_utils.py`: functions for active learning
  - `clf_utils.py`: functions for the classifier
  - `bert_utils.py`: a plain BERT classifier (fine-tuned on this data)
  - `merge_annotated.py`: merges the annotated JSON file with the already-annotated samples in the `train.tsv` file
  - `al_main.py`: main code for the classifier and active learning
  - `abs_sent.py`: functions for sentence abstraction (using PoS/NER tags)
  - `plot_res.py`: code for plotting cross-dataset graphs
  - `rw_plots.py`: code for plotting graphs specific to RotoWire and SportSett
- `run_first.sh`: script to run first; it creates the `top_{k}_unlabelled.txt` file
- `run_active_learning.sh`: script to run after `run_first.sh` has been executed once and the `top_{k}_unlabelled.txt` file has been created
- `plots.sh`: script to plot performance change as the data changes
Download the trained models from GDrive and save them in the respective dataset's folder.
The annotation and active learning loop:

1. Annotate some data and create the `train.tsv`/`valid.tsv` files in the `{dataset_name}/data/tsvs` folder.
2. Run the following to create the `top_{k}_unlabelled.txt` file:

   ```sh
   python3 src/al_main.py -qs qbc -tk 25 -dataset mlb -do_al -a_class
   ```

3. Take the `top_{k}_unlabelled.txt` file from the `{dataset_name}/data/txts` folder and annotate it.
4. Save the annotations in JSON format in the `{dataset_name}/data/jsons` folder, with the name `annotations.json`.
5. Run the following to merge the new annotations with the existing ones in the `{dataset_name}/data/tsvs/train.tsv` file:

   ```sh
   python3 src/merge_annotated.py -dataset mlb -not_first_run
   ```

6. Run `src/al_main.py` again to retrain the models on the extended data and create a new `top_{k}_unlabelled.txt` file:

   ```sh
   python3 src/al_main.py -qs qbc -tk 25 -dataset mlb -do_al -a_class
   ```

7. Repeat steps 3 to 6 as needed.
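The `-qs qbc` flag selects query-by-committee sampling and `-tk 25` the number of samples to surface. As a rough illustration of the idea (the function names and the toy committee below are ours, not the repo's API; see `src/al_utils.py` for the actual implementation), a committee's disagreement on each unlabelled sample can be scored with vote entropy, and the most contested samples queried first:

```python
# Sketch of query-by-committee sample selection via vote entropy.
# Names and data are illustrative, not the repo's API.
import math

def vote_entropy(votes, committee_size):
    """Disagreement over one sample; `votes` maps label -> number of
    committee members that predicted that label."""
    ent = 0.0
    for count in votes.values():
        p = count / committee_size
        if p > 0:
            ent -= p * math.log(p)
    return ent

def top_k_unlabelled(pool_votes, committee_size, k):
    """Rank unlabelled samples by committee disagreement, most contested first."""
    ranked = sorted(pool_votes.items(),
                    key=lambda kv: vote_entropy(kv[1], committee_size),
                    reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]

# Toy pool: 3 committee members voting on 3 unlabelled sentences.
pool = {
    "s1": {"A": 3},                  # full agreement  -> entropy 0
    "s2": {"A": 2, "B": 1},          # some disagreement
    "s3": {"A": 1, "B": 1, "C": 1},  # maximal disagreement
}
print(top_k_unlabelled(pool, committee_size=3, k=2))  # -> ['s3', 's2']
```

Samples the committee agrees on contribute little new information, so the top-k most contested ones go into `top_{k}_unlabelled.txt` for annotation.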
 
In terms of what files to run, and in what order:

1. `sh run_first.sh`
2. Label the samples from the unlabelled pool using the Label-Studio app (specifically, label the samples in the `data/txts/top_{k}_unlabelled.txt` file, where `k` is the `TOP_K` in `src/main.py`)
3. `sh run_active_learning.sh`
4. Repeat 2 & 3 until you have labelled all the samples or reached the desired performance
 
- Make sure to run `pip install -r requirements.txt` before running the scripts.
1. Run `run_first.sh`. This will first train models on the test data, and then rank the samples from the unlabelled pool by uncertainty.
   - This will create models in `models/` and features in `ftrs/`.
   - A new file, `top_{k}_unlabelled.txt`, will be created in `data/txts`, containing the top {k} samples from the unlabelled pool.
2. Label the samples from the unlabelled pool (`data/txts/top_{k}_unlabelled.txt`).
   - Save the annotated JSON file as `data/json/annotated.json`.
3. Merge the newly annotated and the existing annotated samples, and repeat the process from 1-3.
   - This can be done with `run_active_learning.sh`.
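The merge step can be sketched as follows. This is a stdlib-only illustration: the column names (`text`, `labels`), the `|`-joined label format, and the shape of the exported JSON are assumptions for the sketch, not the repo's schema; `src/merge_annotated.py` holds the actual logic.

```python
# Sketch: fold newly annotated samples into the existing train.tsv content.
# Column names and JSON shape are illustrative assumptions.
import csv
import io
import json

def merge_annotations(train_tsv_text, annotated_json_text):
    rows = list(csv.DictReader(io.StringIO(train_tsv_text), delimiter="\t"))
    seen = {row["text"] for row in rows}              # skip already-labelled sentences
    for task in json.loads(annotated_json_text):      # one task per annotated sentence
        text, labels = task["text"], task["labels"]
        if text not in seen:
            rows.append({"text": text, "labels": "|".join(labels)})
            seen.add(text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["text", "labels"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Toy usage:
train = "text\tlabels\nold sentence\tA\n"
ann = json.dumps([{"text": "new sentence", "labels": ["A", "B"]}])
print(merge_annotations(train, ann))
```

Here duplicates are detected by sentence text; the real script additionally distinguishes the first run from later ones via the `-not_first_run` flag.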
 
We use Label-Studio for labelling the messages. For that, you need Docker to be installed.

1. Install Docker and start the engine.
2. Run the following command to start the app:

   ```sh
   docker run -it -p 8080:8080 -v `pwd`/labdata:/label-studio/data heartexlabs/label-studio:latest
   ```

3. Go to `http://localhost:8080` and log in with the following credentials:
   - Email: `nlg.ct`
   - Password: `nlg.ct12345`
4. If no data is present, you will need to upload the data.
   - Follow instructions 5-7 if no file has been uploaded yet.
   - If a file is already uploaded and you need to upload the data again, follow instructions 8-9.
5. Upload the unlabelled data (the `./data/txts/top_{k}_unlabelled.txt` file) by following these steps:
   1. Click the Go to import button.
   2. Either click Upload Files, or drag and drop the file into the Drop Area.
   3. Select the List of tasks option for the Treat CSV/TSV as question.
   4. Click the Import button in the top-right corner.
6. Now you can start labelling the data.
   - Click either the Label All Tasks button or any of the rows.
   - You will see the sentence to be labelled, along with the possible labels.
   - Select the labels (more than one can be selected) and click the Submit button.
7. After labelling, click the Export button. Select the JSON option and click Export.
   - This will download the file to your local machine (at your preferred download location).
   - Save the file as `data/jsons/annotated.json`. Make sure to remove any existing file from the `data/jsons` location.
8. If some data is already uploaded, you will need to delete the existing data and upload the new data. To delete the existing data, follow these steps:
   1. Click the box in front of ID in the top-left corner. This will select all the rows.
   2. Click the {k} Tasks button above ID, then click the Delete tasks button in the drop-down menu that appears.
9. After deleting the data, follow the instructions from 6-7 to start labelling the data.
 


