Below is a (mostly complete) list of tasks that ExplainaBoard currently supports, along with examples of how to analyze different tasks. In particular, text classification is a good example to start with.
General notes:
- Click the link on the task name for more details, or when no link exists you can open the example data to see what the file format looks like.
- You can either analyze an existing dataset included in DataLab or use your own custom dataset. The directions below describe how to do both in most cases, but using DataLab has some advantages, such as allowing for easy calculation of training-set features and compatibility with ExplainaBoard online leaderboards. You can check the list of datasets supported in DataLab and add your dataset if it doesn't exist.
- All of the examples below will output a JSON report to standard out, which you can pipe to a file such as report.json for later use. Also, check out our visualization tools.
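As a minimal sketch of saving the report for later use, the snippet below wraps the redirection in a small helper. The `save_report` function is a hypothetical name introduced here for illustration and is not part of ExplainaBoard; the guarded call assumes the `explainaboard` CLI is installed and that you run it from the root of the repository.

```shell
# Hypothetical helper (not part of ExplainaBoard): run any command and
# save its standard output to the given report file.
save_report() {
  out_file="$1"
  shift
  "$@" > "$out_file"
}

# Guarded usage: does nothing when the explainaboard CLI is not installed.
if command -v explainaboard >/dev/null 2>&1; then
  save_report report.json explainaboard --task text-classification \
    --dataset sst2 \
    --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt
fi
```

Keeping the redirection in one place makes it easy to change the output location for every task at once.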
We welcome contributions of more tasks, or detailed documentation for tasks where the documentation does not yet exist! Please open an issue or file a PR.
- Text Classification
- Text Pair Classification
- Conditional Text Generation
- Language Modeling
- Named Entity Recognition
- Word Segmentation
- Chunking
- Extractive QA
- Multiple Choice QA
- Hybrid Table Text QA
- Aspect-based Sentiment Classification
- KG Link Tail Prediction
- Multiple choice Cloze
- Generative Cloze
- Grammatical Error Correction
- Tabular Classification
- Tabular Regression
- Argument Pair Extraction
- Argument Pair Identification
Text classification consists of classifying text into different categories, such as sentiment values or topics. The below example performs an analysis on the Stanford Sentiment Treebank, a set of sentiment tags over English reviews.
The below example loads the sst2 dataset from DataLab:
explainaboard --task text-classification --dataset sst2 --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt
The below example loads a dataset from an existing file:
explainaboard --task text-classification --custom-dataset-paths ./data/system_outputs/sst2/sst2-dataset.tsv --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt
Text pair classification is the classification of pairs of text, such as natural language inference or paraphrase detection. The example below concerns natural language inference, predicting whether a premise entails, contradicts, or is neutral with respect to a hypothesis, on the Stanford Natural Language Inference dataset.
The below example loads the snli dataset from DataLab:
explainaboard --task text-pair-classification --dataset snli --system-outputs ./data/system_outputs/snli/snli-roberta-output.txt
The below example loads a dataset from an existing file:
explainaboard --task text-pair-classification --custom-dataset-paths ./data/system_outputs/snli/snli-dataset.tsv --system-outputs ./data/system_outputs/snli/snli-roberta-output.txt
Conditional text generation concerns generation of one text based on other texts, including tasks like summarization and machine translation. The below example evaluates a summarization system on the CNN-daily mail dataset.
The below example loads a miniature version of the CNN-daily mail dataset (100 lines only) from an existing file:
explainaboard --task summarization --custom-dataset-paths ./data/system_outputs/cnndm/cnndm_mini-dataset.tsv --system-outputs ./data/system_outputs/cnndm/cnndm_mini-bart-output.txt --metrics rouge2 bart_score_en_ref
Note that this uses two different metrics separated by a space.
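To illustrate what "separated by a space" means at the shell level, here is a small sketch for POSIX shells such as sh and bash. The `count_args` helper is hypothetical and only reports how many arguments it receives; the guarded call assumes the `explainaboard` CLI is installed and the sample files from the repository are present.

```shell
# Metrics to compute, kept in one space-separated variable.
metrics="rouge2 bart_score_en_ref"

# Hypothetical helper: print the number of arguments it was given.
count_args() { echo "$#"; }

# Unquoted expansion is split on spaces, so each metric is its own argument.
n_split=$(count_args $metrics)
# Quoted expansion keeps the whole string together as a single argument.
n_quoted=$(count_args "$metrics")
echo "unquoted: $n_split argument(s), quoted: $n_quoted argument(s)"

# Guarded usage: does nothing when the explainaboard CLI is not installed.
if command -v explainaboard >/dev/null 2>&1; then
  explainaboard --task summarization \
    --custom-dataset-paths ./data/system_outputs/cnndm/cnndm_mini-dataset.tsv \
    --system-outputs ./data/system_outputs/cnndm/cnndm_mini-bart-output.txt \
    --metrics $metrics
fi
```

When the metric list lives in a variable, leaving the expansion unquoted is what passes the metrics as separate arguments, matching the command above.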
You could also load the cnn_dailymail dataset from DataLab.
Because the test set is large we don't include it directly in the explainaboard
repository, but you can get an example by downloading with wget:
wget -P ./data/system_outputs/cnndm/ https://storage.googleapis.com/inspired-public-data/explainaboard/task_data/summarization/cnndm-bart-output.txt
Then run the below command and it should work:
explainaboard --task summarization --dataset cnn_dailymail --system-outputs ./data/system_outputs/cnndm/cnndm-bart-output.txt --metrics rouge2
Language modeling is the task of predicting the probability for words in a text. You can analyze your language model outputs by inputting a file that has one log probability for each space-separated word.
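Before running an analysis, it can help to check that the output file really has one log probability per word. The sketch below does this on two toy files with made-up values. It assumes that each line of the system output lines up with the same line of the dataset and holds space-separated log probabilities; that layout is an assumption for illustration, so compare it against the example data before relying on it.

```shell
# Toy files standing in for a dataset and a system output (made-up values).
work_dir="$(mktemp -d)"
printf '%s\n' 'the cat sat' 'hello world' > "$work_dir/dataset.txt"
printf '%s\n' '-1.2 -3.4 -0.5' '-2.1 -4.0' > "$work_dir/output.txt"

# First pass stores the word count of each dataset line; second pass prints
# the line numbers whose log-probability count differs from that word count.
mismatches=$(awk 'NR == FNR { words[FNR] = NF; next }
                  NF != words[FNR] { print FNR }' \
             "$work_dir/dataset.txt" "$work_dir/output.txt")

if [ -z "$mismatches" ]; then
  echo "every line has one log probability per word"
else
  echo "mismatched lines: $mismatches"
fi
rm -rf "$work_dir"
```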
The below example analyzes the wikitext corpus:
explainaboard --task language-modeling --custom-dataset-paths ./data/system_outputs/wikitext/wikitext-dataset.txt --system-outputs ./data/system_outputs/wikitext-sys1-output.txt
Named entity recognition identifies entities such as people, organizations, or locations in text. The below examples demonstrate how you can perform such analysis on the CoNLL 2003 English named entity recognition dataset.
The below example loads the conll2003 NER dataset from DataLab:
explainaboard --task named-entity-recognition --dataset conll2003 --sub-dataset ner --system-outputs ./data/system_outputs/conll2003/conll2003-elmo-output.conll
Alternatively, you can reference a dataset file directly.
explainaboard --task named-entity-recognition --custom-dataset-paths ./data/system_outputs/conll2003/conll2003-dataset.conll --system-outputs ./data/system_outputs/conll2003/conll2003-elmo-output.conll
Word segmentation aims to segment texts without spaces between words.
The below example loads the msr dataset from DataLab:
explainaboard --task word-segmentation --dataset msr --system-outputs ./data/system_outputs/cws/test-msr-predictions.tsv
Note that the file test-msr-predictions.tsv can be downloaded here.
Alternatively, you can reference a dataset file directly.
explainaboard --task word-segmentation --custom-dataset-paths ./data/system_outputs/cws/test.tsv --system-outputs ./data/system_outputs/cws/prediction.tsv
Chunking divides text into syntactically related, non-overlapping groups of words.
The below example loads the conll00_chunk dataset from DataLab:
explainaboard --task chunking --dataset conll00_chunk --system-outputs ./data/system_outputs/chunking/test-conll00-predictions.tsv
Alternatively, you can reference a dataset file directly.
explainaboard --task chunking --custom-dataset-paths ./data/system_outputs/chunking/dataset-test-conll00.tsv --system-outputs ./data/system_outputs/chunking/test-conll00-predictions.tsv
Extractive QA attempts to answer queries by extracting segments from an evidence passage. The below example performs this extraction on the SQuAD dataset.
Below is an example of referencing the dataset directly.
explainaboard --task qa-extractive --custom-dataset-paths ./data/system_outputs/squad/squad_mini-dataset.json --system-outputs ./data/system_outputs/squad/squad_mini-example-output.json > report.json
The below example loads the squad dataset from DataLab. There is an open issue that prevents the specification of a dataset split, so this will not work at the moment, but we are working on it.
explainaboard --task qa-extractive --dataset squad --system-outputs MY_FILE > report.json
Hybrid table-text QA aims to answer a question based on a hybrid of tabular and textual context, e.g., Zhu et al. 2021.
The below example loads the tat_qa dataset from DataLab.
explainaboard --task qa-tat --output-file-type json --dataset tat_qa --system-outputs predictions_list.json > report.json
You can download the file predictions_list.json with:
wget -P ./ https://explainaboard.s3.amazonaws.com/system_outputs/qa_table_text_hybrid/predictions_list.json
Open-domain QA aims to answer a question in the form of natural language based on large-scale unstructured documents.
The following example shows how an open-domain QA system can be evaluated with detailed analyses using the ExplainaBoard CLI.
Using built-in datasets from DataLab:
explainaboard --task qa-open-domain --dataset natural_questions_comp_gen --system-outputs ./data/system_outputs/qa_open_domain/test.dpr.nq.txt > report.json
Multiple choice QA answers a question by choosing from multiple options. The following example demonstrates this on the metaphor QA dataset.
The below example loads the fig_qa dataset from DataLab.
explainaboard --task qa-multiple-choice --dataset fig_qa --system-outputs ./data/system_outputs/fig_qa/fig_qa-gptneo-output.json > report.json
And this is what it looks like with a custom dataset.
explainaboard --task qa-multiple-choice --custom-dataset-paths ./data/system_outputs/fig_qa/fig_qa-dataset.json --system-outputs ./data/system_outputs/fig_qa/fig_qa-gptneo-output.json > report.json
KG link tail prediction predicts the tail entity of missing links in knowledge graphs.
The below example loads the fb15k_237 dataset from DataLab.
wget https://datalab-hub.s3.amazonaws.com/predictions/test_distmult.json
explainaboard --task kg-link-tail-prediction --dataset fb15k_237 --sub-dataset origin --system-outputs test_distmult.json > log.res
Alternatively, you can reference a dataset file directly.
explainaboard --task kg-link-tail-prediction --custom-dataset-paths ./data/system_outputs/fb15k-237/data_mini.json --system-outputs ./data/system_outputs/fb15k-237/test-kg-prediction-no-user-defined-new.json > report.json
Aspect-based sentiment classification predicts the sentiment of a text with respect to a specific aspect.
This is an example with a custom dataset.
explainaboard --task aspect-based-sentiment-classification --custom-dataset-paths ./data/system_outputs/absa/absa-dataset.txt --system-outputs ./data/system_outputs/absa/absa-example-output.tsv > report.json
Multiple choice cloze fills in a blank based on multiple provided options.
This is an example using the dataset from DataLab:
explainaboard --task cloze-multiple-choice --dataset gaokao2018_np1 --sub-dataset cloze-multiple-choice --metrics CorrectScore --system-outputs ./integration_tests/artifacts/gaokao/rst_2018_quanguojuan1_cloze_choice.json > report.json
Generative cloze fills in a blank based on a hint.
This is an example using the dataset from DataLab:
explainaboard --task cloze-generative --dataset gaokao2018_np1 --sub-dataset cloze-hint --metrics CorrectScore --system-outputs ./integration_tests/artifacts/gaokao/rst_2018_quanguojuan1_cloze_hint.json > report.json
Grammatical error correction corrects errors in a text.
This is an example using the dataset from DataLab:
explainaboard --task grammatical-error-correction --dataset gaokao2018_np1 --sub-dataset writing-grammar --metrics SeqCorrectScore --system-outputs ./integration_tests/artifacts/gaokao/rst_2018_quanguojuan1_gec.json > report.json
Classification over tabular data takes in a set of features and predicts a class for each input. The example below is over the sst2 dataset used in text classification, but after the text has been vectorized into bag-of-words features. By default, the only feature that ExplainaBoard analyzes is the label feature, so you might want to specify other features to perform bucketing over using the metadata entry in the dataset json file, as is done in sst2-tabclass-dataset.json below.
The below example loads a dataset from an existing file:
explainaboard --task tabular-classification --custom-dataset-paths ./data/system_outputs/sst2_tabclass/sst2-tabclass-dataset.json --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt
Regression over tabular data is basically the same as tabular classification above, but the predicted outputs are continuous numbers instead of classes.
The below example loads a dataset from an existing file:
explainaboard --task tabular-regression --custom-dataset-paths ./data/system_outputs/sst2_tabreg/sst2-tabclass-dataset.json --system-outputs ./data/system_outputs/sst2_tabreg/sst2-tabreg-lstm-output.txt
Argument pair extraction aims to detect the argument pairs from each passage pair of review and rebuttal.
The below example loads the ape dataset from DataLab:
explainaboard --task argument-pair-extraction --dataset ape --system-outputs ./data/system_outputs/ape/ape_predictions.txt
Given an argument, argument pair identification aims to identify one matched argument from a list of arguments.
The example below loads the iapi dataset from DataLab:
explainaboard --task argument-pair-identification --dataset iapi --system-outputs data/system_outputs/iapi/predictions.txt > report.json
Meta-evaluation for NLG evaluates the reliability of automated metrics for general text generation tasks, such as text summarization.
The below example loads the meval_summeval dataset from DataLab:
explainaboard --task meta-evaluation-nlg --dataset meval_summeval --sub-dataset coherence --system-outputs ./data/system_outputs/summeval/sumeval_bart.json > report.json