If you find our data, code, or the paper useful, please cite the paper:
@article{chen2024mdcr,
title={MDCR: A Dataset for Multi-Document Conditional Reasoning},
author={Chen, Peter Baile and Zhang, Yi and Liu, Chunwei and Gupta, Sejal and Kim, Yoon and Cafarella, Michael},
journal={arXiv preprint arXiv:2406.11784},
year={2024}
}
- `docs.json` (list of dictionaries): a list of benefit documents crawled from the open web. Each dictionary includes the document `title` (string), `url` (string), and `contents` (list of strings).
- `parsed.json` (list of dictionaries): a list of conditions generated from sentences in `docs.json`. (Some sentences might not describe conditions and some might include multiple conditions. Details are described in Appendix A.1 of the paper.)
  - `conditions` (dictionary): a key of the form `c[int]` (1-indexed) refers to a condition mentioned in the document; the other keys, which start with `and`/`or`, refer to the AND/OR relationships among these conditions. `all (and)` is the expression that represents the entire set of conditions that need to be satisfied.
    - Each condition (`c[int]`) is mapped either (1) directly to some sentences in the document or (2) to part of a sentence in the document. In case (1), the condition is mapped to an integer or a list of integers that are the indices of sentences in the document. In case (2), the condition is mapped to a string (part of the original sentence). Case (2) exists because some sentences in the document are clearly composed of multiple conditions, so we split the sentence into its constituent self-contained conditions.
  - `mapping` (dictionary): the key is the `c_idx`, and the value is the indices of the sentences the condition is generated from. This is only needed for conditions that are mapped to strings in the `conditions` dictionary (explained above).
- `qs.json` (list of dictionaries): a list of user scenarios. Each dictionary includes the document indices `doc_ids` (list of integers), the specific conditions `given_conditions` (list of strings) and the boolean values of these conditions `given_values` used to generate the scenario, as well as the actual user scenario `scenario` (string).
  - We consider three types of questions, which are defined as `QUESTIONS` in `utils.py`: (1) Can I receive at least one of the following scholarship(s):, (2) Can I receive all the following scholarship(s):, (3) What is the maximum number of scholarship(s) I can receive out of the following scholarship(s):
- `rels.json` (dictionary): each key-value group describes the relationships among conditions between two documents. The key is a string concatenation of the two `doc_ids`, and the value is another dictionary whose key is the string concatenation of the two `c_idx` from the two documents respectively (from `parsed.json`) and whose `rel` field is the type of relationship, which can be conflicting, equivalent, including, or included.
  - For instance, `{ "1-2": { "c1-c2": { "rel": "conflicting" } } }` means that `c1` in `doc1` and `c2` in `doc2` are conflicting.
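
The file layout described above can be navigated roughly as follows. This is a minimal sketch on invented toy records rather than the real files, and it rests on several assumptions that are not stated in this README: that sentence indices are 0-based positions in `contents`, that a `mapping` value may be either a single integer or a list, and that a missing key in `rels.json` means no recorded relationship. The helper names `condition_text`, `source_sentences`, and `relation` are hypothetical and do not come from the repository. The `and`/`or` keys are left out because their exact format is not described here.

```python
# Toy records shaped like the fields described above. All values are invented
# for illustration and are NOT taken from the dataset.
doc = {
    "title": "Example Scholarship",
    "url": "https://example.org/scholarship",
    "contents": [
        "Applicants must be enrolled full-time.",
        "Applicants must be under 25 and reside in the state.",
    ],
}
parsed = {
    "conditions": {
        "c1": 0,                      # case (1): index of a sentence (assumed 0-based)
        "c2": "be under 25",          # case (2): part of a sentence
        "c3": "reside in the state",  # case (2): part of the same sentence
    },
    # Only string-valued conditions need an entry here.
    "mapping": {"c2": 1, "c3": 1},
}
rels = {"1-2": {"c1-c2": {"rel": "conflicting"}}}


def _as_list(value):
    """Normalize a single index or a list of indices to a list."""
    return [value] if isinstance(value, int) else list(value)


def condition_text(doc, parsed, c_idx):
    """Return the text of a condition, whichever way it is stored."""
    value = parsed["conditions"][c_idx]
    if isinstance(value, str):  # case (2): the condition text is stored directly
        return value
    # case (1): join the referenced sentences from the document
    return " ".join(doc["contents"][i] for i in _as_list(value))


def source_sentences(parsed, c_idx):
    """Return the indices of the sentences a condition was generated from."""
    value = parsed["conditions"][c_idx]
    if isinstance(value, str):  # string conditions are resolved through `mapping`
        value = parsed["mapping"][c_idx]
    return _as_list(value)


def relation(rels, doc_a, doc_b, c_a, c_b):
    """Look up the relationship type for a pair of conditions, or None."""
    entry = rels.get(f"{doc_a}-{doc_b}", {}).get(f"{c_a}-{c_b}")
    return entry["rel"] if entry else None


print(condition_text(doc, parsed, "c1"))
print(source_sentences(parsed, "c3"))
print(relation(rels, 1, 2, "c1", "c2"))
```

With the real data, the toy records would be replaced by the objects returned from `json.load` on the corresponding files; whether entries in `parsed.json` line up index-for-index with `docs.json` should be checked against the repository code.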
Ground truth answers are not provided, as they can be programmatically computed (and saved) by running the main function in `get_gold_ans.py`.
Your support in improving this dataset is greatly appreciated! If you have any questions or feedback, please send an email to peterbc@mit.edu.