The SciMDIX dataset comprises abstracts of scientific articles collected from publicly available sources and annotated with entities, relations, and aspects.
- Format: Structured CSV
- Domains: IT, Linguistics, Medicine, and Psychology
- Size: 206 documents in Kazakh, 206 documents in Russian
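
Since the data ships as structured CSV, a quick sanity check is to count rows per language and domain. The snippet below is a minimal sketch using only the Python standard library. The column names (`doc_id`, `language`, `domain`, `text`) and the sample rows are hypothetical placeholders, not the actual SciMDIX schema, so adjust them to match the headers in the files you download.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; the real SciMDIX column names
# and values may differ from these assumed ones.
SAMPLE_CSV = """doc_id,language,domain,text
1,kk,IT,placeholder abstract
2,ru,Medicine,placeholder abstract
3,ru,IT,placeholder abstract
"""


def count_rows(csv_file, columns=("language", "domain")):
    """Count CSV rows per combination of the given (assumed) columns."""
    reader = csv.DictReader(csv_file)
    return Counter(tuple(row[c] for c in columns) for row in reader)


# In-memory sample; for a real file, pass a handle opened with
# open(path, newline="", encoding="utf-8") instead.
counts = count_rows(io.StringIO(SAMPLE_CSV))
for key, n in sorted(counts.items()):
    print(key, n)
```

Opening real files with `newline=""` and an explicit `encoding="utf-8"` follows the `csv` module's recommendation and keeps Kazakh and Russian text intact across platforms.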
Transfer learning experiments on aspect extraction using the scimdix-aspects version of this dataset are available in the Aspect Extraction for Scientific Texts repository.
The abstracts contained within this dataset remain the intellectual property of their authors and publishers.
This dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
If you find this repository useful, please cite our papers:
- Batura T., Yerimbetova A., Mukazhanov N., Shvarts N., Sakenov B., Turdalyuly M. Information Extraction from Multi-Domain Scientific Documents: Methods and Insights. Applied Sciences. MDPI. 2025. V.15, 9086.
@article{scimdix2025,
author = {Batura, Tatiana and Yerimbetova, Aigerim and Mukazhanov, Nurzhan and Shvarts, Nikita and Sakenov, Bakzhan and Turdalyuly, Mussa},
title = {Information Extraction from Multi-Domain Scientific Documents: Methods and Insights},
journal = {Applied Sciences},
volume = {15},
year = {2025},
number = {16},
article-number = {9086},
publisher = {MDPI},
doi = {10.3390/app15169086}
}
- Shvarts N., Batura T., Mukazhanov N., Yerimbetova A., Turdalyuly M., Sakenov B. SciMDIX: A dataset for aspect extraction from multi-domain scientific documents in Kazakh and Russian. Procedia Computer Science. 2026. V. 275, pp.474-483.
@article{scimdix2026,
title={SciMDIX: A dataset for aspect extraction from multi-domain scientific documents in Kazakh and Russian},
author={Shvarts, Nikita and Batura, Tatiana and Mukazhanov, Nurzhan and Yerimbetova, Aigerim and Turdalyuly, Mussa and Sakenov, Bakzhan},
journal={Procedia Computer Science},
volume={275},
pages={474--483},
year={2026},
publisher={Elsevier},
doi = {10.1016/j.procs.2026.01.056}
}