tvbat/sci-text-miner-scimdix

SciMDIX dataset

The SciMDIX dataset comprises abstracts of scientific articles collected from publicly available sources and annotated with entities, relations, and aspects.

Key Features

  • Format: Structured CSV
  • Domains: IT, Linguistics, Medicine, and Psychology
  • Size: 206 documents in Kazakh, 206 documents in Russian
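Since the data ships as CSV, a quick sanity check is to read a file and count documents per language or domain. The snippet below is a minimal sketch using only the Python standard library. The file name `scimdix.csv` and the column names `lang` and `domain` are assumptions made for illustration and are not taken from this repository, so check the actual file names and header row first.

```python
import csv
from collections import Counter
from typing import Iterable, TextIO


def read_rows(handle: TextIO) -> list[dict[str, str]]:
    """Read a CSV stream into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


def count_by(rows: Iterable[dict[str, str]], column: str) -> Counter:
    """Count rows per distinct value of `column`."""
    return Counter(row[column] for row in rows)


# Example usage (hypothetical file and column names, adjust to the real data):
#
# with open("scimdix.csv", encoding="utf-8", newline="") as f:
#     rows = read_rows(f)
# print(count_by(rows, "lang"))
# print(count_by(rows, "domain"))
```

If the real files use a different delimiter or several files per language, pass the matching `delimiter` argument to `csv.DictReader` or call `read_rows` once per file.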

Related Experiments

Transfer learning experiments on aspect extraction using the scimdix-aspects version of this dataset are available in the Aspect Extraction for Scientific Texts repository.

Important Usage Notes

The abstracts contained within this dataset remain the intellectual property of their authors and publishers.

License

This dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

How to Cite

If you find this repository useful, please cite our papers:

  1. Batura T., Yerimbetova A., Mukazhanov N., Shvarts N., Sakenov B., Turdalyuly M. Information Extraction from Multi-Domain Scientific Documents: Methods and Insights. Applied Sciences. MDPI. 2025. V. 15, 9086.
@article{scimdix2025,
author = {Batura, Tatiana and Yerimbetova, Aigerim and Mukazhanov, Nurzhan and Shvarts, Nikita and Sakenov, Bakzhan and Turdalyuly, Mussa},
title = {Information Extraction from Multi-Domain Scientific Documents: Methods and Insights},
journal = {Applied Sciences},
volume = {15},
year = {2025},
number = {16},
article-number = {9086},
publisher = {MDPI},
doi = {10.3390/app15169086}
}
  2. Shvarts N., Batura T., Mukazhanov N., Yerimbetova A., Turdalyuly M., Sakenov B. SciMDIX: A dataset for aspect extraction from multi-domain scientific documents in Kazakh and Russian. Procedia Computer Science. 2026. V. 275, pp. 474-483.
@article{scimdix2026,
  title={SciMDIX: A dataset for aspect extraction from multi-domain scientific documents in Kazakh and Russian},
  author={Shvarts, Nikita and Batura, Tatiana and Mukazhanov, Nurzhan and Yerimbetova, Aigerim and Turdalyuly, Mussa and Sakenov, Bakzhan},
  journal={Procedia Computer Science},
  volume={275},
  pages={474--483},
  year={2026},
  publisher={Elsevier},
  doi={10.1016/j.procs.2026.01.056}
}

About

SciMDIX dataset for the paper “Information Extraction from Multi-Domain Scientific Documents: Methods and Insights”
