Source code of the publication: Classification of Veterinary Subjects in Medical Literature and Clinical Summaries
Note: Because of storage space limits, all trained or fine-tuned models and the dataset are stored on Kaggle. They can be loaded via the Kaggle API, as implemented in the code.
├── Source_code | Folder for the source code
│   ├── a_problem_analysis | Folder for problem analysis
│   │   ├── analysis_pmc_patients.ipynb | Notebook analyzing PMC patients data
│   │   └── problem_analysis.ipynb | Notebook for general problem analysis
│   ├── b_dataset_generation | Folder for dataset generation
│   │   ├── data | Folder containing data queried from PubMed
│   │   │   ├── human_medical_data | Folder for human medical data
│   │   │   │   ├── BMJ_data.xml | XML file for texts of journal BMJ
│   │   │   │   └── NEJM_data.xml | XML file for texts of journal NEJM
│   │   │   └── veterinary_medical_data | Folder for veterinary data
│   │   │       ├── Animals_data.xml | XML file for texts of journal Animals
│   │   │       └── ... | Other XML files for veterinary journal texts
│   │   ├── pubmed_queries | Folder for PubMed queries
│   │   │   ├── api_key.txt | API key for PubMed
│   │   │   ├── docker-compose.yaml | Docker Compose configuration
│   │   │   ├── Dockerfile | Dockerfile for PubMed setup
│   │   │   ├── edirect.py | Python script for EDirect setup
│   │   │   ├── edirect_installation.sh | Shell script for EDirect installation
│   │   │   ├── library_options.ipynb | Notebook for library options
│   │   │   ├── query.py | Python script for PubMed queries
│   │   │   └── requirements.txt | Requirements file for PubMed setup
│   │   └── dataset_generation.ipynb | Notebook for dataset generation
│   ├── c_model_training_fine_tuning | Folder for model training and fine-tuning
│   │   ├── plm_fine_tuning.ipynb | Notebook for PLM fine-tuning
│   │   └── svm_training.ipynb | Notebook for SVM training
│   ├── d_model_testing | Folder for model testing
│   │   ├── plm_testing.ipynb | Notebook for PLM testing
│   │   └── svm_testing.ipynb | Notebook for SVM testing
│   ├── e_model_explanation | Folder for model explanation
│   │   ├── rare_animals.ipynb | Notebook for analysis of texts containing rare animals
│   │   ├── svm_coefficients.ipynb | Notebook for SVM coefficients analysis
│   │   └── word_importance.ipynb | Notebook for word importance analysis
│   ├── f_others | Folder for other analyses
│   │   └── hardware_analysis.ipynb | Notebook for hardware analysis
│   └── z_utils | Folder for utility scripts and classes
│       ├── BERTClassifier.py | Python script for BERT classifier
│       ├── BlueBERTClassifier.py | Python script for BlueBERT classifier
│       ├── data_preparing.py | Python script for data preparation
│       ├── data_preprocessing.py | Python script for data preprocessing
│       ├── Dataset.py | Python script for dataset class
│       ├── DeBERTaClassifier.py | Python script for DeBERTa classifier
│       ├── evaluate.py | Python script for model evaluation
│       ├── global_constants.py | Python script for global constants
│       ├── lemmatize.py | Python script for text lemmatization
│       ├── loss_fn.py | Python script for loss function
│       ├── plot.py | Python script for plotting
│       ├── predict.py | Python script for prediction
│       ├── RoBERTaClassifier.py | Python script for RoBERTa classifier
│       ├── train.py | Python script for PLM training
│       └── XLNetClassifier.py | Python script for XLNet classifier
├── README.md | Readme file
├── requirements.txt | Requirements file
└── setup.py | Setup file
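The journal XML files under `data` can be turned into labeled texts in many ways; the following is only a minimal sketch, not the code used in `dataset_generation.ipynb`. It assumes the files follow the standard PubMed export layout (`PubmedArticle` elements containing `ArticleTitle` and `AbstractText`). The function names, the sample record, and the label convention (0 = human, 1 = veterinary) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample record in the assumed PubMed XML layout (not real data).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article>
    <ArticleTitle>Canine hip dysplasia</ArticleTitle>
    <Abstract><AbstractText>A study in <i>dogs</i>.</AbstractText></Abstract>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def extract_texts(xml_text):
    """Return one string per PubmedArticle: title followed by abstract text."""
    root = ET.fromstring(xml_text)
    texts = []
    for article in root.iter("PubmedArticle"):
        parts = []
        title = article.find(".//ArticleTitle")
        if title is not None:
            # itertext() also collects text nested in inline tags such as <i>.
            parts.append("".join(title.itertext()).strip())
        for abstract in article.iter("AbstractText"):
            parts.append("".join(abstract.itertext()).strip())
        text = " ".join(p for p in parts if p)
        if text:
            texts.append(text)
    return texts


def label_texts(texts, label):
    """Pair every text with a class label (assumed: 0 = human, 1 = veterinary)."""
    return [(text, label) for text in texts]


if __name__ == "__main__":
    for text, label in label_texts(extract_texts(SAMPLE), 1):
        print(label, text)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring` under the same layout assumption.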
Note: All results of the master's thesis were obtained in Kaggle Notebooks. Running the code locally may lead to deviating results.
To install, follow the steps below:
- Clone the repository:
git clone https://github.com/marcel8168/medtextclassification medtextclassification
- Create a virtual environment:
cd medtextclassification
python -m venv venv
# windows
venv\Scripts\activate.bat
# linux
source venv/bin/activate
- Install PyTorch for computations on CUDA (see How to install PyTorch). Select CUDA as the compute platform.
- Install the requirements:
pip install -e .
pip install -r requirements.txt
- To load the datasets and models used in this repository, first set your Kaggle username and API key (see How to get API key):
# linux
export KAGGLE_USERNAME=xxxxxxxxxxxxxx
export KAGGLE_KEY=xxxxxxxxxxxxxx
# windows
SET KAGGLE_USERNAME=xxxxxxxxxxxxxx
SET KAGGLE_KEY=xxxxxxxxxxxxxx
- Optional: To query PubMed, first copy your PubMed API key (see How to get API key) into api_key.txt
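Before running a notebook that downloads from Kaggle, it can help to confirm that both environment variables from the step above are actually set. The following is a minimal sketch; the helper name `missing_kaggle_vars` is hypothetical and not part of this repository or of the Kaggle API.

```python
import os

# Variable names taken from the installation step above.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_vars(env=None):
    """Return the required Kaggle variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_kaggle_vars()
    if missing:
        print("Missing Kaggle credentials:", ", ".join(missing))
    else:
        print("Kaggle credentials found.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.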
| Description | Link |
|---|---|
| BERT Model | Link |
| RoBERTa Model | Link |
| DeBERTa Model | Link |
| BlueBERT Model | Link |
| XLNet Model | Link |
| SVM Model | Link |
MIT License (Marcel Hiltner, 2024)