CTPICO wf - Workflow to perform datasets augmentation for PICO NER experiments

Nextflow workflow that automatically augments datasets for a specific set of named entities known as PICO (participants, intervention, control and outcomes), gathering and cross-referencing data from the ClinicalTrials (CTs) NCBI API and PubMed abstracts.

Summary

The workflow has two main modules:
- (1) Pre-processing: (1.1) acquire and process up-to-date raw data for completed CTs and check that the information required for entity validation is present; (1.2) gather and process the PubMed abstracts directly linked in the raw CT structured files.
- (2) Validation: for each PICO-domain entity, test the pairwise similarity between the annotations predicted by the NER Fair workflow and the items extracted for that entity from the clinical trial associated with the respective PubMed ID.

At the end, the workflow generates a table with the similarity scores for the annotations and a folder containing *.ann and *.txt files ready to serve as input to the NER Fair workflow for training a new model.
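As an illustration of the pairwise comparison in the validation module, here is a minimal sketch. The similarity measure actually used by the workflow is not stated in this readme, so the example assumes Python's difflib ratio as a stand-in; the function names and the sample data are hypothetical. The 0.8 threshold mirrors the default of the cutoff_consensus parameter.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio, an assumed stand-in)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(annotation: str, ct_items: list) -> tuple:
    """Return (item, score) for the clinical trial item closest to a predicted annotation."""
    scored = [(item, similarity(annotation, item)) for item in ct_items]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data: predicted 'intervention' annotations for one PubMed abstract
# and the intervention items listed in the linked clinical trial.
predicted = ["metformin 500 mg", "placebo tablets"]
ct_interventions = ["Metformin 500 mg", "Placebo"]

scores = {ann: best_match(ann, ct_interventions) for ann in predicted}

# Keep only annotations whose best score reaches the cutoff (0.8 is the documented default).
cutoff_consensus = 0.8
kept = [ann for ann, (_, score) in scores.items() if score >= cutoff_consensus]
print(kept)
```

Only the annotations in `kept` would go on to the final .ann and .txt files in this sketch; annotations with a weaker match are dropped.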

Requirements:

The required packages are listed in the exported conda environment file: environment.yml

Usage Instructions

Preparation:

git clone https://github.com/YasCoMa/pico_augmentation_workflow.git
cd pico_augmentation_workflow
conda env create --file environment.yml
The workflow requires the parameters below. You can edit them in main.nf or pass them on the command line when you start the execution:
- mode: Indicates the goal of the workflow: 'preprocess' or 'validation'. Only the steps that belong to the chosen mode are activated.
- dataDir: The directory where the execution logs will be stored together with the marker files to track the modules already executed.
- runningConfig: A json file with the configuration setup desired by the user. Each main key/parameter of the json file is explained below.
  - outpath: Path to the directory where the workflow modules will store the results. Mandatory. Example: ./validation_ctpico_out
  - config_hpc: Path to the json file containing the cluster access information to launch jobs. Check more details about it in the "HPC execution configuration" section. Example: ./config_hpc.json
  - [Only validation mode] path_prediction_result: Path to the prediction folder produced by applying a trained model in the nerfair workflow to the texts prepared by the pre-processing module of the CTPICO workflow. Mandatory for the validation step. Example: ./nerwf_out_out/expIdentifier-pretrainedModel-finetuned-ner/prediction/
  - [Only validation mode] cutoff_consensus: Cutoff used to choose the top-ranked annotations based on their similarity score with the clinical trial items. These top-ranked entries are used to generate the final .ann and .txt files that can be used to train a new model. Default value: 0.8.
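The running configuration described above can be generated programmatically. The sketch below assumes the four keys sit at the top level of the json file; the helper name build_running_config is hypothetical, and the example values are the ones given in the parameter descriptions. The files running_config.json and eskape_running_config.json mentioned below remain the authoritative examples.

```python
import json


def build_running_config(outpath, config_hpc=None,
                         path_prediction_result=None, cutoff_consensus=0.8):
    """Assemble a running configuration dict (flat key layout is an assumption)."""
    if not outpath:
        raise ValueError("outpath is mandatory")
    config = {"outpath": outpath}
    if config_hpc:
        config["config_hpc"] = config_hpc
    if path_prediction_result:
        # These two keys only matter in validation mode.
        config["path_prediction_result"] = path_prediction_result
        config["cutoff_consensus"] = cutoff_consensus
    return config


validation_config = build_running_config(
    outpath="./validation_ctpico_out",
    config_hpc="./config_hpc.json",
    path_prediction_result="./nerwf_out_out/expIdentifier-pretrainedModel-finetuned-ner/prediction/",
)
print(json.dumps(validation_config, indent=2))
```

Writing the printed text to a file and passing its path through the runningConfig parameter would complete the setup.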

Run workflow:

Examples of running configuration are shown in running_config.json and eskape_running_config.json
Modes of execution:
- Run Pre-processing:
  - nextflow run main.nf --dataDir /path/to/output_logs --runningConfig /path/to/validation_config.json --mode preprocess
  - This module generates a folder named "input_prediction", whose path can be set in the "input_prediction" parameter of the ner fair workflow configuration file. After running the ner fair workflow in prediction mode, look for the prediction results in the working directory you assigned through its "outpath" parameter. The path to that prediction folder must then be set in the "path_prediction_result" parameter of the CTPICO wf configuration file.
- Run Validation:
  - nextflow run main.nf --dataDir /path/to/output_logs --runningConfig /path/to/validation_config.json --mode validation
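If you launch the two modes from a script, the commands above can be assembled as follows. This is only a convenience sketch: the helper name nextflow_command is hypothetical, and the flags are exactly the ones shown in the commands above.

```python
import shlex

VALID_MODES = ("preprocess", "validation")


def nextflow_command(mode, data_dir, running_config):
    """Build the argument list for one workflow run."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return [
        "nextflow", "run", "main.nf",
        "--dataDir", data_dir,
        "--runningConfig", running_config,
        "--mode", mode,
    ]


cmd = nextflow_command("preprocess", "/path/to/output_logs",
                       "/path/to/validation_config.json")
print(shlex.join(cmd))
# To actually launch the run from the repository folder:
# import subprocess; subprocess.run(cmd, check=True)
```

Checking the mode before launching avoids starting a run that would not activate any step.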

Creating singularity image:

Alternatively, you can run the workflow inside a singularity image. The container folder holds the recipe (nermatchct.def) with the instructions to build the image, as well as the setup.sh script that builds the sandbox and the final image file. You only need to change the variable LOCAL_REPO in line 6 of setup.sh to the full path to the container folder. Then run it:

cd container/
./setup.sh

HPC execution configuration:

The workflow steps can be executed entirely in sequential mode. However, if you provide the path to an HPC configuration file through the "config_hpc" parameter of the running setup file, some sub-steps will distribute their tasks as array jobs, and the execution time may decrease significantly depending on the amount of input data.

This repository contains an example of such an HPC configuration file (config_hpc.json), which you have to adapt. All the parameters already structured in the example file are needed; you only have to replace the following terms in it:

_tmpDir_ : temporary directory path
_pathToImage_ : path to the .simg file corresponding to the singularity image that was built following the instructions of the previous section
_hpcEnv_ : type of hpc environment of the server (it expects either sge or slurm)
_hostIp_ : IP address of the server/head node
_sshUser_ : authentication user
_sshPass_ : authentication password
_queueId_ : queue identifier, if applicable
_nodeList_ : specific nodes to which you want to send the job to compute, if applicable
_partition_ : specific nodes partition, if applicable

For the terms that you want to leave empty, just remove the placeholder and leave the value as "''".

Building the singularity image is mandatory to use the workflow in a cluster.
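The placeholder replacement can also be scripted. The sketch below treats the configuration file as text and substitutes each term; the template shown here and its key names are hypothetical (the real structure is the one in config_hpc.json), and the choice of '' for unset terms follows the instruction above.

```python
import json

PLACEHOLDERS = [
    "_tmpDir_", "_pathToImage_", "_hpcEnv_", "_hostIp_", "_sshUser_",
    "_sshPass_", "_queueId_", "_nodeList_", "_partition_",
]


def fill_hpc_template(template_text, values, empty="''"):
    """Replace each placeholder term; terms without a value become `empty`."""
    if values.get("_hpcEnv_") not in ("sge", "slurm"):
        raise ValueError("_hpcEnv_ must be either 'sge' or 'slurm'")
    filled = template_text
    for token in PLACEHOLDERS:
        filled = filled.replace(token, values.get(token) or empty)
    json.loads(filled)  # fail early if the result is no longer valid JSON
    return filled


# Hypothetical template; key names are illustrative only.
template = '{"tmp_dir": "_tmpDir_", "env": "_hpcEnv_", "queue": "_queueId_"}'
filled = fill_hpc_template(template, {"_tmpDir_": "/scratch/tmp", "_hpcEnv_": "slurm"})
print(filled)
```

Validating the environment type up front reflects the fact that only sge and slurm are expected.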

Reference

A complete report applying this tool in experiments can be found at https://github.com/YasCoMa/ner-fair-workflow/tree/master/experiment_report

Bug Report

Please, use the Issues tab to report any bug.
