Pareto optimization of masked superstrings

Supplementary repository for the paper "Pareto optimization of masked superstrings improves compression of pan-genome k-mer sets". [TODO add bioRxiv link when possible]

This repository contains the implementation of Pareto optimization of masked superstrings for k-mer sets and the computation of a lower bound on the number of runs of ones in the mask (or the number of matchtigs).
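To make the second quantity concrete, here is a minimal sketch of counting runs of ones in a mask. It assumes the mask is given as a plain string of '0' and '1' characters; the function name count_runs_of_ones is hypothetical, and this is not the code used in the repository.

```python
from itertools import groupby


def count_runs_of_ones(mask: str) -> int:
    """Count maximal runs of consecutive '1' characters in a binary mask string."""
    # groupby yields one group per maximal block of equal characters,
    # so counting the groups keyed by '1' counts the runs of ones.
    return sum(1 for bit, _ in groupby(mask) if bit == "1")


# The mask below has three runs of ones: "11", "1", and "111".
example_runs = count_runs_of_ones("1100101110")
print(example_runs)  # 3
```

A mask with fewer runs of ones is generally easier to compress, which is why the number of runs is treated as an objective alongside the superstring length.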

It also contains Snakemake pipelines that reproduce the results from the paper. Conda is used to manage software versions and most of the dependencies.

In case of any questions, feel free to ask by email.

Howto

Prepare

To install dependencies from conda and build the locally provided software, use:

    ./prepare.sh

Next time, you only need to activate the conda environment with:

    conda activate ms-pareto-optimization

Run

To run an experiment, move to the corresponding directory (ex1-..., ex2-..., ex3-...) and run:

    snakemake -j <number_of_threads> <any_optional_parameters>

Experiments produce .tsv files with results and plots (TODO link new R plotting scripts to work from pipeline instead of manually).

Modify

To tweak an experimental setup, modify the Snakefile in the corresponding directory. Constants defined at the top of each Snakefile determine which datasets, values of k, run penalties, and computation or compression methods are used.
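As an illustration of how such constants typically drive a pipeline, here is a small sketch in Python (Snakefiles use Python syntax). All names and values (DATASETS, KS, RUN_PENALTIES, "example-pangenome") are hypothetical placeholders; check the actual Snakefile for the real ones.

```python
from itertools import product

# Hypothetical constants; the real Snakefiles may use different names and values.
DATASETS = ["example-pangenome"]
KS = [15, 23, 31]
RUN_PENALTIES = [0, 1, 2, 4]

# Every (dataset, k, run penalty) combination corresponds to one computation.
CONFIGURATIONS = list(product(DATASETS, KS, RUN_PENALTIES))
print(len(CONFIGURATIONS))  # 12
```

Shrinking any of these lists reduces the number of computations, which is a convenient way to try a pipeline on a small setup first.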

Project structure

Locally installed programs and snakefiles are stored in the tools directory.

Other directory names can be modified in tools/project_structure.smk if needed. Default names are:

Datasets are downloaded into the data directory.
Computed superstring representations are stored in the computed directory (it is recommended to turn off file search indexing for this directory).
Results are stored in the respective numbered directories (ex1-..., ex2-..., ...).

Datasets

If you want to use the pipeline with custom datasets, you can modify datasets.txt. Add the dataset name and URL; the pipeline downloads datasets automatically and can handle both xz-compressed and uncompressed FASTA files.

In case you need to support other compression formats, the simplest way is to modify the download_data rule in tools/download.smk.
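The idea behind supporting another format can be sketched as a dispatch on the file extension. The example below is a standalone Python illustration, not the contents of tools/download.smk; the function name open_fasta and the gzip branch are assumptions added for the example.

```python
import gzip
import lzma
import tempfile
from pathlib import Path


def open_fasta(path):
    """Open a FASTA file as text, choosing a decompressor from the file extension."""
    suffix = Path(path).suffix
    if suffix == ".xz":
        return lzma.open(path, "rt")
    if suffix == ".gz":  # assumed extra format, shown only as an example
        return gzip.open(path, "rt")
    return open(path, "r")  # uncompressed FASTA


# Small self-check: write the same record in three formats and read it back.
record = ">seq1\nACGT\n"
with tempfile.TemporaryDirectory() as tmp:
    writers = {"a.fa": open, "a.fa.xz": lzma.open, "a.fa.gz": gzip.open}
    contents = []
    for name, opener in writers.items():
        target = Path(tmp) / name
        with opener(target, "wt") as handle:
            handle.write(record)
        with open_fasta(target) as handle:
            contents.append(handle.read())
```

Adding a further format then amounts to one more branch that maps an extension to the matching decompressor.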

Implementation

Details about the implementation are provided in a separate README.

The work was implemented inside a fork of KmerCamel🐫 and then copied over to another fork, which results in a weird commit history with most of the relevant changes in the first commit made by @Jajopi (d86ccf1a6b8007a6cb8680bb46f47a91ef058beb). The main concept (Pareto optimization) was internally called Joint optimization for a long time.
