Skip to content

Conversation

@ar0ch
Copy link
Contributor

@ar0ch ar0ch commented Nov 25, 2019

I'm not sure if there's any interest in this PR but I use TreeToReads quite a bit to simulate data for courses I teach and try and incorporate additional errors and variance to make read sets more realistic. This PR adds support for all the relevant ART parameters for more control over the simulation (and adding "randomness"). This also adds supports for threads (via the threads config item) and pigz, if available, for compression. All the updates are backwards compatible, so existing TTR config files will work the exact same way.

New or updated config options:

threads: Number of threads for read simulation, if not set or set to {0,1} only 1 thread is used
read_length: Now supports comma separated lists of read length (e.g. 150,250) to generate reads with variable lengths

Fragment length can be randomized using the frag_{low,high,step} options
frag_low : Start of random frag length range
frag_high: End of random frag length range
frag_step: Step size for binning frag length (e.g. 5 = count from low to high by 5's)

Fragment length standard deviation can be randomized using the stdev_frag_{low,high,step} options
stdev_frag_low : Start of random frag length stdev range
stdev_frag_high: End of random frag length stdev range
stdev_frag_step: Step size for binning frag length stdev (e.g. 5 = count from low to high by 5's)

Average genomic read coverage. Can be uniform [coverage] across dataset or randomly distributed between [coverage_low, coverage_high].
coverage = 20
coverage_low: Start of random coverage range
coverage_high: End of random coverage range
coverage_step: Step size for binning random coverage

Average QS2 bias to introduce. Can be 0 (no value provided) across dataset or randomly distributed between [qs2_low, qs2_high]
qs2_low: Start of QS2 bias range
qs2_high: End of QS2 bias range
qs2_step: Step size for binning QS2 bias

Platform read error profile used to generate reads. Supports the following (between read length), and defaults to MiSeq v3 to allow reads up to 250bp. If read lengths are provided that are greater than the profile(s) select support, they'll be truncated to the maxmin of the profile(s) selected

GA1 - GenomeAnalyzer I (1-44bp), GA2 - GenomeAnalyzer II (1-75bp)
HS10 - HiSeq 1000 (1-100bp), HS20 - HiSeq 2000 1-100bp), HS25 - HiSeq 2500 (1-150bp)
HSXn - HiSeqX PCR free (1-150bp), HSXt - HiSeqX TruSeq (1-150bp),  MinS - MiniSeq TruSeq (1-50bp)
MSv1 - MiSeq v1 (1-250bp), MSv3 - MiSeq v3 (1-250bp), NS50 - NextSeq500 v2 (1-75bp)

@snacktavish
Copy link
Owner

Thanks so much! It makes me happy to hear you are using it, and I appreciate the PR. It'll take a look over the next few days, and merge soon!

@snacktavish
Copy link
Owner

snacktavish commented Dec 5, 2019

I started taking a look at this - but the example and tests aren't running.
When I clone your version and run the example I get an error.

$python treetoreads.py example.config
Random seed is 2860921143519957845
Running TreetoReads using configuration file example.config
Arguments read
Using 10 threads for read simulation
Number of variable sites is 20
clustering proportion is 0.25
exponential_mean is 2
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Invalid read profile provided, M, setting to MSv3
Traceback (most recent call last):
  File "treetoreads.py", line 1091, in <module>
    ttr = TreeToReads(configfi=args.config_file, main=1)
  File "treetoreads.py", line 44, in __init__
    self._check_args()
  File "treetoreads.py", line 330, in _check_args
    self.argdict['readProfile'][self.argdict['readProfile'].index(profile)] = 'MSv3'
TypeError: 'str' object does not support item assignment

I can dig into after the semester is over - but in the meantime, any ideas?

@ar0ch
Copy link
Contributor Author

ar0ch commented Dec 5, 2019

Alright, this should fix that error. All the ART params that are/can be randomized have to be stored in lists and I was failing to do that when no readProfile config item was provided.

@snacktavish
Copy link
Owner

snacktavish commented Jan 18, 2020

Thanks for the update and sorry for my super slow replies! I'm still getting an error running the example.config

Simulating FASTQ readsTraceback (most recent call last):
  File "treetoreads.py", line 1088, in <module>
    ttr = TreeToReads(configfi=args.config_file, main=1)
  File "treetoreads.py", line 52, in __init__
    self.run_art()
  File "treetoreads.py", line 932, in run_art
    with Pool(processes=self.threads) as pool:
AttributeError: __exit__

Potentially because I am running an older version of Art?

AR`T_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)

@ar0ch
Copy link
Contributor Author

ar0ch commented Jan 20, 2020

It's no problem, we're all busy! By chance are you using Python 2.x or an early 3.x? As far as I'm aware multiprocessing.Pool doesn't support context management (with ... statements) pre-Python 3.4 (or there abouts). If you want to maintain Py2.x compatibility, I can rewrite the threading to be compatible.

@snacktavish
Copy link
Owner

Ah right. I was running it in a python2 virtual env. Even though it is 2020 it is probably worth maintaining compatibility. I made a code on comment on a simple way to do that, as well as a fix to the zipping with pigz. Otherwise looks great and is much quicker!!

@ar0ch
Copy link
Contributor Author

ar0ch commented Jan 21, 2020

OK, this should give us Python 2.7.x and 3.y (where x > 6 and y > 4) compatibility with threads still.

python2 --version; python2 treetoreads.py example.config > py2.log; diskus example_out; rm -rf example_out
Python 2.7.11
Using 10 threads for read simulation
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Read length greater than allowed by profile, setting to profile max
2.83 MB (2,834,432 bytes)

And Python3

python3 --version; python3 treetoreads.py example.config > py3.log; diskus example_out; rm -rf example_out
Python 3.7.3
Using 10 threads for read simulation
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Read length greater than allowed by profile, setting to profile max
2.77 MB (2,772,992 bytes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants