Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
316 commits
Select commit Hold shift + click to select a range
85d817a
added multiprocessing to phydms_modeladequacy
skhilton Nov 27, 2018
74d9063
cleaned up the test directory
skhilton Nov 28, 2018
15e22b3
multiprocessing is used only if ncpu greater than one
skhilton Nov 28, 2018
6bef390
added test for model adequacy with and without multiprocessing
skhilton Nov 28, 2018
c8ff083
small formatting tweaks
skhilton Nov 28, 2018
30ee720
refactored code to perform simulations in batches
skhilton Nov 30, 2018
da02af9
make sure there are >= 100 simulations per CPU
skhilton Nov 30, 2018
fc67e3f
increased number of simulations in multiprocessing test
skhilton Nov 30, 2018
c1da353
removed slow list comprehension in kldiv function
skhilton Nov 30, 2018
c2586d4
cleaned up code around multiprocessor/no multiprocessor switch
skhilton Nov 30, 2018
f329a16
fixed sites being indexed at 0 to indexed at 1 for outputs
skhilton Nov 30, 2018
066b714
streamlined processing of the simulations
skhilton Nov 30, 2018
8b5bf32
changed min number of simuluations per CPU to 1000
skhilton Nov 30, 2018
37e8caf
Merge branch 'master' into model_adequacy
jbloom Dec 12, 2018
67f0758
Update .travis.yml
skhilton Dec 13, 2018
4993109
ignore failing tests
skhilton Dec 13, 2018
7ee2e5c
specified float formatting in doctest
skhilton Dec 13, 2018
03d6fe6
Merge branch 'model_adequacy' of https://github.com/jbloomlab/phydms …
skhilton Dec 13, 2018
07baa70
made test file paths more robust
skhilton Dec 14, 2018
bed0509
removed call to raxml and made file paths more robust
skhilton Dec 14, 2018
26dfc85
fixed bug in CPU distribution
skhilton Dec 14, 2018
4e66ec2
Updated travis.yml
jon-mah Dec 14, 2018
fc9b24a
Ignore failing modeladequacy tests
jon-mah Dec 15, 2018
0df81b2
cleaned up distance calculations
skhilton Dec 17, 2018
32ad1c7
sequestered failing test
skhilton Dec 17, 2018
6d69418
fixed doctest for prefdistance
skhilton Dec 17, 2018
d16b18c
added a dictionary
skhilton Dec 17, 2018
1a61199
clean up multiprocessing in model adequacy
skhilton Dec 17, 2018
7328f9d
clean up random seed setting in simulator.simulate
skhilton Dec 17, 2018
466efb4
added layered cacheing to simulator.simulate
skhilton Dec 17, 2018
2d079d5
added test to make sure 1 CPU and > 1 CPU get the same answer
skhilton Dec 17, 2018
9af4de2
Absolute pathing, working build
jon-mah Dec 17, 2018
61f8ceb
fixed translate with gaps function in compare simulators tests
skhilton Dec 17, 2018
6e2544b
Shortened input data, updated expected results, working build.
jon-mah Dec 17, 2018
389f06f
set check_input default value to True
skhilton Dec 17, 2018
3d3ccf8
Absolute pathing for expected outputs.
jon-mah Dec 17, 2018
097ed1d
Absolute pathing for all files.
jon-mah Dec 17, 2018
4694f7b
Refactored tests to resemble test_phydms_comprehensive
jon-mah Dec 17, 2018
9a84e99
small tweak to multiprocessing
skhilton Dec 17, 2018
f058832
Added tests for simulated data, associated expected results, working …
jon-mah Dec 18, 2018
142c0b9
added the correct file for the phydms_modeladequacy test
skhilton Jan 10, 2019
52c9814
small tweak to test description
skhilton Jan 10, 2019
2a3e99f
fixed phydms modeladequacy multiprocessing test script
skhilton Jan 10, 2019
19b1ae1
fixed multiprocessing in phydms_modeladequacy
skhilton Jan 10, 2019
f720646
small tweaks to test formatting
skhilton Jan 10, 2019
26df9d4
formatting tweaks to model adequacy script
skhilton Jan 10, 2019
709943c
tweaks to script
skhilton Jan 11, 2019
6e56ab7
added phydms_modeladequacy tests to Travis CI
skhilton Jan 11, 2019
39ff94a
increased the number of simulations/CPU to 1000
skhilton Jan 11, 2019
fa1269b
print out expected and actual results for failing tests
skhilton Jan 11, 2019
0e75b5f
fixed typo
skhilton Jan 11, 2019
ee90a52
fixed formatting on error message
skhilton Jan 11, 2019
614df9b
fixed test typo
skhilton Jan 11, 2019
8a5378b
added tolerance to tests
skhilton Jan 11, 2019
ea83bab
fixed typo in error message
skhilton Jan 11, 2019
300c691
increased the tolerance
skhilton Jan 12, 2019
52e6925
removed tolerance in model adequacy test
skhilton Jan 15, 2019
5d8114d
fixed typo in test
skhilton Jan 15, 2019
ec41176
ignore multiprocessing test
skhilton Jan 25, 2019
227b173
switched to NP data
skhilton Jan 25, 2019
ba87db7
reduced number of simulations
skhilton Jan 25, 2019
0779e2b
changed multiprocessing test to NP data
skhilton Jan 25, 2019
5cb276d
removed obsolete input data directory
skhilton Jan 25, 2019
f46c353
removed old files
skhilton Jan 25, 2019
73305cf
utilize full paths for RAxML
skhilton Jan 31, 2019
cf22928
added RMSD metric to phydms_modeladequacy
skhilton Feb 1, 2019
a191657
updated tests to include RMSD results
skhilton Feb 1, 2019
ab38af8
added max number of simulations per CPU
skhilton Feb 13, 2019
cd36531
Merge branch 'model_adequacy' of https://github.com/jbloomlab/phydms …
skhilton Feb 13, 2019
f4da63a
fixed inability to handle a large number of simulations
skhilton Feb 14, 2019
ef2a035
Added gammaomega and empirical Bayes capabilities to phydms_comprehen…
jon-mah Feb 14, 2019
0a92517
Split phydms_comprehensive tests into two, added --ncats flag functio…
jon-mah Feb 15, 2019
1038bd5
shortened diff prefs test
skhilton Feb 16, 2019
43a59b7
Merge model_adequacy to REL, resolve conflict in pytest.ini
jon-mah Feb 21, 2019
b5eee44
Added test cases for divpressure. Tests expected to fail.
jon-mah Mar 7, 2019
0e1b1eb
added rateMatrixPrefix option to phydmslib.simulate
skhilton Oct 15, 2018
706871f
started the phydmslib.modeladequacy modules
skhilton Oct 16, 2018
ce4a3df
fixed typo in calc_aa_frequencies doctest
skhilton Oct 17, 2018
99422c2
small tweaks to calc_aa_frequencies
skhilton Oct 17, 2018
6a61fde
added sequence length check to translate_with_gaps
skhilton Oct 17, 2018
08e8c96
added more functions to modeladequacy and started test script
skhilton Oct 19, 2018
efba620
small tweaks to test script
skhilton Oct 19, 2018
05ee0a6
renamed variable to follow lowercase conventions
skhilton Oct 19, 2018
fd124c9
seed numpy random seed in simulate
skhilton Oct 19, 2018
f6ef303
added ability to do replicate simulations
skhilton Oct 26, 2018
4bf369d
first pass at model_adequacy script
skhilton Nov 2, 2018
03515a6
small tweak to parse arguments
skhilton Nov 2, 2018
112a5ef
added profiling script for modeladequacy
skhilton Nov 2, 2018
75c8f1e
added pstats output
skhilton Nov 2, 2018
77a6f59
added script to profile model adequacy
skhilton Nov 6, 2018
87f0a29
added additional checks to the replicate alignment test
skhilton Nov 6, 2018
5e82363
removed extraneous file
skhilton Nov 6, 2018
a5c5199
fixed bug in profiling script
skhilton Nov 6, 2018
ead9612
small tweak to profile script
skhilton Nov 6, 2018
c7d9a7b
started test for model adequacy test
skhilton Nov 6, 2018
68338b9
added new class simulator to simulate.py
skhilton Nov 7, 2018
9e76ea7
added new simulation method to the simulation test
skhilton Nov 7, 2018
7686fad
tweaks to simulate.py
skhilton Nov 8, 2018
8290fa3
refactored alignment simulation test
skhilton Nov 8, 2018
e9fff26
changes to simulate.py
skhilton Nov 8, 2018
587370b
changed node attributes in simulator to private
skhilton Nov 9, 2018
219dc21
small tweaks to simulatoin test
skhilton Nov 9, 2018
963e8fb
added Simulator class to random seed test
skhilton Nov 9, 2018
5cb7cb5
added profile code and profiling results
skhilton Nov 9, 2018
cdd3697
some changes that speed up `Simulate.simulate`
jbloom Nov 9, 2018
05daf6c
added test to check simulator against expected results
skhilton Nov 9, 2018
a153d58
small tweaks to simulate.py
skhilton Nov 9, 2018
1c2123b
added comparison test
skhilton Nov 10, 2018
32f56da
small tweak to docstring
skhilton Nov 10, 2018
9574d03
removed old profiling output
skhilton Nov 13, 2018
56acf0d
removed unused function from phydmslib/modeladequacy
skhilton Nov 13, 2018
a6ace82
model adequacy script runs through all the way
skhilton Nov 13, 2018
5dd945a
cache the cumulative sums and sequence arrays
skhilton Nov 13, 2018
bb42834
added function to output an alignment
skhilton Nov 13, 2018
ab5e8c2
added another test for random seed setting
skhilton Nov 13, 2018
16526ab
added pvalue calculation function to modeladequacy
skhilton Nov 14, 2018
7ac2968
added in random seed and simplified simulation processing
skhilton Nov 14, 2018
7d06d25
removed unnecessary test files
skhilton Nov 14, 2018
0f96097
added YNGKP_M0 to phydms_modeladequacy and fixed random seed
skhilton Nov 14, 2018
4546e06
added test of phydms_modeladequacy against expected results
skhilton Nov 14, 2018
0cc8553
small formatting tweaks
skhilton Nov 14, 2018
2414996
fixed how the random seed is set in simulator and updated expected re…
skhilton Nov 19, 2018
2672a2c
small tweaks to simulate.py
skhilton Nov 19, 2018
4a0a662
removed metrics flag
skhilton Nov 19, 2018
1c1f27d
changed default number of simulations and clarified models accepted
skhilton Nov 19, 2018
29801be
seed each call of simulator
skhilton Nov 19, 2018
7d3de28
added more interpretable error message
skhilton Nov 19, 2018
7804559
removed unncessary function
skhilton Nov 19, 2018
a9d25ae
update phydms_modeladequacy test with nSim as optional param
skhilton Nov 20, 2018
da8c378
fixed random seed setting in test_alignmentSimulation.py
skhilton Nov 20, 2018
7dd0fa4
refactored simulate code to only store codon index
skhilton Nov 20, 2018
a96f1fd
cleaned up comments
skhilton Nov 20, 2018
db01f80
small tweaks to simulate.py
skhilton Nov 20, 2018
e05567b
cleaned up variable names
skhilton Nov 20, 2018
5a09ff8
replaced for loop with starmap
skhilton Nov 21, 2018
41ddd49
set numpy random seed as well as scipy random seed
skhilton Nov 27, 2018
394520c
implemented with itertools starmap
skhilton Nov 27, 2018
b0db690
set random seed during pvalue calculation
skhilton Nov 27, 2018
2d7795d
fixed seed setting in pvalue calculation function
skhilton Nov 27, 2018
66c18bb
added multiprocessing to phydms_modeladequacy
skhilton Nov 27, 2018
2d0d5cb
cleaned up the test directory
skhilton Nov 28, 2018
af65bca
multiprocessing is used only if ncpu greater than one
skhilton Nov 28, 2018
e190e0d
added test for model adequacy with and without multiprocessing
skhilton Nov 28, 2018
2c92763
small formatting tweaks
skhilton Nov 28, 2018
874f564
refactored code to perform simulations in batches
skhilton Nov 30, 2018
a4f596b
make sure there are >= 100 simulations per CPU
skhilton Nov 30, 2018
3d25eed
increased number of simulations in multiprocessing test
skhilton Nov 30, 2018
b24fcaa
removed slow list comprehension in kldiv function
skhilton Nov 30, 2018
251d4ff
cleaned up code around multiprocessor/no multiprocessor switch
skhilton Nov 30, 2018
2eeffc9
fixed sites being indexed at 0 to indexed at 1 for outputs
skhilton Nov 30, 2018
d8529e7
streamlined processing of the simulations
skhilton Nov 30, 2018
2c8f805
changed min number of simuluations per CPU to 1000
skhilton Nov 30, 2018
f3b7b95
ignore failing tests
skhilton Dec 13, 2018
f8ddc8a
specified float formatting in doctest
skhilton Dec 13, 2018
679330e
made test file paths more robust
skhilton Dec 14, 2018
b9c8bb5
removed call to raxml and made file paths more robust
skhilton Dec 14, 2018
e123249
fixed bug in CPU distribution
skhilton Dec 14, 2018
39c8864
Updated travis.yml
jon-mah Dec 14, 2018
3fd8cc9
Ignore failing modeladequacy tests
jon-mah Dec 15, 2018
d342bbb
cleaned up distance calculations
skhilton Dec 17, 2018
fa766a2
fixed doctest for prefdistance
skhilton Dec 17, 2018
5b7e722
added a dictionary
skhilton Dec 17, 2018
7cd0a64
clean up multiprocessing in model adequacy
skhilton Dec 17, 2018
f3b4ce3
clean up random seed setting in simulator.simulate
skhilton Dec 17, 2018
3f40948
added layered cacheing to simulator.simulate
skhilton Dec 17, 2018
7243387
added test to make sure 1 CPU and > 1 CPU get the same answer
skhilton Dec 17, 2018
7d1f1c0
Absolute pathing, working build
jon-mah Dec 17, 2018
1729e25
fixed translate with gaps function in compare simulators tests
skhilton Dec 17, 2018
ac1e8c0
Shortened input data, updated expected results, working build.
jon-mah Dec 17, 2018
3778653
set check_input default value to True
skhilton Dec 17, 2018
fd02bdc
Absolute pathing for expected outputs.
jon-mah Dec 17, 2018
cb9782f
Absolute pathing for all files.
jon-mah Dec 17, 2018
417c21e
Refactored tests to resemble test_phydms_comprehensive
jon-mah Dec 17, 2018
84f7a31
small tweak to multiprocessing
skhilton Dec 17, 2018
6745829
Added tests for simulated data, associated expected results, working …
jon-mah Dec 18, 2018
02bd085
added the correct file for the phydms_modeladequacy test
skhilton Jan 10, 2019
0e6a7bc
small tweak to test description
skhilton Jan 10, 2019
e37e67b
fixed phydms modeladequacy multiprocessing test script
skhilton Jan 10, 2019
04a12d9
fixed multiprocessing in phydms_modeladequacy
skhilton Jan 10, 2019
7c37dd8
small tweaks to test formatting
skhilton Jan 10, 2019
2b8df92
formatting tweaks to model adequacy script
skhilton Jan 10, 2019
fb18008
tweaks to script
skhilton Jan 11, 2019
ce39b63
added phydms_modeladequacy tests to Travis CI
skhilton Jan 11, 2019
0c4ff99
increased the number of simulations/CPU to 1000
skhilton Jan 11, 2019
9a6a9e0
print out expected and actual results for failing tests
skhilton Jan 11, 2019
2a6745b
fixed typo
skhilton Jan 11, 2019
bf08921
fixed formatting on error message
skhilton Jan 11, 2019
77f471e
fixed test typo
skhilton Jan 11, 2019
36564ed
added tolerance to tests
skhilton Jan 11, 2019
39ace9f
fixed typo in error message
skhilton Jan 11, 2019
927c07f
increased the tolerance
skhilton Jan 12, 2019
b342c2d
removed tolerance in model adequacy test
skhilton Jan 15, 2019
8127b78
fixed typo in test
skhilton Jan 15, 2019
07ba729
ignore multiprocessing test
skhilton Jan 25, 2019
7961c23
switched to NP data
skhilton Jan 25, 2019
1a31622
reduced number of simulations
skhilton Jan 25, 2019
9b8f0e5
changed multiprocessing test to NP data
skhilton Jan 25, 2019
092e76c
removed obsolete input data directory
skhilton Jan 25, 2019
5eef07b
removed old files
skhilton Jan 25, 2019
9a5bc07
utilize full paths for RAxML
skhilton Jan 31, 2019
a2c4c72
added RMSD metric to phydms_modeladequacy
skhilton Feb 1, 2019
5f6cdbf
updated tests to include RMSD results
skhilton Feb 1, 2019
5348031
added max number of simulations per CPU
skhilton Feb 13, 2019
72c4ddf
fixed inability to handle a large number of simulations
skhilton Feb 14, 2019
39229db
Added gammaomega and empirical Bayes capabilities to phydms_comprehen…
jon-mah Feb 14, 2019
822b33e
Split phydms_comprehensive tests into two, added --ncats flag functio…
jon-mah Feb 15, 2019
223b66b
shortened diff prefs test
skhilton Feb 16, 2019
7bdfef8
Added test cases for divpressure. Tests expected to fail.
jon-mah Mar 7, 2019
77f5f0b
Added minimum branch length argument.
Jan 7, 2020
dd2fff8
resolved merge conflict.
Jan 7, 2020
1352dd4
attempted to shorten tests
skhilton May 27, 2020
ba4521e
tweaked xvfb line in .travis.yml
skhilton May 27, 2020
17e594d
added expected diffprefs file
skhilton May 27, 2020
000e946
use more pandas function for empirical bayes
skhilton May 27, 2020
83a4639
fixed bug in positive selection calculations
skhilton May 27, 2020
10ee00f
fixed comment
skhilton May 27, 2020
ed80a47
Merge branch 'REL' into REL_merge
skhilton May 29, 2020
b24fdc1
fixed bugs introduced during merge
skhilton May 29, 2020
97b39ef
removed model adequacy protocol
skhilton May 29, 2020
fdda0f8
cleaned up phydms_comprehensive test
skhilton May 29, 2020
c2179cf
added branch to travis CI
skhilton May 29, 2020
9f3066f
removed call to model adequacy in test
skhilton May 29, 2020
869c50f
removed model adequacy call from another test
skhilton May 29, 2020
2bbdafb
pytest ignore deprecation warnings
skhilton May 29, 2020
448e311
Added documentation in .rst format for REL method.
jon-mah May 30, 2020
af5c892
fixed typos in REL method docs.
jon-mah May 30, 2020
52d7f06
update CHANGELOG and version number to 2.4.0
jbloom May 30, 2020
63a1422
remove `REL` and `REL_merge` branches from Travis tests now that bein…
jbloom May 30, 2020
20d436f
require python >= 3.6 (we are only testing on this)
jbloom May 30, 2020
c96b2c5
clean up docstring
jbloom May 30, 2020
54fc59a
clean up docstring
jbloom May 30, 2020
b91540d
clean up ExpCM docs
jbloom May 30, 2020
55cedac
split empirical_bayes flag into two separate arguments, random_effect…
jon-mah Jun 15, 2020
1119769
small tweaks to parseargs
skhilton Jun 16, 2020
3dc271e
added REL checks to the model check portion of phydms
skhilton Jun 16, 2020
1864169
tweaks to phydms_comprehensive test
skhilton Jun 16, 2020
6a53bf2
phydms_prog for random_effects_likelihood and REL_ncats options.
jon-mah Jun 25, 2020
ad03e53
Merge branch 'REL_merge' of https://github.com/jbloomlab/phydms into …
jon-mah Jun 25, 2020
f09fb50
Merge branch 'master' into REL_merge
skhilton Jun 25, 2020
d01985c
Merge branch 'master' into REL_merge.
jon-mah Jun 25, 2020
42b5350
phdyms_prog docs for REL output files.
jon-mah Jun 25, 2020
850db42
Merge branch 'REL_merge' of https://github.com/jbloomlab/phydms into …
skhilton Jun 25, 2020
b1793ce
tweaks to phydms_prog
skhilton Jun 25, 2020
30e78d5
draft docs for numerical implementation of REL method.
jon-mah Jun 26, 2020
585ccef
removed REL from implementation
skhilton Jun 26, 2020
62b8554
changed CLI to omega_random_effects_likelihood
skhilton Jul 1, 2020
3f4dd06
slight doc tweaks
skhilton Jul 1, 2020
405d91e
changed column titles
skhilton Jul 1, 2020
2e89157
added branch to travis
skhilton Jul 1, 2020
4a1496a
Merge pull request #32 from jbloomlab/REL-names
jon-mah Jul 2, 2020
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ script:
branches:
only:
- master
- model_adequacy
- REL-names

notifications:
email:
Expand Down
5 changes: 5 additions & 0 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
@@ -1,5 +1,10 @@
Changelog
===========

2.4.0
------
* Added REL method for detecting sites of positive selection.

2.3.8
------
* Fixed bug in `phydms_prepalignment` due to `mafft` shortening sequence names.
Expand Down
45 changes: 43 additions & 2 deletions docs/ExpCM.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
=======================================================

.. contents::
:depth: 2
:depth: 3

Overview
-------------
Expand Down Expand Up @@ -143,9 +143,12 @@ In this case, :math:`\beta` is drawn from ``--ncats`` categories placed at the m
Note that the mean :math:`\beta` value is then :math:`\alpha_{\beta} / \beta_{\beta}`.

Identifying diversifying selection via site-specific :math:`\omega_r` values
------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
One type of interesting selection is *diversifying selection*, where there is continual pressure for amino-acid change. Such selection might be expected to occur at sites that are targeted by adaptive immunity or subjected to some other form of selection which constantly favors changes in the protein sequence. At such sites, we expect that the relative rate of nonsynonymous substitutions will be higher than suggested by the site-specific preferences :math:`\pi_{r,a}` due to this diversifying selection.

FEL-like approach
++++++++++++++++++

To detect diversifying selection at specific sites within the framework of the *ExpCM* implemented in ``phydms``, we use an approach that is highly analogous the *FEL* (**f**\ixed **e**\ffects **l**\ikelihood) method described by `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_. Essentially, the tree topology, branch lengths, and all shared model parameters are fixed to their maximum-likelihood values optimized over the entire gene sequence. Then for each site :math:`r`, we fit a site-specific ratio of the rate of synonymous versus nonsynonymous substitutions while holding all holding all the other tree and model parameters constant. Effectively, this is fitting a different :math:`\omega_r` for each site, and so this analysis is indicated as ``--omegabysite`` in the ``phydms`` options.

Specifically, after fixing all of the other parameters as described above, for each site :math:`r` we re-define Equation :eq:`Frxy_ExpCM` as
Expand All @@ -166,6 +169,44 @@ The null hypothesis is that :math:`\omega_r = 1`. We compute a P-value for rejec

Significant support for a value of :math:`\omega_r > 1` can be taken as evidence for diversifying selection beyond that expected given the constraints encapsulated in the site-specific amino-acid preferences. Significant support for a value of :math:`\omega_r < 1` can be taken as evidence for selection against amino-acid change beyond that expected given the constraints encapsulated in the site-specific amino-acid preferences. Note, however, that if the site-specific preferences don't accurately describe the real constraints, you might get :math:`\omega_r \ne 1` simply because of this fact -- so you will want to examine if sites might be subject to selection that is better described by modulating the stringency parameter :math:`\beta` or by invoking differential preferences, as described below.

REL-like approach
+++++++++++++++++++

In addition to the site-specific FEL-based approach discussed above, we have also implemented an approach that is highly analogous to the REL (**r**\andom **e**\ffects **l**\ikelihood) method described by `Nielsen and Yang, Genetics, 148; 929-936`_.
Rather than fitting the ratio of the rate of synonymous versus nonsynonymous substitutions for *each* site, the REL approach involves fitting a discretized gamma distribution of omega values across *all* sites.
When fitting this gamma distribution of :math:`\omega`, we let :math:`\omega` values be drawn from *K* discrete categories, with each category given equal proportion.
This gamma distribution is described by a shape parameter, :math:`\alpha_{\omega}` and an inverse scale parameter, :math:`{\beta_\omega}`, which are fit simultaneously with the tree topology, branch lengths, and other shared model parameters using maximum likelihood estimation.
We then infer selection at individual sites using an empirical Bayesian approach.

In the empirical Bayesian approach, we integrate the gamma distribution of omega values by approximating the distribution with *J* discrete categories, with each category having equal proportion.
Integrating the distribution is much faster than fitting the distribution, so typically *J* is set to be greater than *K* to save time.
Then, for each discrete category, *j*, we assign the mean value of its subdistribution, denoted as :math:`\omega_j`, to that category.
This analysis is indicated as ``--omega_random_effects_likelihood`` in the ``phydms`` options.
Given an integer greater than one, the ``--empirical_bayes`` option specifies the number of discrete categories used to approximate the gamma distribution for integration, denoted as *J*.

We do not know *a priori* which discrete category a site belongs to, so the likelihood function for observing a site's sequence data, :math:`\mathcal{S}_r`, is given by the average over all possibilities, i.e.,

.. math::
:label: rel_likelihood_function

\mathcal{L}(\mathcal{S}_r) = \frac{1}{J}\sum_{j = 0}^{J - 1} \mathcal{L}(\mathcal{S}_r | \omega_j)

where :math:`\mathcal{L}(\mathcal{S}_r | \omega_j)` is the likelihood function for observing sequence data :math:`\mathcal{S}_r` given that site *r* is in category *j*.

Then, the posterior probability that a site, *r*, with sequence data, :math:`\mathcal{S}_r`, belongs to category, *j*, is given by

.. math::
:label: rel_posterior_probability

\text{Pr}(\omega_j | \mathcal{S}r) = \frac{\frac{1}{J}\mathcal{L}(\mathcal{S}_r | \omega_j)}{\mathcal{L}(\mathcal{S}_r)} = \frac{\mathcal{L}(\mathcal{S}_r | \omega_j)}{\sum_{i=0}^{J - 1}\mathcal{L}(\mathcal{S}_r | \omega_i)}.

The category *j* which maximizes the posterior probability of observing :math:`\omega_j` given sequence data, :math:`\mathcal{S}_r`, is the most likely category for site, *r*. We calculate the posterior probability of diversifying selection at individual sites by summing the posterior probabilities over which that site belongs to any category, *j*, where :math:`\omega_j > 1`, i.e.,

.. math::
:label: rel_diversifying_selection

\text{Pr}(\omega_r > 1) = \sum_{j: \omega_j > 1}\text{Pr}(\omega_j | \mathcal{S}_r)

Identifying differentially selected amino acids by fitting preferences for each site
---------------------------------------------------------------------------------------
A more complete approach is to examine each site to see the extent to which the preferences for each amino acid in nature differ from those encapsulated in the :math:`\pi_{r,a}` values. The advantage of this approach is that it can identify any form of differential selection (the approach in the previous section works best when the selection in nature is more uniform across amino acids than the :math:`\pi_{r,a}` values), and also that it can pinpoint specific amino acids that are favored or disfavored in natural evolution by an unexpected amount. The disadvantage is that ``phydms`` does not currently implement a good way to statistically test the significance of this type of differential selection, so although you can visualize and assess the selection it's hard to say that any given differential selection is significant at some specific P-value threshold.
Expand Down
2 changes: 1 addition & 1 deletion docs/implementation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -928,7 +928,7 @@ The models described above fit a single value to each model parameter.
We can also fit a distribution of values across sites for one model parameter :math:`\lambda`.
For instance, when :math:`\lambda` is the :math:`\omega` of the *YNGKP* models, we get the *YNGKP_M5* model described in `Yang, Nielsen, Goldman, and Krabbe Pederson, Genetics, 155:431-449`_.

Specifically, let the :math:`\lambda` values be drawn from :math:`K` discrete categories with lambda values :math:`\lambda_0, \lambda_2, \ldots, \lambda_{K-1}`, and give equal weight to each category. Then the overall likelihood at site :math:`r` is
Specifically, let the :math:`\lambda` values be drawn from :math:`K` discrete categories with lambda values :math:`\lambda_0, \lambda_1, \ldots, \lambda_{K-1}`, and give equal weight to each category. Then the overall likelihood at site :math:`r` is

.. math::

Expand Down
2 changes: 1 addition & 1 deletion docs/installation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ Installation

Minimal requirements
----------------------
`phydms`_ is written in `Python`_. It requires Python 3.5 or higher.
`phydms`_ is written in `Python`_. It requires Python 3.6 or higher.

Straightforward installation requires the `Python`_ package management system `pip`_ and a ``C`` compiler such a ``gcc`` (there are some ``cython`` extensions).

Expand Down
59 changes: 56 additions & 3 deletions docs/phydms_prog.rst
Original file line number Diff line number Diff line change
Expand Up @@ -79,10 +79,11 @@ Command-line usage
This option is not typically recommended. It will typically lead to only very slight improvements in log likelihood at substantial computational cost.

\-\-omegabysite
If using a YNGKP model, then the :math:`\omega_r` value is nearly analogous that obtained using the *FEL* model described by `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_. If using and *ExpCM*, then :math:`\omega_r` has the meaning described in :ref:`ExpCM`. Essentially, we fix all other model / tree parameters and then compare a model that fits a synonymous and nonsynonymous rate to each site to a null model that only fits a synonymous rate; there is evidence for :math:`\omega_r \ne 1` if fitting both nonsynonymous and synonymous rate gives sufficiently better likelihood than fitting synonymous rate alone. See also the ``--omegabysite_fixsyn`` option.
If using a YNGKP model, then the :math:`\omega_r` value is nearly analogous that obtained using the *FEL* model described by `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_. If using an *ExpCM*, then :math:`\omega_r` has the meaning described in :ref:`ExpCM`. Essentially, we fix all other model / tree parameters and then compare a model that fits a synonymous and nonsynonymous rate to each site to a null model that only fits a synonymous rate; there is evidence for :math:`\omega_r \ne 1` if fitting both nonsynonymous and synonymous rate gives sufficiently better likelihood than fitting synonymous rate alone. See also the ``--omegabysite_fixsyn`` option.
For an alternative method to determine site-specific :math:`\omega_r`, please see the ``--omega_random_effects_likelihood`` option.

\-\-omegabysite_fixsyn
This option is meaningful only if you are using ``--omegabysite``. If you use this option, then we compare a model in which we fit a nonsynonymous rate to each site to a model in which we fit nothing. The synonymous rate is not fit, and so is assumed to be equal to the overall value fit for the tree. According to `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_, in some cases this can yield greater power if there is relatively limited data. However, it comes with the risk of giving spurious results if there is substantial variation in the synonymous substitution rate among sites.
This option is meaningful only if you are using ``--omegabysite``. If you use this option, then we compare a model in which we fit a nonsynonymous rate to each site to a model in which we fit nothing. The synonymous rate is not fit, and so is assumed to be equal to the overall value fit for the tree. According to `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_, in some cases this can yield greater power if there is relatively limited data. However, it comes with the risk of giving spurious results if there is substantial variation in the synonymous substitution rate among sites. This distribution is then partitioned into several discrete categories

\-\-diffprefsbysite
This option can only be used with *ExpCM* models, **not** with *YNGKP* models.
Expand Down Expand Up @@ -114,6 +115,25 @@ Command-line usage
This option computes an average of each preference across sites (:math:`\pi_a = \frac{1}{L} \sum_r \pi_{r,a}` where :math:`r = 1, \ldots, L`), and then uses these average preferences for all sites.
This can be used as a control, as it merges all the information in the preferences into a non-site-specific model.

\-\-omega_random_effects_likelihood
If using a YNGKP model, then the :math:`\omega_r` value is nearly analogous that obtained using the *REL* model described by `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_.
If using an *ExpCM*, then :math:`\omega_r` has the meaning described in :ref:`ExpCM`.
We compute the posterior probability that :math:`\omega_r \ne 1`, e.g., :math:`\omega_r > 1` or :math:`\omega_r < 1` given a distribution of :math:`\omega` across the gene.
For an alternative method to determine site-specific :math:`\omega_r`, please see the ``--omegabysite`` option.

This option requires a gamma-distributed :math:`\omega`.
For *ExpCM*, use the ``--gammaomega`` option.
For *YNGKP* models, use the *YNGKP_M5* model.
To control the number of categories used to compute the posterior probability, see the ``--REL_ncats`` option.

\-\-REL_ncats
More categories leads to slightly longer run-time, values of 50-100 are usually adequate.

Note that while the ``--ncats`` and ``--REL_ncats`` have a similar definition, the number of categories used to discretize a distribution, they are slightly different in practice.
``--ncats`` controls the discretization while the distribution is being fit.
``--REL_ncats`` controls the discretization of the fit distribution while calculating the posterior.
The calculation of the posterior is much more computationally efficient, so we recommend that ``--ncats`` :math:`<<` ``--REL_ncats``.

\-\-minbrlen
All branches with lengths less than this value will be set to this value in the initial starting tree.
Branches can still end up with lengths less than this after subsequent optimization of this starting tree.
Expand Down Expand Up @@ -240,9 +260,42 @@ Here is an example of the first few lines of a file. The entries are tab separat
127 -0.0088 -0.0006 -0.0010 0.1423 -0.0021 -0.0179 -0.0059 -0.0096 -0.0208 -0.0100 -0.0021 -0.0095 -0.0007 -0.0066 -0.0073 -0.0114 -0.0146 -0.0075 -0.0010 -0.0049 0.1423
289 -0.0079 -0.0127 -0.0005 -0.0002 -0.0228 -0.0005 -0.0154 -0.0156 -0.0033 -0.0167 -0.0113 -0.0034 -0.0004 -0.0004 -0.0094 -0.0020 -0.0028 -0.0133 -0.0006 0.1391 0.1391

The first column gives the site numbers, subsequent columns give the differential preference (:math:`\Delta\pi_{r,a}`) for each amino acid.
The first column gives the site number, subsequent columns give the differential preference (:math:`\Delta\pi_{r,a}`) for each amino acid.
The last column gives the half absolute sum of the differential preferences, :math:`\sum_a |\Delta\pi_{r,a}|`, at each site. This quantity can range from zero to one.
The sites are sorted with the highest half absolute sum differential preference first.

Gamma-distributed discrete category file
+++++++++++++++++++++++++++++++++++++++++++
This file has the suffix ``_omegabycategory.csv``, and is created only if using the ``--omega_random_effects_likelihood`` option.
This file gives the posterior probability of each site falling into each category, as well as the mean :math:`omega` value of each discretized category.
These posterior probabilities are computed nearly identically to those obtained using the *REL* model as described in `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_.

Here is an example of the first few lines of a file. The entries are comma separated::

site,post_probability,omega
1,0.2503826180447997,0.0695219697627359
2,0.24755166505269052,0.0695219697627359
3,0.2526024760622074,0.0695219697627359
4,0.2530711698554593,0.0695219697627359
5,0.24843828974534077,0.0695219697627359

The ``post_probability`` column gives the posterior probability of that site falling into a given category.
The sites and omega values are sorted in ascending numerical order.

Site-specific posterior probability file
+++++++++++++++++++++++++++++++++++++++++++
This file has the suffix ``_posteriorprobabilities.csv``, and is created only if using the ``--omega_random_effects_likelihood`` option.
This file gives the sum total probability of each site having an :math:`\omega_r > 1`.
These posterior probabilities are computed nearly identically to those obtained using the *REL* model as described in `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_.

Here is an example of the first few lines of a file. The entries are comma separate::

site,pr(omega > 1)
8,0.2541928826887663
2,0.2533289672072823
6,0.252851860574337
9,0.25243889606707554

The pr(omega > 1) gives the sum total posterior probability of the given site being under diversifying selection.

.. include:: weblinks.txt
1 change: 1 addition & 0 deletions docs/weblinks.txt
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@
.. _`McCandlish and Stoltzfus, Quarterly Review of Biology, 89:225-252`: http://www.ncbi.nlm.nih.gov/pubmed/25195318
.. _`HKY85`: https://dx.doi.org/10.1007%2FBF02101694
.. _`Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`: http://mbe.oxfordjournals.org/content/22/5/1208.full
.. _`Nielsen and Yang, Genetics, 148; 929-936`: https://www.genetics.org/content/148/3/929
.. _`AIC`: https://en.wikipedia.org/wiki/Akaike_information_criterion
.. _`reStructuredText`: http://docutils.sourceforge.net/rst.html
.. _`Julien Dutheil`: http://kimura.univ-montp2.fr/jdutheil/CMS/index.php/
Expand Down
2 changes: 1 addition & 1 deletion phydmslib/_metadata.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
__version__ = '2.3.8'
__version__ = '2.4.0'
__author__ = 'the Bloom lab (see https://github.com/jbloomlab/phydms/contributors)'
__url__ = 'http://jbloomlab.github.io/phydms'
__author_email__ = 'jbloom@fredhutch.org'
Expand Down
6 changes: 5 additions & 1 deletion phydmslib/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,8 @@
`CODON_NT_COUNT` (`numpy.ndarray` of int, shape `(N_NT, N_CODON)`)
Element `[w][x]` gives the number of occurrences of nucleotide
`w` in codon `x`.
`CODONSTR_TO_AASTR` (dict):
mapping of codons to amino acids.
`STOP_CODON_TO_NT_INDICES` (`numpy.ndarray` of float, shape `(N_STOP, 3, N_NT)`)
Element `[x][p][w]` is 1.0 if codon position `p` is nucleotide `w`
in stop codon `x` and 0.0 otherwise.
Expand Down Expand Up @@ -77,20 +79,22 @@
STOP_POSITIONS = numpy.ones((3, N_NT), dtype = 'float')
CODON_TO_INDEX = {}
INDEX_TO_CODON = {}
CODONSTR_TO_AASTR = {}
CODON_TO_AA = []
i = 0
for nt1 in sorted(NT_TO_INDEX.keys()):
for nt2 in sorted(NT_TO_INDEX.keys()):
for nt3 in sorted(NT_TO_INDEX.keys()):
codon = nt1 + nt2 + nt3
aa = str(Bio.Seq.Seq(codon).translate())
CODONSTR_TO_AASTR[codon] = aa
if aa != '*':
CODON_TO_INDEX[codon] = i
INDEX_TO_CODON[i] = codon
CODON_TO_AA.append(AA_TO_INDEX[aa])
i += 1
else:
STOP_CODON_TO_NT_INDICES.append(numpy.zeros((3, N_NT),
STOP_CODON_TO_NT_INDICES.append(scipy.zeros((3, N_NT),
dtype='float'))
STOP_CODON_TO_NT_INDICES[-1][0][NT_TO_INDEX[nt1]] = 1.0
STOP_CODON_TO_NT_INDICES[-1][1][NT_TO_INDEX[nt2]] = 1.0
Expand Down
Loading