jbloomlab · jon-mah · Nov 27, 2018 · Nov 28, 2018 · Nov 28, 2018 · Nov 28, 2018
diff --git a/.travis.yml b/.travis.yml
@@ -25,7 +25,7 @@ script:
 branches:
   only:
     - master
-    - model_adequacy
+    - REL-names
 
 notifications:
   email:

diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,5 +1,10 @@
 Changelog
 ===========
+
+2.4.0
+------
+* Added REL method for detecting sites of positive selection.
+
 2.3.8
 ------
 * Fixed bug in `phydms_prepalignment` due to `mafft` shortening sequence names.

diff --git a/docs/ExpCM.rst b/docs/ExpCM.rst
@@ -5,7 +5,7 @@
 =======================================================
 
 .. contents::
-   :depth: 2
+   :depth: 3
 
 Overview
 -------------
@@ -143,9 +143,12 @@ In this case, :math:`\beta` is drawn from ``--ncats`` categories placed at the m
 Note that the mean :math:`\beta` value is then :math:`\alpha_{\beta} / \beta_{\beta}`.
 
 Identifying diversifying selection via site-specific :math:`\omega_r` values
-------------------------------------------------------------------------------
+-----------------------------------------------------------------------------------
 One type of interesting selection is *diversifying selection*, where there is continual pressure for amino-acid change. Such selection might be expected to occur at sites that are targeted by adaptive immunity or subjected to some other form of selection which constantly favors changes in the protein sequence. At such sites, we expect that the relative rate of nonsynonymous substitutions will be higher than suggested by the site-specific preferences :math:`\pi_{r,a}` due to this diversifying selection.
 
+FEL-like approach
+++++++++++++++++++
+
 To detect diversifying selection at specific sites within the framework of the *ExpCM* implemented in ``phydms``, we use an approach that is highly analogous the *FEL* (**f**\ixed **e**\ffects **l**\ikelihood) method described by `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_. Essentially, the tree topology, branch lengths, and all shared model parameters are fixed to their maximum-likelihood values optimized over the entire gene sequence. Then for each site :math:`r`, we fit a site-specific ratio of the rate of synonymous versus nonsynonymous substitutions while holding all holding all the other tree and model parameters constant. Effectively, this is fitting a different :math:`\omega_r` for each site, and so this analysis is indicated as ``--omegabysite`` in the ``phydms`` options.
 
 Specifically, after fixing all of the other parameters as described above, for each site :math:`r` we re-define Equation :eq:`Frxy_ExpCM` as
@@ -166,6 +169,44 @@ The null hypothesis is that :math:`\omega_r = 1`. We compute a P-value for rejec
 
 Significant support for a value of :math:`\omega_r > 1` can be taken as evidence for diversifying selection beyond that expected given the constraints encapsulated in the site-specific amino-acid preferences. Significant support for a value of :math:`\omega_r < 1` can be taken as evidence for selection against amino-acid change beyond that expected given the constraints encapsulated in the site-specific amino-acid preferences. Note, however, that if the site-specific preferences don't accurately describe the real constraints, you might get :math:`\omega_r \ne 1` simply because of this fact -- so you will want to examine if sites might be subject to selection that is better described by modulating the stringency parameter :math:`\beta` or by invoking differential preferences, as described below.
 
+REL-like approach
++++++++++++++++++++
+
+In addition to the site-specific FEL-based approach discussed above, we have also implemented an approach that is highly analogous to the REL (**r**\andom **e**\ffects **l**\ikelihood) method described by `Nielsen and Yang, Genetics, 148; 929-936`_.
+Rather than fitting the ratio of the rate of synonymous versus nonsynonymous substitutions for *each* site, the REL approach involves fitting a discretized gamma distribution of omega values across *all* sites.
+When fitting this gamma distribution of :math:`\omega`, we let :math:`\omega` values be drawn from *K* discrete categories, with each category given equal proportion.
+This gamma distribution is described by a shape parameter, :math:`\alpha_{\omega}`, and an inverse scale parameter, :math:`\beta_{\omega}`, which are fit simultaneously with the tree topology, branch lengths, and other shared model parameters using maximum likelihood estimation.
+We then infer selection at individual sites using an empirical Bayesian approach.
+
+In the empirical Bayesian approach, we integrate the gamma distribution of omega values by approximating the distribution with *J* discrete categories, with each category having equal proportion.
+Integrating the distribution is much faster than fitting the distribution, so *J* can typically be set larger than *K* to give a finer approximation at little additional computational cost.
+Then, for each discrete category, *j*, we assign the mean value of its subdistribution, denoted as :math:`\omega_j`, to that category.
+This analysis is indicated as ``--omega_random_effects_likelihood`` in the ``phydms`` options.
+Given an integer greater than one, the ``--empirical_bayes`` option specifies the number of discrete categories used to approximate the gamma distribution for integration, denoted as *J*.
+
+We do not know *a priori* which discrete category a site belongs to, so the likelihood function for observing a site's sequence data, :math:`\mathcal{S}_r`, is given by the average over all possibilities, i.e.,
+
+.. math::
+   :label: rel_likelihood_function
+
+   \mathcal{L}(\mathcal{S}_r) = \frac{1}{J}\sum_{j = 0}^{J - 1} \mathcal{L}(\mathcal{S}_r | \omega_j)
+
+where :math:`\mathcal{L}(\mathcal{S}_r | \omega_j)` is the likelihood function for observing sequence data :math:`\mathcal{S}_r` given that site *r* is in category *j*.
+
+Then, the posterior probability that a site, *r*, with sequence data, :math:`\mathcal{S}_r`, belongs to category, *j*, is given by
+
+.. math::
+   :label: rel_posterior_probability
+
+   \text{Pr}(\omega_j | \mathcal{S}_r) = \frac{\frac{1}{J}\mathcal{L}(\mathcal{S}_r | \omega_j)}{\mathcal{L}(\mathcal{S}_r)} = \frac{\mathcal{L}(\mathcal{S}_r | \omega_j)}{\sum_{i=0}^{J - 1}\mathcal{L}(\mathcal{S}_r | \omega_i)}.
+
+The category *j* that maximizes the posterior probability :math:`\text{Pr}(\omega_j | \mathcal{S}_r)` is the most likely category for site *r*. We calculate the posterior probability of diversifying selection at an individual site by summing the posterior probabilities that the site belongs to each category *j* for which :math:`\omega_j > 1`, i.e.,
+
+.. math::
+   :label: rel_diversifying_selection
+
+   \text{Pr}(\omega_r > 1) = \sum_{j: \omega_j > 1}\text{Pr}(\omega_j | \mathcal{S}_r)
+
 Identifying differentially selected amino acids by fitting preferences for each site
 ---------------------------------------------------------------------------------------
 A more complete approach is to examine each site to see the extent to which the preferences for each amino acid in nature differ from those encapsulated in the :math:`\pi_{r,a}` values. The advantage of this approach is that it can identify any form of differential selection (the approach in the previous section works best when the selection in nature is more uniform across amino acids than the :math:`\pi_{r,a}` values), and also that it can pinpoint specific amino acids that are favored or disfavored in natural evolution by an unexpected amount. The disadvantage is that ``phydms`` does not currently implement a good way to statistically test the significance of this type of differential selection, so although you can visualize and assess the selection it's hard to say that any given differential selection is significant at some specific P-value threshold.

diff --git a/docs/implementation.rst b/docs/implementation.rst
@@ -928,7 +928,7 @@ The models described above fit a single value to each model parameter.
 We can also fit a distribution of values across sites for one model parameter :math:`\lambda`.
 For instance, when :math:`\lambda` is the :math:`\omega` of the *YNGKP* models, we get the *YNGKP_M5* model described in `Yang, Nielsen, Goldman, and Krabbe Pederson, Genetics, 155:431-449`_.
 
-Specifically, let the :math:`\lambda` values be drawn from :math:`K` discrete categories with lambda values :math:`\lambda_0, \lambda_2, \ldots, \lambda_{K-1}`, and give equal weight to each category. Then the overall likelihood at site :math:`r` is
+Specifically, let the :math:`\lambda` values be drawn from :math:`K` discrete categories with lambda values :math:`\lambda_0, \lambda_1, \ldots, \lambda_{K-1}`, and give equal weight to each category. Then the overall likelihood at site :math:`r` is
 
 .. math::
 

diff --git a/docs/installation.rst b/docs/installation.rst
@@ -8,7 +8,7 @@ Installation
 
 Minimal requirements
 ----------------------
-`phydms`_ is written in `Python`_. It requires Python 3.5 or higher.
+`phydms`_ is written in `Python`_. It requires Python 3.6 or higher.
 
 Straightforward installation requires the `Python`_ package management system `pip`_ and a ``C`` compiler such a ``gcc`` (there are some ``cython`` extensions). 
 

diff --git a/docs/phydms_prog.rst b/docs/phydms_prog.rst
@@ -79,10 +79,11 @@ Command-line usage
     This option is not typically recommended. It will typically lead to only very slight improvements in log likelihood at substantial computational cost.
 
    \-\-omegabysite
-    If using a YNGKP model, then the :math:`\omega_r` value is nearly analogous that obtained using the *FEL* model described by `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_. If using and *ExpCM*, then :math:`\omega_r` has the meaning described in :ref:`ExpCM`. Essentially, we fix all other model / tree parameters and then compare a model that fits a synonymous and nonsynonymous rate to each site to a null model that only fits a synonymous rate; there is evidence for :math:`\omega_r \ne 1` if fitting both nonsynonymous and synonymous rate gives sufficiently better likelihood than fitting synonymous rate alone. See also the ``--omegabysite_fixsyn`` option.
+    If using a YNGKP model, then the :math:`\omega_r` value is nearly analogous to that obtained using the *FEL* model described by `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_. If using an *ExpCM*, then :math:`\omega_r` has the meaning described in :ref:`ExpCM`. Essentially, we fix all other model / tree parameters and then compare a model that fits a synonymous and nonsynonymous rate to each site to a null model that only fits a synonymous rate; there is evidence for :math:`\omega_r \ne 1` if fitting both nonsynonymous and synonymous rate gives sufficiently better likelihood than fitting synonymous rate alone. See also the ``--omegabysite_fixsyn`` option.
+    For an alternative method to determine site-specific :math:`\omega_r`, please see the ``--omega_random_effects_likelihood`` option.
 
    \-\-omegabysite_fixsyn
-    This option is meaningful only if you are using ``--omegabysite``. If you use this option, then we compare a model in which we fit a nonsynonymous rate to each site to a model in which we fit nothing. The synonymous rate is not fit, and so is assumed to be equal to the overall value fit for the tree. According to `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_, in some cases this can yield greater power if there is relatively limited data. However, it comes with the risk of giving spurious results if there is substantial variation in the synonymous substitution rate among sites.
+    This option is meaningful only if you are using ``--omegabysite``. If you use this option, then we compare a model in which we fit a nonsynonymous rate to each site to a model in which we fit nothing. The synonymous rate is not fit, and so is assumed to be equal to the overall value fit for the tree. According to `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_, in some cases this can yield greater power if there is relatively limited data. However, it comes with the risk of giving spurious results if there is substantial variation in the synonymous substitution rate among sites.
 
    \-\-diffprefsbysite
     This option can only be used with *ExpCM* models, **not** with *YNGKP* models.
@@ -114,6 +115,25 @@ Command-line usage
     This option computes an average of each preference across sites (:math:`\pi_a = \frac{1}{L} \sum_r \pi_{r,a}` where :math:`r = 1, \ldots, L`), and then uses these average preferences for all sites.
     This can be used as a control, as it merges all the information in the preferences into a non-site-specific model.
 
+   \-\-omega_random_effects_likelihood
+    If using a YNGKP model, then the :math:`\omega_r` value is nearly analogous to that obtained using the *REL* model described by `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_.
+    If using an *ExpCM*, then :math:`\omega_r` has the meaning described in :ref:`ExpCM`.
+    We compute the posterior probability that :math:`\omega_r \ne 1`, i.e., that :math:`\omega_r > 1` or :math:`\omega_r < 1`, given a distribution of :math:`\omega` across the gene.
+    For an alternative method to determine site-specific :math:`\omega_r`, please see the ``--omegabysite`` option.
+
+    This option requires a gamma-distributed :math:`\omega`.
+    For *ExpCM*, use the ``--gammaomega`` option.
+    For *YNGKP* models, use the *YNGKP_M5* model.
+    To control the number of categories used to compute the posterior probability, see the ``--REL_ncats`` option.
+
+   \-\-REL_ncats
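+    Number of categories used to discretize the fit distribution when calculating the posterior probabilities. More categories lead to slightly longer run time; values of 50-100 are usually adequate.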
+
+    Note that while ``--ncats`` and ``--REL_ncats`` have similar definitions (each gives the number of categories used to discretize a distribution), they apply at different stages of the analysis.
+    ``--ncats`` controls the discretization while the distribution is being fit.
+    ``--REL_ncats`` controls the discretization of the fit distribution while calculating the posterior.
+    The calculation of the posterior is much more computationally efficient than the fitting, so we recommend that ``--ncats`` :math:`\ll` ``--REL_ncats``.
+
    \-\-minbrlen
     All branches with lengths less than this value will be set to this value in the initial starting tree.
     Branches can still end up with lengths less than this after subsequent optimization of this starting tree.
@@ -240,9 +260,42 @@ Here is an example of the first few lines of a file. The entries are tab separat
     127 -0.0088 -0.0006 -0.0010 0.1423  -0.0021 -0.0179 -0.0059 -0.0096 -0.0208 -0.0100 -0.0021 -0.0095 -0.0007 -0.0066 -0.0073 -0.0114 -0.0146 -0.0075 -0.0010 -0.0049 0.1423
     289 -0.0079 -0.0127 -0.0005 -0.0002 -0.0228 -0.0005 -0.0154 -0.0156 -0.0033 -0.0167 -0.0113 -0.0034 -0.0004 -0.0004 -0.0094 -0.0020 -0.0028 -0.0133 -0.0006 0.1391  0.1391
 
-The first column gives the site numbers, subsequent columns give the differential preference (:math:`\Delta\pi_{r,a}`) for each amino acid.
+The first column gives the site number; subsequent columns give the differential preference (:math:`\Delta\pi_{r,a}`) for each amino acid.
 The last column gives the half absolute sum of the differential preferences, :math:`\sum_a |\Delta\pi_{r,a}|`, at each site. This quantity can range from zero to one.
 The sites are sorted with the highest half absolute sum differential preference first.
 
+Gamma-distributed discrete category file
++++++++++++++++++++++++++++++++++++++++++++
+This file has the suffix ``_omegabycategory.csv``, and is created only if using the ``--omega_random_effects_likelihood`` option.
+This file gives the posterior probability of each site falling into each category, as well as the mean :math:`\omega` value of each discretized category.
+These posterior probabilities are computed nearly identically to those obtained using the *REL* model as described in `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_.
+
+Here is an example of the first few lines of a file. The entries are comma separated::
+
+    site,post_probability,omega
+    1,0.2503826180447997,0.0695219697627359
+    2,0.24755166505269052,0.0695219697627359
+    3,0.2526024760622074,0.0695219697627359
+    4,0.2530711698554593,0.0695219697627359
+    5,0.24843828974534077,0.0695219697627359
+
+The ``post_probability`` column gives the posterior probability of that site falling into a given category.
+The sites and omega values are sorted in ascending numerical order.
+
+Site-specific posterior probability file
++++++++++++++++++++++++++++++++++++++++++++
+This file has the suffix ``_posteriorprobabilities.csv``, and is created only if using the ``--omega_random_effects_likelihood`` option.
+This file gives the total posterior probability that each site has :math:`\omega_r > 1`.
+These posterior probabilities are computed nearly identically to those obtained using the *REL* model as described in `Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`_.
+
+Here is an example of the first few lines of a file. The entries are comma separated::
+
+    site,pr(omega > 1)
+    8,0.2541928826887663
+    2,0.2533289672072823
+    6,0.252851860574337
+    9,0.25243889606707554
+
+The ``pr(omega > 1)`` column gives the total posterior probability that the given site is under diversifying selection.
 
 .. include:: weblinks.txt
diff --git a/docs/weblinks.txt b/docs/weblinks.txt
@@ -41,6 +41,7 @@
 .. _`McCandlish and Stoltzfus, Quarterly Review of Biology, 89:225-252`: http://www.ncbi.nlm.nih.gov/pubmed/25195318
 .. _`HKY85`: https://dx.doi.org/10.1007%2FBF02101694
 .. _`Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222`: http://mbe.oxfordjournals.org/content/22/5/1208.full
+.. _`Nielsen and Yang, Genetics, 148; 929-936`: https://www.genetics.org/content/148/3/929
 .. _`AIC`: https://en.wikipedia.org/wiki/Akaike_information_criterion
 .. _`reStructuredText`: http://docutils.sourceforge.net/rst.html
 .. _`Julien Dutheil`: http://kimura.univ-montp2.fr/jdutheil/CMS/index.php/

diff --git a/phydmslib/_metadata.py b/phydmslib/_metadata.py
@@ -1,4 +1,4 @@
-__version__ = '2.3.8'
+__version__ = '2.4.0'
 __author__ = 'the Bloom lab (see https://github.com/jbloomlab/phydms/contributors)'
 __url__ = 'http://jbloomlab.github.io/phydms'
 __author_email__ = 'jbloom@fredhutch.org'

diff --git a/phydmslib/constants.py b/phydmslib/constants.py
@@ -45,6 +45,8 @@
     `CODON_NT_COUNT` (`numpy.ndarray` of int, shape `(N_NT, N_CODON)`)
         Element `[w][x]` gives the number of occurrences of nucleotide
         `w` in codon `x`.
+    `CODONSTR_TO_AASTR` (dict)
+        Maps each codon string to its amino-acid string (stop codons map to `'*'`).
     `STOP_CODON_TO_NT_INDICES` (`numpy.ndarray` of float, shape `(N_STOP, 3, N_NT)`)
         Element `[x][p][w]` is 1.0 if codon position `p` is nucleotide `w`
         in stop codon `x` and 0.0 otherwise.
@@ -77,20 +79,22 @@
 STOP_POSITIONS = numpy.ones((3, N_NT), dtype = 'float')
 CODON_TO_INDEX = {}
 INDEX_TO_CODON = {}
+CODONSTR_TO_AASTR = {}
 CODON_TO_AA = []
 i = 0
 for nt1 in sorted(NT_TO_INDEX.keys()):
     for nt2 in sorted(NT_TO_INDEX.keys()):
         for nt3 in sorted(NT_TO_INDEX.keys()):
             codon = nt1 + nt2 + nt3
             aa = str(Bio.Seq.Seq(codon).translate())
+            CODONSTR_TO_AASTR[codon] = aa
             if aa != '*':
                 CODON_TO_INDEX[codon] = i
                 INDEX_TO_CODON[i] = codon
                 CODON_TO_AA.append(AA_TO_INDEX[aa])
                 i += 1
             else:
-                STOP_CODON_TO_NT_INDICES.append(numpy.zeros((3, N_NT), 
+                STOP_CODON_TO_NT_INDICES.append(numpy.zeros((3, N_NT),
                         dtype='float'))
                 STOP_CODON_TO_NT_INDICES[-1][0][NT_TO_INDEX[nt1]] = 1.0
                 STOP_CODON_TO_NT_INDICES[-1][1][NT_TO_INDEX[nt2]] = 1.0