PQ algorithm

Tree score

Let us have a multiple alignment of protein sequences and an unrooted binary phylogenetic tree with leaves labeled with the sequences of the alignment. We suppose that the number of sequences is greater than 3, so the tree is not trivial. Let us denote aic the letter (i.e., an amino acid residue or the gap symbol) at the c-th column of the i-th sequence of the alignment. For each four-element subset {i,j,k,l} of the sequences of the alignment there is exactly one division onto two two-element subsets that is supported by branches of the tree. We will always suppose that this division is {i,j}∪{k,l} that means there is at least one branch of the tree such that i and j i s at one side of this branch and k and l are at another side. Let us fix also an amino acid substitution matrix S(a,b), for example BLOSUM62.

Tree score Q is given by the following formula:
Q = \sum_c \sum_q Q_{cq}
where c runs over all columns of the alignment, q runs over all quartets {i, j, k, l} of sequences such that aic, ajc, akc, alc are all residues (not gaps), and Qcq (called a position-quartet score) is given by the following formula:
Q_c = \max(0, S(a_{ic},a_{jc}) - X_{cq}) + \max(0, S(a_{kc},a_{lc}) - X_{cq})
where
X_{cq} = \max (S(a_{ic},a_{kc}), S(a_{ic}, a_{lc}), S(a_{jc},a_{kc}), S_(a_{jc}, a_{lc}))

If the matrix S(a,b) is a diagonal one, with 1 at the main diagonal and 0 at other positions, the position-quartet score Qcq is equal to:

Note that the position-quartet score does not depend on branch lengths or tree root position. That is why the program outputs an unrooted tree without branch lenghts.

Search strategy

With a given alignment, a problem arises to find the tree with the highest score. An exact solution needs factorial time, so we used a number of standard heuristics to find a tree with a score close to the best one.

Single growing

Fix an order of sequences of the alignment. For the first four sequences, find a tree with the best score (that needs checking three trees only). Add the fifth sequences and find a best tree with five leaves such that its subtree with first four leaves is the best one found at the first step. Continue adding sequences in an analogous way until obtaining a tree corresponding to the entire set of sequences.

Multiple growing

Repeat single growing several times using different orders (random shuffling) of sequences. Among the obtained trees, choose one with the highest score.

NNI hill climbing

Take the result of a single or multiple growing and perform all possible nearest-neighbor interchanges (NNI), one by one. If the current NNI gives a tree with a greater score, take that tree and repeat the procedure. Perform until all NNIs of the current tree give trees with less scores than the score of the current tree.

NNI Monte Carlo search

Fix an initial temperature T (1000 by default). Take the result of a single or multiple growing and perform all possible NNI, one by one. If the current NNI gives a tree with a greater score, take that tree and repeat the procedure. If the current NNI gives a tree with a less score, take that tree with the probability P=exp(K(QoldQnew)/T), where K=12000000. After each step, reduce T by T/N, where N is a parameter, by default N=1000. Stop when T reaches zero. Output the tree with the highest score among all tested.

Substitution matrices and gaps

Four substitution matrices S(a,b) are available. One is DNAFULL.txt for DNA/RNA alignments. The default one is BLOSUM62. The so-called "Special" matrix (FungiMatrix.txt) is obtained from a number of alignments of fungal proteins with known phylogeny. In average at eukaryotic protein sets it gives slightly better results comparing with BLOSUM62. The fourth matrix is the diagonal one.

Attempts to score gaps lead to worse results in most cases. In the command-line version of the program it is possible to switch scoring gaps on.


To the main page