Nowadays, there is a huge number of known sequences of proteins and
nucleic acids. One of the main tasks in understanding this "universe" is
extracting evolutionarily related families of proteins, protein domains,
RNAs, etc.; a widely used tool for studying such families is the
construction and analysis of multiple alignments of protein and nucleic
acid sequences.
In our opinion, there is a lack of software for the analysis of multiple
alignments. The most widely used kind of such software is phylogeny
reconstruction programs. However, those programs give no information
about the relations between particular branches of a reconstructed tree,
on the one hand, and particular features of the sequences, on the other.
Such a feature may look like "Leucine in position 362 of the alignment of
the entire family" and can, in many cases, serve as a "decision rule"
that distinguishes the sequences on the two sides of a tree branch. In
such a case we say that the position supports the branch.
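The notion of a supporting position can be illustrated with a minimal
sketch. It assumes the simplest possible criterion (the residues seen on
the two sides of a branch do not overlap); the function name, the data
layout and the criterion itself are illustrative assumptions, not the
rule actually used by the service.

```python
def position_supports_branch(column, side_a, side_b):
    """Return True if one alignment column separates the two sides of a branch.

    column -- dict mapping sequence name -> residue at this position
    side_a, side_b -- sequence names on either side of the branch
    (hypothetical layout, chosen for illustration only)
    """
    residues_a = {column[name] for name in side_a}
    residues_b = {column[name] for name in side_b}
    # A residue shared by both sides cannot serve as a decision rule.
    return residues_a.isdisjoint(residues_b)


# Toy column: leucine on one side of the branch, other residues on the other.
column_362 = {"seq1": "L", "seq2": "L", "seq3": "I", "seq4": "V"}
good_split = position_supports_branch(column_362, ["seq1", "seq2"], ["seq3", "seq4"])
bad_split = position_supports_branch(column_362, ["seq1", "seq3"], ["seq2", "seq4"])
```

Here the column separates {seq1, seq2} from {seq3, seq4}, but not
{seq1, seq3} from {seq2, seq4}, because leucine occurs on both sides of
the second split.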
The described service is an attempt to fill this gap. Given a multiple
alignment as input, the program detects conserved, variable, and
diagnostic positions. Comparing the alignment with an input tree (the
tree may be entered by the user or reconstructed with the WPGMA
algorithm), the program detects the supporting positions of the alignment
for every branch of the tree. Using the alignment, various
characteristics of every branch are computed. The service also includes a
simple (maximum parsimony) algorithm for reconstructing ancestral
sequences at every inner node of the tree.
A piece of history
This work began in a rainy October of 2003, the year when all of Russia
celebrated the 300th anniversary of the Northern Palmyra, the City of
St. Petersburg. With the light hand of Sergei A. Spirin and Andrew V.
Alexeevski, work began on one of the problems of multiple sequence
alignments: the problem of classifying positions.
During this research, a program that calculates average distances between
sequences in such alignments was written. Moreover, correlations between
the whole alignment and each of its positions were calculated. We decided
that every alignment has conservative, diagnostic, and variable positions.
Then we tried to determine whether the correlation coefficient can
reliably indicate whether a given position is diagnostic or variable.
But either the idea was bad or the material was insufficient: we could
not derive a threshold of the correlation coefficient that would exactly
separate diagnostic positions from variable ones.
Then we tried to take into account some peculiarities of alignments.
Thus a procedure of "weighting" sequences appeared (the weight of a
sequence is a kind of measure of the reliability of the information
given by this sequence).
Later, comparison of the alignments with phylogenetic trees began.
It also gave some interesting results.
And when all this work was completed, it was evident that this program
could be of interest to researchers. And this site
What does this program do
In fact, this program can only analyse alignments. But how it can
First, different substitution matrices can be used. All the alignments
placed in the base are analysed with the BLOSUM62 matrix. But if one
wants to use some other matrix (e.g., PAM160): "s'il vous plait",
as they say in France.
Second, alignments in two different formats, .fasta and .aln, are
analysed. These formats are widely used all around the world. The files
placed in the COGs database are in the .aln format.
Then, many parameters are calculated. According to their values, the
alignment positions are classified. No position is left out: each of
them is assigned to one of 5 defined classes (or types). Look it on the
The phylogenetic tree is either given by the user or built by the
program. The tree is constructed by WPGMA (weighted pair group method
with arithmetic mean). Of course, this tree, like the best birches or
oaks, has lots of branches (see example). Although all branches seem to
be the same (or almost the same), they are rather different. It is
necessary to separate strong branches (which divide the alignment into
2 well-distinguished parts) from weak ones (e.g., when sequences on one
side of a branch are similar to those in the other part of the
alignment). It's worth mentioning that this program gives such a
separation, isn't it?
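For readers unfamiliar with WPGMA, the tree-building step mentioned above
can be sketched as follows. This is a minimal illustration of the textbook
algorithm, not the code of the service: the function name, the
frozenset-keyed distance table and the nested-tuple tree representation
are assumptions, and branch lengths are omitted.

```python
from itertools import combinations


def wpgma(names, dist):
    """Build a tree topology by WPGMA.

    names -- list of unique sequence names
    dist  -- dict mapping frozenset({a, b}) -> distance between a and b
    Returns the tree as nested pairs (tuples) of names.
    """
    clusters = list(names)
    d = dict(dist)  # work on a copy so the caller's table is untouched
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            # WPGMA rule: plain average of the two old distances,
            # regardless of how many sequences each cluster holds.
            d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = rest + [merged]
    return clusters[0]


toy_dist = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 10, frozenset(("C", "D")): 8,
}
toy_tree = wpgma(["A", "B", "C", "D"], toy_dist)
```

On this toy table A and B are joined first, then C, then D. The plain
average in the update step is what distinguishes WPGMA from UPGMA, which
weights the average by cluster size.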
Also, while analysing an alignment, the program obtains a dozen
supplementary results. Though they are not directly related to the
purpose of this work, they can be useful in other research. It's a
trifle, but a pleasant one.
What does this site present
This site includes several pages.
First, there is the general list of alignments, which contains
information about all alignments in this base. For every alignment one
can see the number of all alignment positions and the number of those
that can be used for statistical analysis (the second and the third
columns, respectively). Then the number of sequences in the alignment
(the fourth column) and the quality of the alignment (how well the
alignment conforms to the so-called "rule of four sequences") are
indicated.
For every alignment several HTML pages are provided. The first one
contains information about the phylogenetic tree corresponding to the
alignment considered, and the alignment itself. Every position of the
alignment is coloured according to its type. Positions can be
conservative, diagnostic, or variable. In addition, so-called "gap"
positions (those that are not present in all sequences of the alignment)
are distinguished.
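The two simplest position types can be illustrated with a short sketch.
It only separates "gap" columns and fully conserved columns from the
rest; telling diagnostic positions from variable ones needs the tree and
the other parameters described on this site, so that step is deliberately
left out. The function name, the gap characters and the label strings are
assumptions made for the example.

```python
def classify_column(column, gap_chars="-."):
    """Give a rough label to one alignment column (a string of residues)."""
    if any(ch in gap_chars for ch in column):
        # The position is not present in all sequences.
        return "gap"
    if len(set(column.upper())) == 1:
        # The same residue occurs in every sequence.
        return "conserved"
    return "non-conserved"


toy_alignment = ["MKL-A", "MRLTA", "MKLSA"]
columns = ["".join(seq[i] for seq in toy_alignment)
           for i in range(len(toy_alignment[0]))]
labels = [classify_column(col) for col in columns]
```

For the toy alignment the fourth column is labelled "gap", the second
"non-conserved", and the remaining three "conserved".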
The second page (accessible from the first page) contains the full list
of alignment positions, including such characteristics of every position
as its type, average distance, and correlation between the position and
the whole alignment.
The third page (also accessible from the alignment page) contains
information about tree branches. For every branch some measures are
given; a more detailed description of these measures is placed here.
From this page you can learn which branches are worthy of your
Finally, we tried to predict the sequences at the inner nodes of the tree
(i.e. possible ancestors). This prediction was made by a parsimony
algorithm. As this was not the main purpose of the work, the prediction
may not be accurate enough.
But some main peculiarities of different protein sequence sets are
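As an illustration of parsimony-based ancestor prediction, here is a
minimal sketch of the bottom-up pass of Fitch's algorithm for a single
alignment position. Fitch's method is assumed here as a standard example
of maximum parsimony; the text does not say which variant the service
uses, and the function name and the nested-tuple tree format are
likewise illustrative.

```python
def fitch_sets(tree, leaf_state):
    """Bottom-up Fitch pass for one alignment position.

    tree       -- a leaf name (str) or a pair (left, right) of subtrees
    leaf_state -- dict mapping leaf name -> residue at this position
    Returns (set of candidate residues at this node, substitutions needed).
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    left_set, left_cost = fitch_sets(tree[0], leaf_state)
    right_set, right_cost = fitch_sets(tree[1], leaf_state)
    common = left_set & right_set
    if common:
        # The children agree on at least one residue: no extra substitution.
        return common, left_cost + right_cost
    # The children disagree: keep all candidates and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


toy_tree = (("s1", "s2"), ("s3", "s4"))
toy_states = {"s1": "L", "s2": "L", "s3": "I", "s4": "L"}
root_candidates, substitutions = fitch_sets(toy_tree, toy_states)
```

In this toy case the root is predicted to carry leucine, with a single
substitution on the way to s3. A full reconstruction would repeat this
for every position and add a top-down pass that picks one residue per
inner node.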
We hope that, thanks to this information, further analysis of proteins
and protein alignments will be more complete and better grounded.