SVETKA, a program for analysis of different alignments
About this project

Introduction

Nowadays, a huge number of protein and nucleic acid sequences are known. One of the main tasks in understanding this "universe" is extracting evolutionarily related families of proteins, protein domains, RNAs, etc., and a widely used tool for studying such families is the construction and analysis of multiple alignments of protein and nucleic acid sequences.
In our opinion, there is a lack of software for the analysis of multiple alignments. The most widely used kind of such software is phylogeny reconstruction programs. However, those programs give no information about the relation between particular branches of a reconstructed tree, on the one hand, and particular features of the sequences, on the other. Such a feature can look like "Leucine in position 362 of the alignment of the entire family" and can, in many cases, serve as a "decision rule" for distinguishing the sequences on the two sides of a tree branch. In such a case we say that the position supports the branch.
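The idea of a supporting position can be illustrated with a minimal Python sketch. It assumes the strictest possible reading of the "decision rule": a position supports a branch when no residue occurs on both sides of it. The function name `supports_branch` and the toy data are hypothetical, and the criterion actually used by the program may be softer.

```python
def supports_branch(column, side_a, side_b):
    """Return True if the residues in `column` separate the two sides of a branch.

    column         -- dict mapping sequence name to its residue at one position
    side_a, side_b -- sets of sequence names on the two sides of the branch
    Assumption: "supports" means that no residue is shared between the sides.
    """
    residues_a = {column[name] for name in side_a}
    residues_b = {column[name] for name in side_b}
    return residues_a.isdisjoint(residues_b)


# Toy position: leucine on one side of the branch, other residues on the other.
column = {"seq1": "L", "seq2": "L", "seq3": "V", "seq4": "I"}
print(supports_branch(column, {"seq1", "seq2"}, {"seq3", "seq4"}))  # True
print(supports_branch(column, {"seq1", "seq3"}, {"seq2", "seq4"}))  # False
```

In the first call "Leucine at this position" is a valid decision rule for the branch; in the second call leucine occurs on both sides, so the position does not support that branch.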
The service described here is an attempt to fill this gap. Given a multiple alignment as input, the program detects conserved, variable, and diagnostic positions. Comparing the alignment with an input tree (the tree may be entered by the user or reconstructed with the WPGMA algorithm), the program detects the supporting positions of the alignment for every branch of the tree. Using the alignment, various characteristics of every branch are computed. The service also includes a simple (maximum parsimony) algorithm for reconstructing ancestral sequences at every inner node of the tree.

A piece of history

This work began in a rainy October of 2003, the year when all of Russia celebrated the 300th anniversary of the Northern Palmyra, the City of St. Petersburg.
Thanks to the initiative of Sergei A. Spirin and Andrew V. Alexeevski, work began on one of the problems of multiple sequence alignments: the problem of classifying positions.
In the course of this research, a program that calculates average distances between sequences in such alignments was written. In addition, correlations between the whole alignment and each of its positions were calculated. We decided that every alignment has conserved, diagnostic, and variable positions. Then we tried to determine whether the correlation coefficient can reliably indicate whether a given position is diagnostic or variable. But either the idea was bad or the data were insufficient: we could not find a threshold of the correlation coefficient that would exactly separate diagnostic positions from variable ones.
Then we tried to take into account some peculiarities of alignments. Thus a procedure for "weighting" sequences appeared (the weight of a sequence is a measure of the reliability of the information provided by that sequence).
Later, we began comparing the alignments with phylogenetic trees. This also gave some interesting results.
And when all this work was completed, it became evident that the program could be of interest to researchers. And so this site appeared.

What does this program do

In fact, this program can only analyse alignments. But how well it does it!
Firstly, different substitution matrices can be used. All the alignments stored in the database are analysed with the BLOSUM62 matrix. But if one wants to use some other matrix (e.g., PAM160), then "s'il vous plaît", as they say in France.
Secondly, alignments in two different formats, .fasta and .aln, can be analysed. These formats are widely used all around the world. The files stored in the COGs database are in .aln format.
Then, many parameters are calculated. According to their values, the alignment positions are classified. No position is left out: each one is assigned to one of 5 defined classes (or types). See the example.
The phylogenetic tree is either supplied by the user or built by the program. The tree is constructed by the WPGMA method (weighted pair group method with arithmetic mean). Of course, this tree, like the best birches or oaks, has lots of branches (see example). Although all branches may seem the same (or almost the same), they are rather different. It is necessary to separate strong branches (which divide the alignment into 2 well-distinguished parts) from weak ones (e.g., when sequences on one side of the branch are similar to those in the other part of the alignment). It's worth mentioning that this program provides such a separation, isn't it?
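For readers unfamiliar with WPGMA, here is a minimal Python sketch of the general method, not of this program's code. It assumes that a table of pairwise distances between sequences is already available (the way the program derives distances from an alignment is not shown); the function name `wpgma` and the toy distances are hypothetical. At each step the two closest clusters are merged, and the distance from the new cluster to any other cluster is the plain average of the two old distances, regardless of cluster sizes.

```python
def wpgma(names, dist):
    """Build a tree topology by WPGMA.

    names -- list of distinct leaf labels
    dist  -- dict mapping frozenset({a, b}) to the distance between leaves a and b
    Returns the tree as nested 2-tuples of leaf labels (branch lengths omitted).
    """
    clusters = list(names)
    d = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        merged = (a, b)
        rest = [c for c in clusters if c not in (a, b)]
        # WPGMA update: simple average of the two old distances.
        for c in rest:
            d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = rest + [merged]
    return clusters[0]


# Toy distances between four sequences (made-up numbers).
toy = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 10, frozenset(("C", "D")): 8,
}
print(wpgma(["A", "B", "C", "D"], toy))  # ('D', ('C', ('A', 'B')))
```

Every inner pair of parentheses in the result corresponds to one branch of the tree, that is, to one split of the sequences into two groups that can then be judged strong or weak.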
Also, while analysing an alignment, the program obtains a dozen supplementary results. Though they are not directly related to the purpose of this work, they can be useful in other research. It's a trifle, but a pleasant one.

What does this site present

This site includes several pages.
Firstly, there is the general list of alignments, which contains information about all alignments in the database. For every alignment one can see the total number of alignment positions and the number of those that can be used for statistical analysis (the second and third columns, respectively). Then the number of sequences in the alignment (the fourth column) and the quality of the alignment (how well the alignment conforms to the so-called "rule of four sequences") are indicated.
For every alignment several HTML pages are provided. The first one contains the phylogenetic tree corresponding to the alignment considered, and the alignment itself. Every position of the alignment is coloured according to its type. Positions can be conserved, diagnostic, or variable. In addition, so-called "gap" positions (those that are not present in all sequences of the alignment) are distinguished.
The second page (accessible from the first page) contains the full list of alignment positions, including such characteristics of every position as its type, average distance, and correlation between the position and the whole alignment.
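One possible reading of these two per-position characteristics is sketched below in Python. Everything here is an assumption made for illustration: distances are plain mismatch counts rather than substitution-matrix scores, sequence weights are ignored, and the names `pearson` and `position_profile` are hypothetical. For every pair of sequences the sketch takes the distance at the chosen position and the distance over the whole alignment, then reports the mean of the former and the Pearson correlation between the two.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # e.g. a fully conserved position has no variance
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def position_profile(alignment, pos):
    """Return (average distance, correlation with the whole alignment) for one position.

    alignment -- list of aligned sequences (strings of equal length)
    pos       -- 0-based index of the position
    Assumption: distance is 1 for a mismatch and 0 for a match.
    """
    pairs = list(combinations(alignment, 2))
    whole = [sum(a != b for a, b in zip(s, t)) / len(s) for s, t in pairs]
    column = [float(s[pos] != t[pos]) for s, t in pairs]
    return sum(column) / len(column), pearson(column, whole)


toy_alignment = ["ACDE", "ACDF", "GCHK", "GCHL"]
print(position_profile(toy_alignment, 0))  # position that follows the two groups
print(position_profile(toy_alignment, 1))  # fully conserved position
```

In this toy alignment position 0 differs exactly where the sequences as a whole differ most, so its correlation is close to 1; position 1 is identical in all sequences, so its average distance is 0 and the correlation is undefined.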
The third page (also accessible from the alignment page) contains information about the tree branches. For every branch some measures are given; a more detailed description of these measures is given here. From this page you can learn which branches are worthy of your attention.
Finally, we tried to predict the sequences at the inner nodes of the tree (i.e. possible ancestors). This prediction was made by a parsimony algorithm. As this was not the main purpose of the work, the prediction may not be accurate enough. But some main peculiarities of different protein sequence sets are distinguished.
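To show what a parsimony prediction for an inner node looks like, here is a minimal Python sketch of the bottom-up pass of Fitch's algorithm for a single alignment position. Fitch's method is used here only as a well-known example of maximum parsimony; the page does not say which variant the program implements. The function name `fitch`, the nested-tuple tree format, and the toy data are hypothetical.

```python
def fitch(tree, column):
    """Bottom-up Fitch pass for one alignment position.

    tree   -- a leaf name (str) or a 2-tuple of subtrees
    column -- dict mapping leaf name to its residue at this position
    Returns (set of candidate residues at the root of `tree`,
             minimal number of substitutions within `tree`).
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, column)
    right_set, right_changes = fitch(right, column)
    common = left_set & right_set
    if common:
        # The children agree on at least one residue: no extra substitution.
        return common, left_changes + right_changes
    # The children disagree: keep both options and count one substitution.
    return left_set | right_set, left_changes + right_changes + 1


tree = (("s1", "s2"), ("s3", "s4"))
column = {"s1": "L", "s2": "L", "s3": "V", "s4": "L"}
print(fitch(tree, column))  # ({'L'}, 1)
```

Here leucine is predicted for the common ancestor, with a single substitution (to valine in s3) needed to explain the observed residues. A full reconstruction would repeat this for every position and add a top-down pass to fix one residue at each inner node.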
We hope that, thanks to this information, further analysis of proteins and protein alignments will be more complete and better grounded.

If you find mistakes or interesting facts, please tell us!