Overview of the algorithm

The list of 'non-polar groups' in proteins and nucleic acids.

An 'indivisible unit' of the hydrophobic interaction is an atomic group that includes one carbon or sulfur atom and covalently bound hydrogen atoms. In proteins and nucleic acids those groups are -CH3, -CH2-, -CH=, -SH, and -S- groups.

The program uses two variants of the list of non-polar groups. The extended list includes all such groups in protein and nucleic acids, the strict list includes those groups of extended one, whose main atom is covalently bound only with another carbon or sulfur atoms (and, thus, is not covalently bound with any polar atom).

The center of a non-polar group is the center of the carbon or sulfur atom. Hydrogen atoms are not considered in calculations.

The graph of interactions of non-polar groups in a 3D structure.

Each non-polar group corresponds to one vertex of the graph. Two vertices are connected by an edge (and are suppose to be in interaction) in the case if (a) the distance between the centers of the groups does not exceed the threshold value d; (b) the interaction of these groups is not prevented by any other (polar or non-polar) group. We say that a group C prevents the interaction between non-polar groups A and B, if the ball with the center C and the radius RC includes the intersection of the balls with the centers A and B and the radii RA and RB, respectively. In the program all radii are equal to 2.7 Å, which corresponds to the minimal possible distance from a carbon-generated non-polar group to the oxygen atom of a water molecule. The value of d is a user parameter. Maximal vreasonable value of d is 5.4 Å, which corresponds to the sum of RA and RB radii in the case of carbon non-polar groups (roughly speaking, the hydrophobic interaction occurs if a water molecule cannot be placed between two non-polar groups).

Edges of the graph reflect two sufficiently different types of relations between atomic groups: (i) 'fixed' groups (including covalently bound ones); (ii) groups that are nearby located due to the hydrophobic interactions. Two non-polar groups A and B are 'fixed', if the distance between them is fixed due to chemical bonds, or any other interaction with energy exceeding sufficiently the free energy of the hydrophobic interaction between the groups. The examples of fixed non-polar groups are two groups, the central atoms of which are covalently bound with some third atom, or any two groups of aromatic ring. In the described algorithm and program these two different types of edges are not distinguished.

The graph of interactions of non-polar groups is denoted by Γ.

The definition of a (K,L)-cut of the graph Γ

The 1-neighborhood of a subgraph Δ of the graph Γ is the subgraph Δ' constructed from Δ by adding all edges that have at least one vertex belonging to Δ. The L-neighborhood is produced by L iterations of 1-neighborhood. A connected subgraph Δ with ≤K edges is called a (K, L)-cut, if the subgraph obtained by removing the edges of Δ from its L-neighborhood has two or more connected components (Fig. 1).

Fig. 1. Connected 2-edges subgraphs (thick lines) representing (a) (2, 1)-cut; (b) non-(2, 1)-cut. Vertices of 1-neighbourhood of the subgraphs are in dark grey. In (a), after the removal of thick edges from 1-neighbourhood the obtained subgraph is decomposed into 2 connected components. In (b), the analogous procedure gives a connected subgraph.

The algorithm of exhaustion of connected subgraphs of ≤ K edges

Assign indexes 1, 2, ... to all edges of Γ the n-th edge will be denoted by En. Let Δ be a connected subgraph of t<K edges, En be the edge of Δ with the maximal index n. The subgraph Δ' obtained by adding an edge Ex to Δ is called a 1-extension of Δ if (i) Ex belongs to the 1-neighbourhood of Δ; (ii) either x>n, or x<n and the subgraph obtained from Δ' by deletion of the edge En is not connected.

The algorithm starts with the exhaustion of all subgraphs of one edge. Then for each subgraph Δ all 1-extensions of Δ are considered. The procedure iterates K–1 times. It can be proved that each connected subgraph of ≤K edges is considered by the algorithm exactly ones.

The algorithm for the hydrophobic clusters detection

The suggested algorithm tries to solve the problem of the detection of hydrophobic clusters corresponding to spatial areas occupied exclusively or mainly by non-polar groups of atoms, in a given 3D structure of protein or macromolecular complex. It is rather difficult to give a strict definition of hydrophobic cluster, because of the complexity of configuration of non-polar groups in structures. Detection of hydrophobic clusters can be based on the fact that each non-polar group has many interactions inside a hydrophobic cluster, and at the same time between two hydrophobic clusters the interaction is weak. Detection of hydrophobic clusters in a planar variant is illustrated in Fig. 2.

Fig. 2. The planar illustration of (K,L)-cut algorithm. (a) The initial distribution of non-polar groups. (b) The graph of interactions of non-polar groups. (c) All (1,1)-cuts in the graph. The edges of (1,1)-cuts are thick ones. (d) The graph after removal of all (1,1)-cuts. An example of one (2,1)-cut is presented (thick edges). The clusters, found after the removal of (1,1)-cuts and one (2,1)-cut, are subscribed.

The algorithm realizes the above mentioned intuitive concept concerning a hydrophobic cluster as follows: (i) weak interactions between the hydrophobic clusters (yet not found) are searched, in terms of graph of interactions these weak interactions correspond to (K, L)-cuts; (ii) the hydrophobic clusters are detected as components obtained after the removal of weak interactions, in terms of graph these components are connected subgraphs. The parameters of the algorithm are the integers K, L used in the definition of (K, L)-cut, the list of non-polar groups, the threshold for distance between interacting non-polar groups d, and the minimal number m of non-polar groups in a found hydrophobic cluster.

The algorithm works as follows. At the first step the graph Γ of interactions of the non-polar groups is constructed (see Fig. 2b). At the second step all (K, L)-cuts in the graph Γ are found (Fig. 2c). If a subgraph with t<K edges is a (K, L)-cut, then larger subgraphs including this (K, L)-cut are not considered: the process of connected subgraphs exhaustion (see above) stops extending a subgraph if it is proved to be a (K, L)-cut. At the third step all edges of found (K, L)-cuts are removed from the graph (Fig. 2d). At the fourth step the connected components of the obtained graph are found. The vertices of each connected component that contains not less than m non-polar groups form a hydrophobic cluster. For every hydrophobic cluster the additional parameters (see further) are calculated. The found hydrophobic clusters, ordered in accordance to the number of non-polar groups, form the output of the algorithm.

For graphs of interacting non-polar groups the algorithm is linear in the number N of atoms in a given structure, because the number of interactions of a non-polar group with other groups is obviously restricted. The constant in this linear function is rather big and depends on K exponentially. Fortunately, only small K=1, 2, 3 are expected to be sufficient for reasonable hydrophobic cluster detection.

In the program realization the values K=1, L=1, m=3 are fixed.

Parameters of hydrophobic clusters

The following parameters are calculated for each hydrophobic cluster.
(i) The number of non-polar groups in the cluster.
(ii) The mean number of interactions of one non-polar group in the cluster. This parameter indirectly characterizes the form of the cluster. More compact packing of hydrophobic groups, and the form of the cluster closer to an ideal ball are characterized by a higher value of mean number of interactions of a non-polar group.
(iii) Semiaxes of the ellipsoid of inertia of the cluster. In calculating the ellipsoid, the groups of the cluster are considered as material points in space, with unitary masses. Ratios of semiaxes reflex the spatial form of the cluster. For example, if all three semiaxes are approximately equal, then the form of the cluster can be assumed to be close to a ball.

Strict list

Residue Atom (in PDB notation)
ALA CB
ARG CB
ARG CG
ASN CB
ASP CB
CYS CB
CYS SG
GLU CB
GLU CG
GLN CB
GLN CG
HIS CB
HIS CG
HIS CD2
HIS CE1
ILE CB
ILE CG1
ILE CD1
ILE CG2
LEU CB
LEU CG
LEU CD1
LEU CD2
LYS CB
LYS CG
LYS CD
MET CB
MET CG
MET SD
MET CE
PHE CB
PHE CG
PHE CD2
PHE CE2
PHE CZ
PHE CE1
PHE CD1
PRO CB
PRO CG
THR CG2
TRP CB
TRP CG
TRP CD2
TRP CE3
TRP CZ3
TRP CH2
TRP CZ2
TRP CE2
TYR CB
TYR CG
TYR CD2
TYR CE2
TYR CE1
TYR CD1
VAL CB
VAL CG1
VAL CG2
N C2* N is for any DNA nucleotide
T C5M
C C5

Extended list

Residue Atom (in PDB notation)
X C X is for any amino acid residue
X CA
ALA CB
ARG CB
ARG CG
ARG CD
ARG CZ
ASN CB
ASN CG
ASN ND2
ASP CB
ASP CG
CYS CB
CYS SG
GLU CB
GLU CG
GLU CD
GLN CB
GLN CG
GLN CD
HIS CB
HIS CG
HIS CD2
HIS CE1
ILE CB
ILE CG1
ILE CD1
ILE CG2
LEU CB
LEU CG
LEU CD1
LEU CD2
LYS CB
LYS CG
LYS CD
LYS CE
MET CB
MET CG
MET SD
MET CE
PHE CB
PHE CG
PHE CD2
PHE CE2
PHE CZ
PHE CE1
PHE CD1
PRO CB
PRO CG
PRO CD
SER CB
THR CB
THR CG2
TRP CB
TRP CG
TRP CD2
TRP CE3
TRP CZ3
TRP CH2
TRP CZ2
TRP CE2
TRP CD1
TYR CB
TYR CG
TYR CD2
TYR CE2
TYR CZ
TYR CE1
TYR CD1
VAL CB
VAL CG1
VAL CG2
N C5* N is for any DNA nucleotide
N C4*
N C3*
N C2*
N C1*
T C2
T C4
T C5
T C5M
T C6
C C2
C C4
C C5
C C6
A C8
A C5
A C6
A C2
A C4
G C8
G C5
G C6
G C2
G C4