Structural alphabet

The goal of defining a structural alphabet is to encode 3D fragments of the protein backbone, so that a 3D protein structure can be represented as a series of structural alphabet letters. Each letter represents the pattern profile of a class of backbone fragments (five residues long) derived from the pair database; a protein structure of L residues is therefore described by a structural alphabet sequence of L-4 letters. We developed a nearest-neighbor clustering (NNC) algorithm to cluster 225,523 3D protein fragments into 23 groups, each represented by its own structural alphabet letter. We found that these 23 letters can represent the profiles of most of the 3D fragments and can be roughly divided into five categories: helix letters (A, Y, B, C, and D), helix-like letters (G, I, and L), strand letters (E, F, and H), strand-like letters (K and N), and others. The 3D shapes of the representative segments within a category are similar: for example, the segments of the helix letters resemble one another, as do those of the strand letters. These 3D fragment shapes and their structural alphabet letters are shown in the figure below.
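The L-4 count and the category grouping described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names fragment_windows, category_of, and category_counts are hypothetical, and any letter not listed in the four named categories is simply treated as "others", since the text does not enumerate that group.

```python
from collections import Counter

FRAGMENT_LENGTH = 5  # each letter encodes a backbone fragment five residues long

# Category membership exactly as listed in the text; every letter not named
# here falls into "others" (an assumption, the text does not list that group).
CATEGORY_BY_LETTER = {
    **dict.fromkeys("AYBCD", "helix"),
    **dict.fromkeys("GIL", "helix-like"),
    **dict.fromkeys("EFH", "strand"),
    **dict.fromkeys("KN", "strand-like"),
}


def fragment_windows(residues, length=FRAGMENT_LENGTH):
    """Return every run of `length` consecutive residues, in chain order.

    A chain of L residues yields L - length + 1 windows, i.e. L - 4 for
    five-residue fragments; a chain shorter than `length` yields none.
    """
    return [residues[i:i + length] for i in range(len(residues) - length + 1)]


def category_of(letter):
    """Map one structural alphabet letter to its coarse category."""
    return CATEGORY_BY_LETTER.get(letter, "others")


def category_counts(sa_sequence):
    """Count how many letters of a structural alphabet sequence fall in each category."""
    return Counter(category_of(letter) for letter in sa_sequence)
```

For instance, a chain of 10 residues passed to fragment_windows gives 6 five-residue windows, one per letter of the resulting structural alphabet sequence. Assigning a letter to each window would require the 23 cluster profiles themselves, which this sketch does not attempt to reproduce.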