Alternate Title

Widespread occurrence of non-sequential alignments and the implication for protein structure classification


Valentin Ilyin


Slava S. Epstein, Kostia Bergman, Eric J. Stewart, Richard C. Deth

Date of Award


Date Accepted


Degree Grantor

Northeastern University

Degree Level


Degree Name

Doctor of Philosophy

Department or Academic Unit

College of Arts and Sciences. Department of Biology.


bioinformatics, non-sequential alignments, PDB, protein structure comparison, structure alignment, TOPOFIT

Subject Categories

Proteins - Classification - Databases


Biology | Databases and Information Systems


Until recently, non-sequential alignments were poorly recognized and the significance of their existence remained unresolved. Here I present findings from a comprehensive study revealing that 68% of all significant alignments between CATH domains are non-sequential. Importantly, these relationships are ignored in hierarchical classification systems such as CATH, which therefore provide an inaccurate view of shared protein space. Although CATH is the most widely used structure-based protein classification system in the scientific community, the database does not account for non-sequential alignments. These alignments occur when the polypeptide chains of two proteins do not align in the typical N- to C-terminal order; instead, the structural homology arises from structurally equivalent sequence elements located in different regions of the two proteins. More specifically, if both chains were aligned by primary sequence, one chain would have to be reshuffled to arrange the shared regions sequentially with respect to the other chain. This result strengthens the emerging view that protein space is continuous and multidimensional, rather than a set of discrete, separate entities as it is represented in the CATH database.
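The distinction between sequential and non-sequential alignments can be illustrated with a minimal sketch. It assumes an alignment is given as a list of equivalent residue index pairs (position in chain A, position in chain B); this representation and the function names are hypothetical and chosen only for illustration, and they are not taken from TOPOFIT or CATH.

```python
from typing import List, Tuple

# Assumed representation: (residue index in chain A, residue index in chain B).
Pair = Tuple[int, int]


def count_order_breaks(pairs: List[Pair]) -> int:
    """Count places where, walking along chain A from N- to C-terminus,
    the matched position in chain B fails to increase."""
    ordered = sorted(pairs)  # order the pairs by position in chain A
    return sum(
        1
        for (_, b_prev), (_, b_next) in zip(ordered, ordered[1:])
        if b_next <= b_prev
    )


def is_sequential(pairs: List[Pair]) -> bool:
    """An alignment is sequential if chain B positions strictly increase
    when the pairs are ordered by chain A position."""
    return count_order_breaks(pairs) == 0


# Toy data (invented for illustration, not from any real structure).
sequential_example = [(1, 5), (2, 6), (3, 7), (10, 20)]
# The first segment of A matches a late region of B and the second segment
# of A matches an early region of B, so B would need to be reshuffled.
non_sequential_example = [(1, 50), (2, 51), (3, 52), (40, 5), (41, 6)]

print(is_sequential(sequential_example))
print(is_sequential(non_sequential_example))
print(count_order_breaks(non_sequential_example))
```

In the second toy alignment the single order break marks the point at which one chain would have to be reshuffled to restore a sequential arrangement of the shared regions.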

As part of my thesis research, I led the design and implementation of three peer-reviewed biological databases, each of which has been shared with the scientific community. First, the Structural Exon database was developed to address the inability to compare exon/intron structure across different eukaryotic organisms. Although many homologous proteins in eukaryotes have lost significant sequence identity over the course of evolution, the exon/intron structure of their genes frequently remains conserved, and the Structural Exon database was designed specifically to identify such conserved gene organization. Second, the Structure SNP database was developed to provide integrated reports on non-synonymous SNPs (nsSNPs) and related metabolic pathway information, along with real-time visual comparative analysis of modeled structures carrying nsSNPs.

The final and most complex database, TOPOFIT-DB, is currently one of the largest protein structure comparison databases in the world. The critically important feature of TOPOFIT-DB is a reduction in the total amount of stored data needed to represent the entire PDB. More specifically, the number of comparisons needed to traverse the entire PDB in an all-against-all approach has been scaled down by 98%, from 7 × 10^9 alignments to a manageable 1.5 × 10^6 alignments. The number of chains in the PDB has been condensed by 85%, from ~120,000 protein chains to 16,646 protein chains, a number suitable for conducting large-scale analyses. This reduction in the number of protein chains is pivotal for two reasons. First, maintaining a current database required a pipeline for circumventing the redundancy found throughout the structures in the PDB; the current pipeline in TOPOFIT-DB significantly reduces the number of comparisons needed to add new chains to the database. Second, the reduction aided the analysis of structures of unknown function, which played a central role in the in silico analysis of a proposed novel tyrosine radical position in R2 of ribonucleotide reductase and allowed a testable hypothesis to be put forward regarding this protein. Finally, along with the presentation of TOPOFIT-DB, I present data confirming the accuracy of the algorithm used in the database as one of the preeminent methods available for protein structure comparison.

Document Type


Rights Holder

Chesley Leslin
