Computational Methods for Protein Structure Prediction & Modeling V1 - Xu Xu and Liang
Preface
An ultimate goal of modern biology is to understand how the genetic blueprint of
cells (genotype) determines the structure, function, and behavior of a living organism
(phenotype). At the center of this scientific endeavor is characterizing the biochemical
and cellular roles of proteins, the working molecules of the machinery of life. A
key to understanding functional proteins is knowledge of their folded structures
in a cell, as the structures provide the basis for studying proteins’ functions
and functional mechanisms at the molecular level.
Researchers working on structure determination have traditionally selected individual
proteins due to their functional importance in a biological process or pathway
of particular interest. Major research organizations often have their own protein
X-ray crystallography and/or nuclear magnetic resonance (NMR) facilities for structure
determination, which has proceeded at a rate of a few to dozens of structures a
year. Realizing the widening gap between the rates of protein identification (through
DNA sequencing and identification of potential genes through bioinformatics analysis)
and the determination of protein structures, a number of large scientific initiatives
have been launched in the past few years by government funding agencies in
the United States, Europe, and Japan, with the intention to solve protein structures
en masse, an effort called structural genomics. A number of structural genomics
centers (factory-like facilities) have been established that promise to produce solved
protein structures in a similar fashion to DNA sequencing. These efforts as well as
the growth in the size of the community and the substantive increases in the ease
of structure determination, powered with a new generation of technologies such as
synchrotron radiation sources and high-resolution NMR, have accelerated the rate
of protein structure determination over the past decade. As of January 2006, the
Protein Data Bank (PDB) contained ∼34,500 protein structures.
The role of structure for biological sciences and research has grown considerably
since the advent of systems biology and the increased emphasis on understanding
molecular mechanisms from basic biology to clinical medicine. Just as every
geneticist or cell biologist in the 1990s needed to obtain the sequence of the gene
whose product or function they were studying, those biologists will increasingly
need to know the structure of the gene product for their research programs in this
century. One can anticipate that the rate of structure determination will continue to
grow. However, the large expenses and technical details of structure determination
mean that it will remain difficult to obtain experimental structures for more than a
small fraction of the proteins of interest to biologists. In contrast, DNA sequence
determination has doubled routinely in output for a couple of decades. The genome
projects have led to the production of 100 gigabytes of DNA data in GenBank, and
as the cost of sequencing continues to drop and the rate continues to accelerate, the
scientific community anticipates a day when every individual has the genes of their
interest and the genomes of all related major organisms sequenced.
Structure determination of proteins began before nucleic acids could be sequenced,
which now appears almost ironic. As microchemistry technologies continued
to mature, ever more powerful DNA sequencing instruments, new methods for
preparing suitable quantities of DNA, and cheaper, higher-throughput sequencing
enabled a revolution in the biological and biomedical sciences but also left
structure determination far behind. As sequencing capacity matured in the last few
decades of the twentieth century, DNA sequences exceeded protein structures by
10-fold, then 100-fold, and now there is a 1000-fold difference between the number
of genes in GenBank and the number of structures in the PDB. The order-of-magnitude
difference is about to jump again, in the era of metagenomics, as the analyses of
communities of largely unculturable organisms in their natural states come to dominate
sequence production. The J. Craig Venter Institute’s Sargasso Sea experiment
and other early metagenomics experiments at least doubled the number of known
open reading frames (ORFs) and potential genes, but the more recent ocean voyage
data (or GOS) multiplied the number on the order of another 10-fold, probably more.
The rate of discovery of novel genes and correspondingly novel proteins has not
leveled off, since nearly half of new microbial genomes turn out to be novel. Furthermore,
in the metagenomics data, new families of proteins are discovered at a rate directly
proportional to the rate of gene (ORF) discovery.