Alexander Kanapin

PhD, Head of Computational Genomics, Department of Oncology, University of Oxford; Russia’s Representative in the ELIXIR Consortium

In an alternative version of human history presented in the cult mid-1990s animated television series Shinseiki Evangelion (Neon Genesis Evangelion), earthlings face another apocalypse in 2015, brought about by scientists working in genomics, cloning and bioinformatics. The authors call it the Third Impact.

In the middle of the last century, Professor Igor Tamm, a distinguished physicist and Nobel Prize winner, argued that the coming century would be the century of biology, just as the 20th century had been the century of physics. In terms of public interest, he was certainly right. But is it so in terms of scientific breakthroughs? By analogy with Evangelion, we can say that two “impacts”, or, more precisely, breakthroughs, have already taken place. But should we anticipate a third breakthrough, one that would fundamentally provide new knowledge in basic science and bring about new cures and treatments in medical practice?

Let’s consider one of the most rapidly developing branches of scientific knowledge, namely bioinformatics. Like many other interdisciplinary fields, it is hard to define precisely. For the purposes of this review, we understand bioinformatics as the use of information technology for the analysis of biological data.

Figures in biology: from Mendel's peas to personal genomes

Historically, biology developed as a descriptive science. Thus, a substantial part of Charles Darwin’s work as a biologist during his expeditions took the form of illustrations: drawings of different kinds of animals. The digitization of images later gave a new impetus to the analysis of graphical information in biology. In a certain sense, Gregor Mendel can be regarded as the first bioinformatician, since he used quantitative data to solve a purely biological problem: by counting the number of peas with different phenotypes over a number of generations, he was able to formulate the laws of heredity.

Over time, more and more features appeared in biology that could be “counted”: statistical genetics, studies of population dynamics, the kinetics of biochemical reactions and other processes occurring in biological systems. The well-known “predator-prey” model was one of the first examples of the use of mathematics to model biological processes. An essential step in this development was the use of numerical methods in biophysics for modeling the structure and dynamics of biopolymers, namely proteins and nucleic acids.
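
To give a minimal, purely illustrative sketch of what such modeling looks like in code, the classic Lotka-Volterra predator-prey equations can be integrated numerically in a few lines of Python; the parameter values and starting populations below are arbitrary and chosen only for demonstration.

# Minimal sketch: Euler integration of the Lotka-Volterra predator-prey model
#   dx/dt = a*x - b*x*y   (prey)
#   dy/dt = d*x*y - c*y   (predator)
# All parameter values and starting populations are arbitrary illustrations.

def lotka_volterra(x, y, a=1.0, b=0.1, c=1.5, d=0.075):
    """Return the time derivatives (dx/dt, dy/dt) for prey x and predator y."""
    return a * x - b * x * y, d * x * y - c * y

def simulate(x0=10.0, y0=5.0, dt=0.001, steps=20000):
    """Integrate the system with a simple explicit Euler scheme."""
    x, y = x0, y0
    trajectory = [(0.0, x, y)]
    for i in range(1, steps + 1):
        dx, dy = lotka_volterra(x, y)
        x, y = x + dx * dt, y + dy * dt
        trajectory.append((i * dt, x, y))
    return trajectory

if __name__ == "__main__":
    for t, prey, predator in simulate()[::2000]:
        print(f"t={t:6.2f}  prey={prey:8.2f}  predator={predator:8.2f}")

In real applications such systems are, of course, solved with proper ODE solvers and fitted to experimental data rather than to invented parameters.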

Such calculations certainly required significant computing power, but the question of making bioinformatics a separate discipline was not raised until the mid-1980s. The term “bioinformatics” first appeared in the title of a scientific article, New Directions in Bioinformatics, published in 1989.

In the 1980s, two trends converged in the development of science and technology. First, the first sequences of proteins and nucleic acids became available on a large scale (in the hundreds of thousands). Second, personal computers emerged, allowing biologists to analyze new types of data that were practically impossible to process manually. Thus, the Swiss bioinformatician Amos Bairoch developed the molecular biology software package PC/GENE, which ran on a simple personal PC/XT computer with a 4 MHz processor and allowed molecular biologists to do almost everything they needed. Around the same time (1986), Amos Bairoch created the Swiss-Prot protein sequence database, which still holds the status of the world’s primary protein sequence resource in terms of reliability.

Russian scientists and programmers kept abreast of the mainstream. Suffice it to recall some of the software packages created in the early 1990s that were in no way inferior to their foreign analogues in functionality and innovation, namely GeneBee (Moscow State University), VOSTORG (the Institute of Cytology and Genetics SB RAS), Samson (the Institute of Mathematical Problems of Biology, Pushchino) and others. Information technology ceased to be the exclusive realm of programmers and mathematicians and helped expand the range of biological problems that could be solved by computers.

Thus, by the mid-1990s, bioinformatics had taken a worthy place among other branches of knowledge. Its importance was recognized by the international scientific community through the establishment of the “three pillars” of the bioinformatics world: the European Bioinformatics Institute (EBI) in 1992, the National Center for Biotechnology Information (NCBI) in 1988 and the DNA Data Bank of Japan (DDBJ) in 1986. It should be noted that this explosive growth of data mainly involved molecular biology (and continues to do so to a considerable degree). The biopolymers that store genetic information (DNA), as well as those that act as the direct tools of biochemical reactions in the cell (proteins), can be represented as sequences of symbols, a kind of text in a fixed alphabet. This type of information is easily formalized for storage and processing in computer systems.
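
To sketch what this formalization amounts to in practice: a DNA fragment is just a string over the alphabet A, C, G, T, and elementary operations such as computing GC content or the reverse complement reduce to ordinary string processing. The sequence in the example below is invented purely for illustration.

# Minimal sketch: a DNA fragment as a string over the alphabet {A, C, G, T}.
# The sequence itself is invented for illustration.
dna = "ATGGCGTACGTTAGCCTAGGCTA"

def gc_content(seq):
    """Fraction of G and C letters, a basic sequence statistic."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def reverse_complement(seq):
    """The complementary strand read in the opposite direction."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))

print(f"GC content:         {gc_content(dna):.2f}")
print(f"Reverse complement: {reverse_complement(dna)}")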

The volume of data in modern biological science is such that without the use of information technology, no analysis is possible. The coming era of personal genomic data, which allows for reading the genetic code of almost any human being, makes the role of bioinformatics even more significant.

On the immortality of TV sets, or on the science of publicity

On June 26, 2000, Bill Clinton and Tony Blair announced the publication of the first version of the human genome at a joint press conference. This date has certainly become a landmark in the history not only of science but of humanity as well. Many scientists took the news with caution, realizing that the sequence that had been read, with all its three billion letters of code making up the complete genetic blueprint of a human being, was only the beginning. Even the title of the published article, Initial Sequencing and Analysis of the Human Genome, emphasized the preliminary nature of the findings. The authors expressly stated that they were offering a draft version of the human genome and preliminary results of the data analysis. In contrast to the scientists, the politicians were much more optimistic. For example, President Clinton remarked at the end of the official portion that people would soon live to 150, and that “our children's children will know the term cancer only as a constellation of stars”.

However, such optimistic expectations gave way to disillusionment three or four years later. Many scientists spoke of a “genomic bubble”, by analogy with speculative financial bubbles. Most of them admitted that it would take a long time to make practical use of the discovery in medicine, which requires not only patience but also an understanding of what can actually be done with the data. Protein-coding segments make up only about 1 per cent of the human genome. Expectations that information about all proteins and their genes would radically improve the situation and lead to breakthroughs in the development of new drugs and treatments were therefore somewhat premature.

Cystic fibrosis, a genetic disease caused by a mutation in a single gene encoding a regulatory protein, offers a classic example. The headline of an article published in Nature, One Gene, Twenty Years, speaks for itself. It highlights the fact that even in such a seemingly simple case, when the specific mutated gene that causes the disease is known, we are still far from victory over the disease. Given that most diseases are caused not by one but by many mutated genes, the sanguine hope of quick success in the medical application of discoveries relating to the human genome looks like wishful thinking.

[Image: The Encyclopedia of DNA Elements (ENCODE). Source: circos.ca]

Three years after the publication of the human genome, a public research consortium named ENCODE, the Encyclopedia of DNA Elements, was launched. The main goal of the project was to develop a detailed description of all the genes and other genomic elements, creating a kind of map or encyclopedia. In 2012, preliminary mapping results were published in a large review article and in 29 additional articles on specific biological problems.

In essence, the Consortium participants tried to assign a function to each particular segment of the human genome. Using experimental techniques and bioinformatics methods, they found that about 80 per cent of the genomic DNA sequence is either read, serving as a template for RNA synthesis (a process called transcription), or, if a given segment of the chromosome is not read, can be chemically modified and serve as a regulator, turning the reading of certain portions of genomic DNA on or off. By comparing the genomes of different species of primates and other mammals, the participants also performed an evolutionary analysis: if the same DNA segment is found in different animal species, it has changed little if at all over the course of evolution, and its function is therefore likely to be important for the cell or the whole organism.
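
The logic of such a conservation analysis can be illustrated with a deliberately simplified sketch: given already aligned, equal-length segments of the “same” region from several species (the sequences below are invented), the fraction of alignment positions that are identical across all species serves as a crude conservation score. Real pipelines use whole-genome alignments and far more sophisticated statistics.

# Toy conservation score: invented, pre-aligned segments of the "same"
# genomic region from three species.
aligned = {
    "human":      "ATGCCGTTAGCA",
    "chimpanzee": "ATGCCGTTAGCA",
    "mouse":      "ATGCAGTTAGCT",
}

def conserved_fraction(sequences):
    """Fraction of alignment columns where all species carry the same letter."""
    length = len(next(iter(sequences.values())))
    identical = sum(1 for column in zip(*sequences.values()) if len(set(column)) == 1)
    return identical / length

print(f"Conserved fraction: {conserved_fraction(aligned):.2f}")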

The ENCODE Project was certainly an important step in the study of the human genome. Having read the genome, scientists have recorded the letters of which it consists. Now the task is to understand what words are written with these letters and the meaning of phrases composed of them. However, biological systems are notable for an unparalleled complexity of structure and the organization of processes occurring in them, which eliminates the possibility of simple and fast solutions.

The ENCODE data publication generated an intense discussion in the scientific community. One of the most ironic and widely discussed articles was entitled On the Immortality of Television Sets: “Function” in the Human Genome According to the Evolution-Free Gospel of ENCODE. The authors of the article reasonably argued that the ENCODE consortium publications blow functionality out of proportion, and that assigning function to 80 per cent of the genome is based on logically contradictory assumptions, which ignore the theory of molecular evolution and other fundamental biological postulates.

If the functionality of genomic segments is not protected by natural selection, they will accumulate deleterious mutations and will cease to be functional. The ENCODE authors however have adopted the absurd alternative that assumes that no deleterious mutations can ever occur in the regions they have deemed to be functional. Such an assumption is akin to claiming that a television set left on and unattended will still be in working condition after a million years because no natural events, such as rust, erosion, static electricity, and earthquakes, can affect it.

While recognizing the value of the Human Genome and ENCODE Projects, we should distinguish the so-called publicity science from real scientific knowledge, which leads to the practical application of discoveries.

Biological systems are incredibly complex, and for the time being we can only describe the laws of their functioning in simple terms and equations for relatively simple situations. The problem is not just that computing power lags behind the pace at which new technologies (e.g. next-generation sequencing) produce biological data. The problems that mathematicians face today are in many respects similar to those that physicists faced when the quantum paradigm replaced the classical one.

[Image: A USB sequencer. Source: innovaworld.ru]

One of the most interesting works on the philosophy of science is entitled Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better. The author, Professor Joel E. Cohen, describes ten major challenges facing biologists and mathematicians. In his view, the need to analyze and model complex biological systems, from cells to biotic communities, could lead to new theories and algorithms in mathematics and computer science. According to the scientist, the current situation is similar to that which existed in physics at the beginning of the 20th century and led to the creation of quantum mechanics and relativity theory.

The future is not far away: from genes to drugs

Therefore, we can claim that two breakthroughs have already taken place, namely the Human Genome Project and ENCODE. What should we expect in the near future: a new breakthrough or stagnation?

The advent of relatively cheap methods for sequencing DNA and RNA has been one of the most notable changes in experimental molecular biology. Today, a full reading of the human genome can cost around 1,000-1,500 dollars, and the price is declining steadily. This progress is due to the emergence of Next-Generation Sequencing (NGS). The leading manufacturer of DNA sequencing machines is Illumina, and its main competitors are Pacific Biosciences and Life Technologies with its Ion Torrent technology. In most of these techniques, the genome is divided into short fragments (several hundred letters, or nucleotides, long) that are quickly read by some physicochemical method. The human genome is thus represented as a file containing tens of millions of short fragments.
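
In practice, such fragments (reads) are usually delivered as a FASTQ file, in which every read occupies four lines: an identifier, the sequence itself, a separator and a per-base quality string. A minimal sketch of counting the reads in such a file and their average length might look as follows; the file name reads.fastq is hypothetical.

# Minimal sketch: counting reads and their lengths in a FASTQ file, where
# every read occupies four lines (identifier, sequence, '+', qualities).
# The file name "reads.fastq" is hypothetical.

def fastq_sequences(path):
    """Yield the sequence line of each four-line FASTQ record."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of every record holds the sequence
                yield line.strip()

if __name__ == "__main__":
    lengths = [len(seq) for seq in fastq_sequences("reads.fastq")]
    print(f"reads: {len(lengths)}")
    if lengths:
        print(f"mean read length: {sum(lengths) / len(lengths):.1f} bases")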

However, new technologies continue to appear, and the Oxford Nanopore approach looks like one of the most promising. It is now being tested and has already attracted the attention not only of scientists but of the media too. The technology performs single-molecule sequencing with the help of a USB device not much bigger than a conventional memory stick. In theory, it allows for reading longer fragments of the genome (up to 100 thousand nucleotides), although the error rate of such reading is still around 10 per cent. The miniaturized single-molecule analysis system performs only base calling, while the main computational work is done in the cloud.

Reading the genome as a set of short pieces has its limitations. So far, no algorithm exists that can assemble the complete sequence of the human genome from the millions of short fragments read by the sequencer. The analysis method applied today is therefore called resequencing. It implies that all fragments of a genome are compared with a certain reference sequence, a kind of canonical version of the genome compiled from the genomes of multiple individuals. Such a sequence is, of course, a generalization rather than the genuine genome of any individual. During resequencing, the fragments of an individual genome are compared with this canonical version, which makes it possible to identify differences that characterize a particular individual or, if a number of individual genomes are compared, a group of people (for example, those suffering from a certain hereditary disease).
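
The core of this comparison can be shown with a toy sketch: a read that is assumed to be already mapped to a known position on an invented reference is compared base by base, and the mismatching positions are reported as candidate variants. Real pipelines rely on dedicated aligners and much more careful statistical filtering.

# Toy sketch of the resequencing idea: a read, already placed at a known
# position on an invented reference sequence, is compared base by base and
# mismatches are reported as candidate variants.
reference = "ACGTTAGCCATGGATCCGTA"  # invented "canonical" sequence
read = "TAGCCATGCATC"               # invented read
position = 4                        # 0-based mapping position, assumed known

def candidate_variants(ref, read_seq, pos):
    """Return (reference position, reference base, read base) for each mismatch."""
    variants = []
    for offset, base in enumerate(read_seq):
        ref_base = ref[pos + offset]
        if base != ref_base:
            variants.append((pos + offset, ref_base, base))
    return variants

for ref_pos, ref_base, read_base in candidate_variants(reference, read, position):
    print(f"position {ref_pos}: reference {ref_base} -> read {read_base}")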

DNA resequencing helps us understand the structure of the genome. Other methods, including RNA-Seq, can reveal how this structure works. The RNA-Seq technique helps determine whether a given gene is active in the cell and producing RNA, from which proteins performing particular functions are synthesized. Chromatin immunoprecipitation (ChIP) techniques make it possible to understand how the work of genes is regulated: how a particular protein switches genes on or off, adjusting the work of a cell or an organism to changing external conditions.
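
In its crudest form, the quantitative side of RNA-Seq comes down to counting how many reads fall within each gene: the more reads a gene attracts, the more actively it is transcribed. The sketch below uses invented gene coordinates and read positions purely to show this counting step; real analyses normalize for gene length, sequencing depth and many other factors.

# Very rough sketch of RNA-Seq quantification: count how many mapped reads
# start within each gene's coordinates. Gene coordinates and read positions
# are invented for illustration.
genes = {
    "geneA": (0, 1000),
    "geneB": (1500, 2500),
}
read_starts = [120, 450, 980, 1600, 1602, 2100, 3000]  # mapped read start positions

def count_reads(gene_coords, positions):
    """Number of reads whose start falls inside each gene interval."""
    counts = {name: 0 for name in gene_coords}
    for pos in positions:
        for name, (start, end) in gene_coords.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

for name, count in count_reads(genes, read_starts).items():
    print(f"{name}: {count} reads")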

[Image: Is the Bioinformatic Apocalypse Looming: Big Data and Biosecurity. Source: microbialchronicles.com]

Sequencing is not the only way to obtain biological data. The so-called biochips, or DNA microarrays, are also widely used for this purpose. The technology was first applied to the analysis of genomes in the laboratory of Academician A. Mirzabekov and has since been widely used as a fairly cheap alternative to sequencing, especially in medical diagnostics. It makes it possible to determine quickly whether a patient’s gene carries a mutation that will affect the success of treatment with a particular drug.

Technologies that directly count the number of particular molecules in cells are also worth noting, such as the NanoString technology or quantitative proteomics, which analyzes proteins by mass spectrometry.

Pictures have made their comeback in “big biology” too. Computerized image analysis is used not only in the obvious case of diagnosing cancer from cell and tissue morphology, but also in studying the dynamics of biochemical reactions in cells at the molecular level.

All these techniques generate huge amounts of data of various types. Their processing, analysis and storage are, of course, impossible without computers, and bioinformatics is of paramount importance in this process. On the one hand, it involves the development of new algorithms for the analysis and integration of heterogeneous data. On the other hand, purely engineering problems of organizing storage and access to data with different levels of protection (e.g. clinical data) need to be addressed. In other words, the Third Impact in bioinformatics is associated by many with what is called Big Data.

* * *

Data volumes are growing, but the understanding of what is behind them is, unfortunately, still very far from what is needed for practical use, especially in medicine.

One of the main challenges facing bioinformatics as genome sequencing techniques advance was formulated by Professor George M. Church: $1,000 to sequence a genome, and then $1 million to interpret it. Data volumes are growing, but our understanding of what lies behind them is, unfortunately, still very far from what is needed for practical use, especially in medicine. This is largely due to the nature of the data produced by modern technologies. Genome sequencing provides a static picture: it can only describe the set of genetic constructions that an individual was born with. How these constructions function throughout life, which of them will manifest themselves (and in what way) and which will not, depends on a huge number of external factors.

Moreover, because the de novo assembly of a complete human genome from sequenced fragments is not yet possible, data analysis is reduced mainly to searching for differences among people and explaining the function of a particular segment of the genome through these differences. Such features can be either neutral (eye color, blood type, height) or pathological (a particular genetic disease). When differences in a particular gene (or, more broadly, in a segment of the genome) are associated with some external feature, they can reveal that segment’s functional role. Big Data is certainly of great importance here: the more patients with a particular disease are analyzed, the more reliable the statistics will be, allowing for a more precise identification of the genetic markers related to the disease. No wonder genetic diagnostics is now undergoing a boom.
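
The statistical logic behind this search for markers can be illustrated with a toy case-control comparison: allele counts at a single, invented marker are tabulated for patients and healthy controls and tested for association with Fisher’s exact test from SciPy. Real genome-wide studies repeat such tests for millions of markers and apply corrections for multiple testing.

# Toy case-control association test at a single, invented genetic marker:
# counts of the risk vs. non-risk allele in patients and healthy controls,
# tested with Fisher's exact test.
from scipy.stats import fisher_exact

#                risk allele, other allele
cases = [180, 220]      # invented allele counts in patients
controls = [120, 280]   # invented allele counts in healthy controls

odds_ratio, p_value = fisher_exact([cases, controls])
print(f"odds ratio: {odds_ratio:.2f}")
print(f"p-value:    {p_value:.2g}")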

However, introducing a new diagnostic technique into clinical practice takes a lot of time. The first clinical tests using next-generation sequencing were officially approved by the US Food and Drug Administration (FDA) only a year ago. This is because, so far, genetic tests detect only the most obvious and easily explained features. Quantity has not yet turned into quality: the abundance of data on new areas of the genome associated with normal or abnormal features does not yet provide sufficient grounds for a breakthrough in biological knowledge. The situation resembles the one that followed the reading of the human genome: scientists simply catalogue features, attributing different segments of the genome to them. This applies not only to sequencing but also to other methods of obtaining data, such as biochips, mass spectrometry and image analysis. New information is accumulating like an avalanche, and much of it is already finding practical application.

The fact that biology has always moved from accumulating data to generalizing it inspires optimism. We can only hope that a breakthrough is in the offing. And without bioinformaticians, those “accountants of biology”, it will be virtually impossible.
