Making Sense of Genomic Data: Where are we in the function annotation race?


Genome-sequencing initiatives are multiplying worldwide and sequencing costs keep falling, so a huge amount of genomic data is now accumulating. In contrast, the functions of only around 1% of these sequences are currently known from experimental studies. The gap between annotated sequences and unannotated ones (sequences whose function is not known) will continue to widen, since experimental functional characterisation of such large amounts of genomic data is not feasible. Computational function prediction approaches will be essential to bridge this gap.

The information encoded in the genome is translated into proteins, which carry out the biological functions required for the proper functioning of a cell. Proteins are linear chains of amino acids (determined by the nucleotide sequence of genes) linked together by peptide bonds, and they can be as diverse as the functions they serve. Depending on its amino-acid composition and sequence, a protein folds into its native three-dimensional conformation, which allows it to interact with other proteins or molecules and perform its function. Proteins are often considered the ‘workhorse’ molecules of the cell and perform diverse functions: as biological catalysts, structural elements and carrier molecules, and in cell signalling and cellular metabolism, amongst others. As a result, in order to better understand the cell at the molecular level and ‘decode’ the available genomic data, it is essential to characterise protein functions.

The functional role of a protein can be studied or described in many different ways: by the molecular function or biochemical activity of the protein, its role in a biological process or its relatedness to a disease. Hence, the term ‘protein function’ can be very ambiguous unless the context in which the function is described is stated clearly. Natural-language descriptions of protein function in the literature have been found to be too vague and unspecific to describe function accurately; this led to the development of a common, organised vocabulary for protein annotation: the Gene Ontology. It is the largest and most widely used resource of protein functions, and it can be used to assign functions to proteins in different contexts irrespective of the source organism.

The conventional way to predict protein function is to run a sequence (or structure) homology search against a protein sequence (or structure) database to identify similar proteins, and then to extrapolate from the known functions of the most similar sequence (or structure). This is based on the principle that evolutionarily related proteins with high sequence (or structure) similarity have similar, if not identical, functions. However, this approach is error-prone when the protein in question is substantially different from every protein of known function in the database; when proteins do not follow the simple linear relationship between protein similarity and function; or when proteins ‘moonlight’.
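The search-and-transfer idea described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the sequences and function labels are made up, the names `similarity` and `transfer_function` and the `min_similarity` threshold are hypothetical, and the standard library's `difflib.SequenceMatcher` is used as a crude stand-in for a real alignment score such as one produced by a dedicated homology search tool.

```python
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def transfer_function(query, annotated, min_similarity=0.5):
    """Copy the functions of the most similar annotated sequence.

    Returns (functions, score). If even the best hit scores below
    min_similarity, returns (None, score) instead of guessing.
    """
    best_seq = max(annotated, key=lambda seq: similarity(query, seq))
    score = similarity(query, best_seq)
    if score < min_similarity:
        return None, score
    return annotated[best_seq], score


# Made-up sequences and labels, for illustration only.
annotated = {
    "MKVLAAGIVGLCAR": {"catalytic activity"},
    "GSSPWWTTEEDDKK": {"binding"},
}

# A query one residue away from the first entry inherits its annotation.
print(transfer_function("MKVLAAGIVGLSAR", annotated))

# A query unlike anything in the database gets no prediction at all.
print(transfer_function("QQQQQQQQQQQQQQ", annotated))
```

The threshold makes the first failure case explicit: a query with no close relative in the database is left unannotated rather than given a misleading function. The sketch does nothing about the other two failure cases, which is precisely why they are hard.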

Typically, function prediction methods assume that the more similar two proteins are (in sequence or structure), the more alike their functions. In reality, however, minor variations in protein sequence or structure can sometimes lead to a substantial change in molecular function, and very different proteins can sometimes perform similar or identical functions.

Moonlighting proteins pose additional challenges for function prediction methods since, without any significant change in their sequence or structure, they can carry out more than one distinct function depending on where they are localised or on their concentration. Phosphoglucose isomerase is one such ‘moonlighting’ protein: it functions as a cell-metabolism enzyme inside the cell and as a nerve growth factor outside the cell.

Because most uncharacterised genomic sequences are hard to characterise using a single method, most recent function prediction methods combine several sequence-based and structure-based methods using machine-learning approaches. Many methods are available today that provide computational function predictions by exploiting different approaches. However, experimental biologists need to know whether this vast number of function prediction methods is of any value to them and whether the predictions can be relied upon. The Critical Assessment of Function Annotation (CAFA) experiment is a major bioinformatics initiative that addresses this need. It aims to provide an unbiased, large-scale assessment of protein function prediction methods: to determine which methods perform better and to gauge the ability of the field as a whole to provide function predictions for the colossal amounts of genomic data currently available.

The first CAFA experiment in 2013 succeeded in providing an understanding of the performance of existing function prediction methods. At the same time, it highlighted the major challenges and limitations of automated function prediction for computational biologists, database curators and experimental biologists. One of the main challenges is the need for accurate, experimentally determined functions of characterised proteins in databases, since all methods can only make predictions based on the available protein function data. However, significant biases have recently been identified in databases, owing to the growing use of high-throughput experiments, which contribute only very general functions to experimental protein annotations. As a result, almost 25% of the characterised proteins in public databases currently have ‘binding’ listed as their function. Moreover, experimentally known functions for most proteins are incomplete, as the experiments are biased by experimenter choice and the annotations are limited by the scope of the experiments. These biases affect not only our understanding of protein function in general but also the functions that prediction methods predict, since these methods can only draw on the available annotations. Hence, it is essential that both developers of automated function prediction methods and experimental biologists who use computational function annotations to guide their experiments are aware of such experimental biases.
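To make the idea of annotation bias concrete, here is a minimal sketch of how the share of a very general term could be measured in a set of annotations. Everything in it is an assumption for illustration: the protein identifiers and function labels are invented, and the helper names `share_with_term` and `share_with_only_term` are hypothetical rather than part of any database's tooling.

```python
def share_with_term(annotations, term):
    """Fraction of annotated proteins that list `term` among their functions."""
    if not annotations:
        return 0.0
    hits = sum(1 for terms in annotations.values() if term in terms)
    return hits / len(annotations)


def share_with_only_term(annotations, term):
    """Fraction of annotated proteins whose sole listed function is `term`."""
    if not annotations:
        return 0.0
    hits = sum(1 for terms in annotations.values() if terms == {term})
    return hits / len(annotations)


# Invented annotations, for illustration only.
annotations = {
    "protein_A": {"binding"},
    "protein_B": {"binding", "catalytic activity"},
    "protein_C": {"transporter activity"},
    "protein_D": {"catalytic activity"},
    "protein_E": {"structural molecule activity"},
}

print(share_with_term(annotations, "binding"))
print(share_with_only_term(annotations, "binding"))
```

A prediction method trained on a database in which one uninformative term dominates will tend to reproduce that term, so tracking such shares is one simple way to keep the bias in view.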