Gene Prediction
Genes are the basic functional units of the genome and may combine into larger functional units such as operons, transcriptional units, and functional networks. Gene prediction (or gene calling) is the procedure of identifying the protein- and RNA-coding sequences in the sample DNA. Depending on the applicability and success of the assembly, gene prediction can be done on post-assembly contigs, on reads from the unassembled metagenome, or on a mixture of contigs and individual unassembled reads.

There are two main approaches to gene prediction:
  • “Evidence-based” gene prediction uses homology searches to identify genes similar to those observed previously.
  • “Ab initio” gene prediction relies on intrinsic features of the DNA sequence to discriminate between coding and noncoding regions, allowing the identification of genes without homologs in the available databases. These tools are mostly based on supervised learning and statistical pattern recognition methods, and most use Markov models or hidden Markov models (HMMs). For some programs, gene training sets (i.e., sets of parameters derived from known genes of the same or related organisms) can improve the quality of the predicted genes, while other programs are self-trained on the target sequence.
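
To make the ab initio idea concrete, the following is a minimal sketch of how a Markov model can separate coding from noncoding sequence. It is not the algorithm of any tool listed here: it assumes a simple first-order model, and the training sequences, the function names `train_markov` and `log_odds`, and the pseudocount are all illustrative. Real gene finders typically use higher-order, codon-position-aware models trained on far more data.

```python
import math

BASES = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | prev).

    A pseudocount is added to every transition so that unseen
    transitions never get probability zero.
    """
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in seqs:
        for prev, nxt in zip(seq, seq[1:]):
            # Skip ambiguous bases such as N.
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    model = {}
    for a in BASES:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in BASES}
    return model

def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over all transitions in seq.

    Positive scores favour the coding model, negative scores the
    noncoding model.
    """
    score = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        if prev in coding and nxt in coding[prev]:
            score += math.log(coding[prev][nxt] / noncoding[prev][nxt])
    return score

# Toy training data (invented for illustration): GC-rich "genes"
# versus AT-rich "intergenic" sequence.
coding_model = train_markov(["ATGGCCGCGGGCGCCGGCTGA", "ATGCGCGGCGCCGCGGCGTAA"])
noncoding_model = train_markov(["TATAATTTAAATATTTTAAAT", "AATTTATATAAATTATTTAAA"])

for fragment in ("GCGGCCGCG", "ATTTAATAT"):
    score = log_odds(fragment, coding_model, noncoding_model)
    label = "coding-like" if score > 0 else "noncoding-like"
    print(fragment, round(score, 2), label)
```

The "training set" described above corresponds to the sequences passed to `train_markov`; a self-trained program would instead derive those parameters from the target sequence itself.
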
The following table shows a list of commonly used tools for gene prediction.
| Year | Tool | Short description |
|------|------|-------------------|
| 2007 | FGENESH | FGENESH is an HMM-based gene structure prediction tool (multiple genes, both strands). |
| 2010 | FragGeneScan | FragGeneScan is an application for finding (fragmented) genes in short reads. |
| 2005 | GeneMark | GeneMark is a family of gene prediction programs developed at Georgia Institute of Technology. |
| 2009 | GENSCAN | GENSCAN can predict the locations and exon-intron structures of genes in genomic sequences from a variety of organisms. |
| 2007 | Glimmer | Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. |
| 2012 | Glimmer-MG | Glimmer-MG is a system for finding genes in environmental shotgun DNA sequences. |
| 2000 | HMMgene | HMMgene is a tool for predicting vertebrate and C. elegans genes. |
| 2007 | MED | MED is a non-supervised gene prediction algorithm for bacterial and archaeal genomes. |
| 2008 | MetaGeneAnnotator | MetaGeneAnnotator is a gene-finding program for prokaryotes and phages. |
| 2013 | MetaGUN | MetaGUN is a gene prediction method for metagenomic fragments based on a support vector machine (SVM) approach. |
| 2013 | MGC | MGC is an application for finding complete and incomplete genes in metagenomic reads. |
| 2009 | Orphelia | Orphelia is a metagenomic ORF-finding tool for the prediction of protein-coding genes in short environmental DNA sequences of unknown phylogenetic origin. |
| 2012 | MetaProdigal | Prodigal can run in metagenomic mode and analyze sequences even when the organism is unknown. |

The quality of gene predictions in microbial metagenome data sets is inferior to that achieved for sequenced genomes. Combining multiple gene finders, screening intergenic regions for overlooked genes, and using dedicated frameshift detectors are common strategies to overcome at least some of these limitations.
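
One way to combine multiple gene finders is a simple support vote, sketched below under stated assumptions. Predictions are assumed to be already parsed into `(start, end, strand)` tuples, and two predictions count as the same gene when they share a strand and a 3' (stop-codon) end. The tool names `tool_a`, `tool_b`, and `tool_c`, the function names, and the rule of keeping the longest variant are hypothetical choices, not a published method.

```python
from collections import defaultdict

def stop_key(pred):
    """Key a prediction by strand and 3' end (the stop-codon side)."""
    start, end, strand = pred
    return (strand, end if strand == "+" else start)

def consensus_genes(predictions_by_tool, min_support=2):
    """Keep genes reported by at least `min_support` tools.

    Returns a list of (prediction, supporting_tools) sorted by start.
    """
    support = defaultdict(dict)  # stop key -> {tool name: prediction}
    for tool, preds in predictions_by_tool.items():
        for pred in preds:
            support[stop_key(pred)][tool] = pred

    consensus = []
    for by_tool in support.values():
        if len(by_tool) >= min_support:
            # Start sites often disagree between tools; keeping the
            # longest variant is an arbitrary choice for this sketch.
            best = max(by_tool.values(), key=lambda p: p[1] - p[0])
            consensus.append((best, sorted(by_tool)))
    consensus.sort(key=lambda item: item[0][0])
    return consensus

# Invented example coordinates for three hypothetical gene finders.
predictions = {
    "tool_a": [(100, 400, "+"), (900, 1500, "-")],
    "tool_b": [(130, 400, "+"), (2000, 2300, "+")],
    "tool_c": [(900, 1400, "-"), (130, 400, "+")],
}

for gene, tools in consensus_genes(predictions):
    print(gene, "supported by", ", ".join(tools))
```

Raising `min_support` trades sensitivity for specificity: genes called by a single tool are dropped, which removes some false positives but can also discard real genes that only one program detects.
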

References:

1. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010 Feb 26;6(2):e1000667. doi: 10.1371/journal.pcbi.1000667.
2. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev. 2008 Dec;72(4):557-78. doi: 10.1128/MMBR.00009-08.