Binning
A metagenomic sequence pipeline produces a collection of reads, contigs, and genes. Associating these data with the organisms from which they were derived is highly desirable for the interpretation of the ecosystem. This process of association between sequence data and contributing species (or higher level taxonomic groups) is called binning or classification.

Analyses of datasets obtained using shotgun sequencing involve characterizing the taxonomic and functional diversity of a given environment by analyzing DNA fragments originating from the genomes of resident microbes. Existing binning methods can be classified into two categories, namely taxonomy dependent and taxonomy independent:
  • Taxonomy dependent:
    A majority of methods available for binning datasets obtained using shotgun sequencing belong to the taxonomy-dependent category. In these methods, the extent of ‘similarity’ of reads with sequences (in reference databases) or pre-computed models (built using sequences in reference databases) drives the assignment process. Based on the strategy used for comparing reads with sequences/pre-computed models, taxonomy-dependent methods can be sub-classified into alignment-based, composition-based and hybrid methods.

    Here, we developed a novel taxonomy-dependent and alignment-free approach called MBMC (Metagenomic Binning by Markov Chains).
  • Taxonomy independent:
    Taxonomy independent methods simply group/bin reads in a given dataset based on their mutual similarity and do not involve a database comparison step.

    Unsupervised methods usually bin reads based on three observations:
    • The frequency of a k-mer in the reads from a genome is usually linearly proportional to that genome's abundance.
    • Sufficiently long w-mers are usually unique in each genome.
    • The short q-mer frequency distributions (or q-mer distributions in short) of individual sufficiently long reads sampled from the same genome or similar genomes are similar.
    Here, we proposed a taxonomy-independent method called MBBC (Metagenomic Binning Based on Composition).
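
The third observation above (reads from the same or similar genomes have similar q-mer distributions) can be illustrated with a short Python sketch. This is a minimal example under stated assumptions, not the MBBC implementation: the function names `qmer_distribution` and `euclidean_distance`, the choice of q = 2, the use of Euclidean distance, and the toy reads are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product
import math


def qmer_distribution(read, q=2):
    """Return the normalized q-mer frequency vector of a read.

    Only q-mers made of A, C, G and T are counted; the vector has
    4**q entries in lexicographic q-mer order.
    """
    read = read.upper()
    counts = Counter(read[i:i + q] for i in range(len(read) - q + 1))
    alphabet = ["".join(p) for p in product("ACGT", repeat=q)]
    total = sum(counts[m] for m in alphabet)
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[m] / total for m in alphabet]


def euclidean_distance(u, v):
    """Euclidean distance between two equally long frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy reads (invented for this example): two AT-rich, one GC-rich.
read_a = "ATATATTATAATATTA"
read_b = "TATAATATATTATAAT"
read_c = "GCGCGGCGCCGCGGCG"

dist_ab = euclidean_distance(qmer_distribution(read_a), qmer_distribution(read_b))
dist_ac = euclidean_distance(qmer_distribution(read_a), qmer_distribution(read_c))

# Reads with similar composition end up closer to each other.
print("a-b distance smaller than a-c distance:", dist_ab < dist_ac)
```

A taxonomy-independent binner would feed such distances (or the vectors themselves) into a clustering step. With real data, longer reads and a larger q, such as tetranucleotides, are typically needed for the distributions to be informative.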

The following table shows a list of commonly used tools for metagenomic binning.
Category | Year | Tool | Short description | URL
Taxonomy-dependent methods | 2012 | AmphoraNet | The webserver implementation of the AMPHORA2 pipeline for metagenomic analysis of shotgun sequencing data. | AmphoraNet
Taxonomy-dependent methods | 2008 | CARMA | A software pipeline for characterizing the taxonomic composition and genetic diversity of short-read metagenomes. | CARMA
Taxonomy-dependent methods | 2011 | ClaMS | A sequence composition-based classifier for metagenomic sequences. | ClaMS
Taxonomy-dependent methods | 2010 | DiScRIBinATE | Distance Score Ratio for Improved Binning and Taxonomic Estimation. | DiScRIBinATE
Taxonomy-dependent methods | 2012 | Genometa | A Java based local bioinformatics program which allows rapid analysis of metagenomic short read datasets. | Genometa
Taxonomy-dependent methods | 2014 | KRAKEN | A system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. | KRAKEN
Taxonomy-dependent methods | 2013 | LMAT | Designed to efficiently assign taxonomic labels to as many reads as possible in very large metagenomic datasets and report the taxonomic profile of the input sample. | LMAT
Taxonomy-dependent methods | 2010 | MARTA | This java-based software blasts each sequence that you provide it, and then looks for a consensus taxon among the top-hits returned from blast. | MARTA
Taxonomy-dependent methods | 2012 | MetaPhlAn | A computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn relies on unique clade-specific marker genes identified from 3,000 reference genomes. | MetaPhlAn
Taxonomy-dependent methods | 2011 | MetaPhyler | A taxonomic classifier for metagenomic shotgun reads, which uses phylogenetic marker genes as a taxonomic reference. | MetaPhyler
Taxonomy-dependent methods | 2010 | MG-RAST | An automated analysis platform for metagenomes providing quantitative insights into microbial populations based on sequence data. | MG-RAST
Taxonomy-dependent methods | 2010 | MLTreeMap | Analyzes DNA sequences and determines their most likely phylogenetic origin. | MLTreeMap
Taxonomy-dependent methods | 2014 | MyTaxa | A homology-based bioinformatics framework to classify metagenomic and genomic sequences. | MyTaxa
Taxonomy-dependent methods | 2012 | NBC | The Naïve Bayes Classification tool webserver for taxonomic classification of metagenomic reads. | NBC
Taxonomy-dependent methods | 2007 | PhyloPythia | Accurate phylogenetic classification of variable-length DNA fragments. | PhyloPythia
Taxonomy-dependent methods | 2012 | PhyloPythiaS | The Web Server for Taxonomic Assignment of Metagenome Sequences. | PhyloPythiaS
Taxonomy-dependent methods | 2009 | Phymm/PhymmBL | Phylogenetic Classification of Metagenomic Data with Interpolated Markov Models. | Phymm/PhymmBL
Taxonomy-dependent methods | 2010 | Pplacer | Places query sequences on a fixed reference phylogenetic tree to maximize phylogenetic likelihood or posterior probability according to a reference alignment. | Pplacer
Taxonomy-dependent methods | 2011 | ProViDE | A novel similarity based binning algorithm that uses a customized set of alignment parameter thresholds/ranges, specifically suited for the accurate taxonomic labelling of viral metagenomic sequences. | ProViDE
Taxonomy-dependent methods | 2011 | RAIphy | A semi-supervised metagenomic fragment classification program. | RAIphy
Taxonomy-dependent methods | 2012 | Sequedex | A signature-based method to classify the function and phylogeny of reads as short as 30 bp. | Sequedex
Taxonomy-dependent methods | 2009 | SOrt-ITEMS | Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. | SOrt-ITEMS
Taxonomy-dependent methods | 2011 | SPHINX | A hybrid binning approach that achieves high binning efficiency by utilizing both 'compositional' and 'similarity' features of the query sequence during the binning process. | SPHINX
Taxonomy-dependent methods | 2009 | TACOA | Software that can accurately predict the taxonomic origin of genomic fragments from metagenomic data sets by combining the advantages of the k-NN approach with a smoothing kernel function. | TACOA
Taxonomy-dependent methods | 2011 | TaxSOM | A tool for taxonomic classification of DNA fragments, as they are typically obtained in metagenome projects. | TaxSOM
Taxonomy-dependent methods | 2010 | Treephyler | A tool for fast taxonomic profiling of metagenomes. | Treephyler
Taxonomy-dependent methods | 2009 | WebCARMA | Taxonomic classification of metagenomic shotgun sequences. | WebCARMA
Taxonomy-dependent methods | 2013 | MEGAN5 | Interactively analyze and compare metagenomic and metatranscriptomic data, both taxonomically and functionally. | MEGAN5
Taxonomy-dependent methods | 2011 | ProViDE | A software tool for accurate estimation of viral diversity in metagenomic samples. | ProViDE
Taxonomy-dependent methods | 2011 | PaPaRa | Parsimony-based Phylogeny-Aware Read alignment program. | PaPaRa
Taxonomy-dependent methods | 2014 | MetaCluster-TA | A software for binning and annotating short paired-end reads. | MetaCluster-TA
Taxonomy-independent methods | 2011 | AbundanceBin | An abundance-based tool for binning metagenomic sequences, such that the reads classified in a bin belong to species of identical or very similar abundances. | AbundanceBin
Taxonomy-independent methods | 2008 | CompostBin | A DNA-composition-based binning algorithm for classifying metagenomic reads. | CompostBin
Taxonomy-independent methods | 2012 | MetaCluster 5.0 | MetaCluster5.0 is an unsupervised binning method. | MetaCluster 5.0
Taxonomy-independent methods | 2004 | TETRA | The standalone-programs can be used to calculate, how well tetranucleotide usage patterns in DNA sequences correlate. | TETRA
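
Composition-based taxonomy-dependent methods compare each read with pre-computed models built from reference sequences and assign the read to the best-scoring model. The following Python sketch illustrates that general idea with a fixed-order Markov chain. It is only a sketch under stated assumptions and is not the implementation of MBMC, Phymm, or any other tool listed above: the function names, the order of 2, the pseudocount smoothing, the uniform fallback for unseen contexts, and the toy reference sequences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_markov_model(sequence, order=2, pseudocount=1.0):
    """Estimate smoothed log transition probabilities from one reference sequence."""
    sequence = sequence.upper()
    counts = defaultdict(Counter)
    for i in range(len(sequence) - order):
        context, nxt = sequence[i:i + order], sequence[i + order]
        if nxt in ALPHABET and all(c in ALPHABET for c in context):
            counts[context][nxt] += 1
    model = {}
    for context, nxt_counts in counts.items():
        total = sum(nxt_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {
            base: math.log((nxt_counts[base] + pseudocount) / total)
            for base in ALPHABET
        }
    return model


def log_likelihood(read, model, order=2):
    """Sum the log transition probabilities of a read under a model.

    Contexts never seen during training fall back to a uniform
    probability (an assumption of this sketch).
    """
    read = read.upper()
    uniform = math.log(1.0 / len(ALPHABET))
    score = 0.0
    for i in range(len(read) - order):
        context, nxt = read[i:i + order], read[i + order]
        if nxt not in ALPHABET:
            continue  # skip ambiguous bases such as N
        score += model.get(context, {}).get(nxt, uniform)
    return score


def classify(read, models, order=2):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(read, models[label], order))


# Toy reference sequences (invented for this example).
toy_references = {
    "taxon_AT_rich": "ATTATAATATTAATATATTA" * 5,
    "taxon_GC_rich": "GCCGCGGCGCCGGCGCGCCG" * 5,
}
models = {label: train_markov_model(seq) for label, seq in toy_references.items()}

print(classify("TATATTAATA", models))
print(classify("GCGCCGGCGC", models))
```

Real classifiers train such models on complete reference genomes, and often add a score threshold so that reads from taxa absent from the reference database are left unassigned.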

There is no standard for the taxonomic classification of metagenome sequences. Taxonomic sequence classification can also be error-prone, particularly for habitats with complex diversity or a high proportion of as-yet barely characterized taxa.

References:
1. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev. 2008 Dec;72(4):557-78. doi: 10.1128/MMBR.00009-08.
2. Mande SS, Mohammed MH, Ghosh TS. Classification of metagenomic sequences: methods and challenges. Brief Bioinform. 2012 Nov;13(6):669-81. doi: 10.1093/bib/bbs054. Epub 2012 Sep 8.
3. Wang Y, Leung HC, Yiu SM, Chin FY. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics. 2012 Sep 15;28(18):i356-i362. doi: 10.1093/bioinformatics/bts397.