Assembly
Assembly is the process of combining sequence reads into contiguous stretches of DNA called contigs, based on sequence similarity between reads. The consensus sequence for a contig is either based on the highest-quality nucleotide in any given read at each position or based on majority rule.
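The two consensus rules can be illustrated with a minimal sketch. It assumes the reads have already been aligned and padded to equal length with "-" where a read does not cover a position. The function names and the per-base quality lists are hypothetical and are not taken from any particular assembler.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Majority-rule consensus over equal-length reads padded with '-'."""
    consensus = []
    for column in zip(*aligned_reads):
        # Count only bases from reads that actually cover this position.
        counts = Counter(base for base in column if base != "-")
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


def quality_consensus(aligned_reads, qualities):
    """Per position, take the base carrying the highest quality score."""
    consensus = []
    for bases, quals in zip(zip(*aligned_reads), zip(*qualities)):
        covered = [(q, b) for b, q in zip(bases, quals) if b != "-"]
        consensus.append(max(covered)[1] if covered else "N")
    return "".join(consensus)


# Toy input (assumed): three overlapping reads with made-up quality scores.
reads = ["ACGT--", "ACCTAG", "-CGTAG"]
quals = [
    [30, 30, 10, 30, 0, 0],
    [20, 20, 40, 20, 20, 20],
    [0, 25, 15, 25, 25, 25],
]
print(majority_consensus(reads))        # ACGTAG
print(quality_consensus(reads, quals))  # ACCTAG
```

At the third position the two rules disagree: two reads vote for G, but the single C has the highest quality score, so the consensus depends on which rule is applied.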
Two strategies can be employed for metagenomic samples: reference-based assembly (co-assembly) and de novo assembly.
  • Reference-based assembly works well if the metagenomic dataset contains sequences for which closely related reference genomes are available. However, differences between the true genome in the sample and the reference, such as large insertions, deletions, or polymorphisms, can mean that the assembly is fragmented or that divergent regions are not covered.
  • De novo assembly typically requires larger computational resources. A whole class of assembly tools based on de Bruijn graphs was created specifically to handle very large amounts of data. Machine requirements for de Bruijn assemblers are still significantly higher than for reference-based assembly (co-assembly).
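To make the de Bruijn graph idea concrete, the following is a minimal sketch rather than the algorithm of any specific assembler. It assumes error-free reads on a single strand. The helper names `build_de_bruijn` and `walk_unbranched` are hypothetical. Real assemblers also handle reverse complements, sequencing errors, and branching at repeats.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from `start` while exactly one successor exists.

    Stops at a dead end, at a branch, or when a node would be revisited.
    """
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        path += nxt[-1]  # each step adds one new base to the contig
        seen.add(nxt)
        node = nxt
    return path


# Toy input (assumed): three overlapping reads from the sequence ACGTCAGA.
reads = ["ACGTC", "GTCAG", "TCAGA"]
graph = build_de_bruijn(reads, k=3)
print(walk_unbranched(graph, "AC"))  # ACGTCAGA
```

Because the graph stores each distinct k-mer once regardless of how many reads contain it, its size grows with the complexity of the sample rather than with the raw number of reads, which is why this representation suits very large short-read datasets.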
The following table lists commonly used tools for sequence assembly. Some of them are specific to metagenomic assembly. Metagenome assemblers differ from conventional genome assemblers in that they are designed for data containing more than one species, so they generally have algorithms in place to separate species where possible, decreasing the number of chimeric contigs constructed. Unlike conventional genome assemblers, they also tend not to rely on even coverage as a means of verifying assemblies (the number of reads underlying each consensus base is called depth or coverage), since coverage is not even in metagenomes because species have different abundances.
| Year | Tool | Short description | URL |
|---|---|---|---|
| 2002 | Arachne | Arachne was designed for long Sanger-chemistry reads. | Arachne |
| 2004 | Celera | Celera Assembler is a de novo whole-genome shotgun (WGS) DNA sequence assembler. It reconstructs long sequences of genomic DNA from fragmentary data produced by whole-genome shotgun sequencing. | Celera Assembler |
| 2007 | PHRAP | phrap is a program for assembling shotgun DNA sequence data. Among other features, it allows use of the entire read and not just the trimmed high-quality part; it uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy in the presence of repeats; it constructs the contig sequence as a mosaic of the highest-quality read segments rather than a consensus; it provides extensive assembly information to assist in troubleshooting assembly problems; and it handles large datasets. | PHRAP |
| 2008 | Velvet | Velvet is a de novo genomic assembler specially designed for short-read sequencing technologies, such as Solexa or 454. | Velvet |
| 2010 | SOAPdenovo | SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. | SOAPdenovo |
| 2011 | Genovo | Genovo uses a probabilistic model that calculates different coverage values to assemble metagenomes. | Genovo |
| 2011 | Meta-IDBA | Meta-IDBA is an iterative de Bruijn graph short-read assembler specially designed for de novo metagenomic assembly. | Meta-IDBA |
| 2011 | Minimo | Minimo is designed to assemble small datasets and has been used for virome analyses. | AMOS |
| 2012 | MetaVelvet | MetaVelvet is an extension of the Velvet assembler to de novo metagenome assembly from short sequence reads. | MetaVelvet |
| 2012 | IDBA-UD | IDBA-UD is an iterative de Bruijn graph de novo assembler for short-read sequencing data with highly uneven sequencing depth. It is an extension of the IDBA algorithm. | IDBA-UD |
| 2012 | MAP | MAP is a de novo metagenomic assembly program for shotgun DNA reads. | MAP |
| 2012 | MOCAT | MOCAT is a metagenomics assembly and gene prediction toolkit. | MOCAT |
| 2012 | GeneStitch | GeneStitch is a novel way of using the de Bruijn graph assembly of metagenomes to improve the assembly of genes. | GeneStitch |
| 2012 | Ray Meta | Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely colored k-mers. | Ray Meta |
| 2012 | VICUNA | VICUNA is a de novo assembly program targeting populations with high mutation rates. | VICUNA |
| 2013 | MetAMOS | MetAMOS is a modular and open-source metagenomic assembly and analysis pipeline. | MetAMOS |
| 2014 | GARM | GARM (Genome Assembler, Reconciliation and Merging) is a new software pipeline to merge and reconcile assemblies from different algorithms or sequencing technologies. | GARM |
| 2013 | PRICE | PRICE (Paired-Read Iterative Contig Extension) is a de novo genome assembler implemented in C++. | PRICE |
| 2013 | Xgenovo | Xgenovo generates quality assemblies with paired-end reads. | Xgenovo |
| n/a | Newbler | Newbler is a software package for de novo DNA sequence assembly. It is designed specifically for assembling sequence data generated by the 454 GS-series of pyrosequencing platforms sold by 454 Life Sciences, a Roche Diagnostics company. | Newbler |
In all but the most species-poor metagenomes, a full assembly is not possible:
  • Sampling is incomplete: many, if not all, species’ genomes are only partially sampled, if they are sampled at all.
  • Species information is itself incomplete, and it is difficult to map individual reads to their species of origin.
As a result, analysis of genomic elements using metagenomic data is generally limited to the first three or four rows in the following table.
The following table shows the information contained in different lengths of genomic DNA.

Source: Table 2 of "A primer on metagenomics."

Some metagenomic assembly problems include:
  • Coverage, defined as the mean number of times each nucleotide of a genome is sequenced, is usually incomplete, since environmental sequence sampling rarely produces all the sequences required for assembly.
  • There is also the danger of assembling sequences from different operational taxonomic units (OTUs), creating interspecies chimeras.
  • Short reads must be produced in large quantities, and their short length means that many reads are identical or nearly identical.
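The coverage definition used above can be sketched as a short calculation. The example assumes reads have already been placed on a contig as half-open (start, end) intervals. The function names and the toy intervals are hypothetical and serve only to illustrate mean coverage and the fraction of positions sequenced at least once.

```python
def per_base_depth(alignments, length):
    """Depth at each position, given half-open (start, end) read placements."""
    depth = [0] * length
    for start, end in alignments:
        # Clip each placement to the contig boundaries before counting.
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += 1
    return depth


def mean_coverage(depth):
    """Mean number of times a position is sequenced."""
    return sum(depth) / len(depth)


def covered_fraction(depth):
    """Fraction of positions sequenced at least once."""
    return sum(1 for d in depth if d > 0) / len(depth)


# Toy input (assumed): three reads placed on a 10-base contig.
alignments = [(0, 4), (2, 6), (3, 8)]
depth = per_base_depth(alignments, length=10)
print(depth)                    # [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
print(mean_coverage(depth))     # 1.3
print(covered_fraction(depth))  # 0.8
```

A mean coverage above 1 does not imply that every base was sequenced: here the last two positions have depth 0, which is the kind of gap that prevents a complete assembly.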
Other assembling problems are posed by the sequencing technologies as summarized in the following table.
| Sequencing technology feature | Assembly challenge |
|---|---|
| Short reads | Difficulty assembling repeats |
| Mate-pairs absent or difficult/expensive to obtain | Difficulty assembling repeats; lack of scaffolding information |
| New types of errors | Need to modify existing software and/or incorporate technology-specific features in assembly software |
| Large amounts of data (number of reads and size of auxiliary information) | Efficiency issues; require parallel implementations or specialized hardware when applied to large genomes |

New sequencing and assembly technologies are expected to address these issues in the future.

References:

1. Vázquez-Castellanos, Jorge F., et al. "Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut." BMC genomics 15.1 (2014): 37.
2. Prakash, Tulika, and Todd D. Taylor. "Functional assignment of metagenomic data: challenges and applications." Briefings in bioinformatics 13.6 (2012): 711-727.
3. Thomas, Torsten, Jack Gilbert, and Folker Meyer. "Metagenomics - a guide from sampling to data analysis." Microb Inform Exp 2.3 (2012).
4. Pop, Mihai. "Genome assembly reborn: recent computational challenges." Briefings in bioinformatics 10.4 (2009): 354-366.