Assembly
Assembly is the process of combining sequence reads into contiguous stretches of DNA called contigs, based on sequence similarity between reads. The consensus sequence for a contig is either based on the highest-quality nucleotide in any given read at each position or based on majority rule.
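The two consensus rules mentioned above can be illustrated with a minimal sketch. It assumes the reads have already been aligned into rows of equal length, with "-" marking positions a read does not cover; the function names, the example reads, and the quality scores are hypothetical and chosen only for illustration.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Majority-rule consensus: the most frequent base in each column."""
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        # "N" marks a column that no read covers.
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)

def quality_consensus(aligned_reads, qualities):
    """Quality-based consensus: the base with the highest quality per column."""
    consensus = []
    for bases, quals in zip(zip(*aligned_reads), zip(*qualities)):
        covered = [(q, b) for b, q in zip(bases, quals) if b != "-"]
        consensus.append(max(covered)[1] if covered else "N")
    return "".join(consensus)

# Hypothetical aligned reads and per-base quality scores.
reads = [
    "ACGTAC--",
    "-CGTTCGA",
    "--GTACGA",
]
qualities = [
    [30, 30, 30, 30, 10, 30, 0, 0],
    [0, 30, 30, 30, 40, 30, 30, 30],
    [0, 0, 30, 30, 20, 30, 30, 30],
]

print(majority_consensus(reads))            # ACGTACGA
print(quality_consensus(reads, qualities))  # ACGTTCGA
```

The two rules disagree at the fifth position: two low-quality reads support A, while one high-quality read supports T.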
Two strategies can be employed for metagenomics samples: reference-based assembly (co-assembly), and de novo assembly.
- Reference-based assembly works well if the metagenomic dataset contains sequences for which closely related reference genomes are available. However, differences between the true genome in the sample and the reference, such as large insertions, deletions, or polymorphisms, can leave the assembly fragmented or leave divergent regions uncovered.
- De novo assembly typically requires larger computational resources. A whole class of assembly tools based on de Bruijn graphs was created specifically to handle very large amounts of data. Even so, the machine requirements of de Bruijn assemblers are still significantly higher than those of reference-based assembly (co-assembly).
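To make the de Bruijn graph idea concrete, here is a toy sketch, not the algorithm of any particular assembler. Each read is broken into k-mers, every k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by walking while the path is unambiguous. The function names and example reads are hypothetical, and the sketch assumes error-free reads from a single strand and ignores incoming branches.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop on a cycle instead of looping forever
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig

# Hypothetical overlapping reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
print(walk_unambiguous(graph, "ATG"))  # ATGGCGTGCA
```

The graph stores each distinct k-mer once no matter how many reads contain it, which is one reason this representation suits very large read sets.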
The following table lists commonly used tools for sequence assembly, some of which are specific to metagenomic assembly. Metagenome assemblers differ from conventional genome assemblers in that they are designed for data containing more than one species, so they generally include algorithms to separate species where possible, which reduces the number of chimeric contigs constructed. Unlike conventional genome assemblers, they also tend not to rely on even coverage as a means of verifying assemblies (the number of reads underlying each consensus base is called depth or coverage), because coverage is uneven in metagenomes where species differ in abundance.
Year | Tool | Short description | URL |
2002 | Arachne | Arachne was designed for long Sanger-chemistry reads. | Arachne |
2004 | Celera | Celera Assembler is a de novo whole-genome shotgun (WGS) DNA sequence assembler. It reconstructs long sequences of genomic DNA from fragmentary data produced by whole-genome shotgun sequencing. | Celera Assembler |
2007 | PHRAP | phrap is a program for assembling shotgun DNA sequence data. Among other features, it allows use of the entire read and not just the trimmed high quality part, it uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy in the presence of repeats, it constructs the contig sequence as a mosaic of the highest quality read segments rather than a consensus, it provides extensive assembly information to assist in trouble-shooting assembly problems, and it handles large datasets. | PHRAP |
2008 | Velvet | Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. | Velvet |
2010 | SOAPdenovo | SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for the human-sized genomes. The program is specially designed to assemble Illumina GA short reads. | SOAPdenovo |
2011 | Genovo | Genovo uses a probabilistic model that calculates different coverage values to assemble metagenomes | Genovo |
2011 | Meta-IDBA | Meta-IDBA is an iterative De Bruijn Graph De Novo short read assembler specially designed for de novo metagenomic assembly. | Meta-IDBA |
2011 | Minimo | Minimo is designed to assemble small datasets and has been used for virome analyses | AMOS |
2012 | MetaVelvet | MetaVelvet is an extension of Velvet assembler to de novo metagenome assembly from short sequence reads | MetaVelvet |
2012 | IDBA-UD | IDBA-UD is an iterative De Bruijn Graph De Novo Assembler for short-read sequencing data with highly uneven sequencing depth. It is an extension of the IDBA algorithm. | IDBA-UD |
2012 | MAP | MAP is a de novo metagenomic assembly program for shotgun DNA reads. | MAP |
2012 | MOCAT | MOCAT is a metagenomics assembly and gene prediction toolkit. | MOCAT |
2012 | GeneStitch | GeneStitch is a novel way of using the de Bruijn graph assembly of metagenomes to improve the assembly of genes. | GeneStitch |
2012 | Ray Meta | Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. | Ray Meta |
2012 | VICUNA | VICUNA is a de novo assembly program targeting populations with high mutation rates. | VICUNA |
2013 | MetAMOS | MetAMOS is a modular and open source metagenomic assembly and analysis pipeline. | MetAMOS |
2013 | PRICE | PRICE (Paired-Read Iterative Contig Extension) is a de novo genome assembler implemented in C++. | PRICE |
2013 | Xgenovo | Xgenovo generates quality assemblies with paired-end reads. | Xgenovo |
2014 | GARM | GARM (Genome Assembler, Reconciliation and Merging) is a software pipeline that merges and reconciles assemblies from different algorithms or sequencing technologies. | GARM |
na | Newbler | Newbler is a software package for de novo DNA sequence assembly. It is designed specifically for assembling sequence data generated by the 454 GS-series of pyrosequencing platforms sold by 454 Life Sciences, a Roche Diagnostics company. | Newbler |
In all but the most species-poor metagenomes, a full assembly is not possible:
- Sampling is incomplete, and many if not all species' genomes are only partially sampled, if at all.
- Species information is itself incomplete, and it is difficult to map individual reads to their species of origin.
Analysis of genomic elements using metagenomic data is generally limited to the first three or four rows of the following table, which shows the information contained in different lengths of genomic DNA.
Source: Table 2 of "A primer on metagenomics."
Problems specific to metagenomic assembly include:
- Coverage (the mean number of times each nucleotide of a genome is sequenced) is usually incomplete, since environmental sequence sampling rarely produces all the sequences required for assembly.
- There is also a danger of assembling sequences from different OTUs into a single contig, creating interspecies chimeras.
- Short reads must be produced in large quantities, and their short length means that many reads are identical or nearly identical.
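The coverage problem can be made tangible with a back-of-the-envelope calculation. The sketch below computes mean coverage as (number of reads × read length) / genome size and, assuming reads land uniformly at random (an idealized Poisson model that real samples only approximate), estimates the fraction of bases left uncovered as e^(-coverage). The community, read counts, and genome sizes are hypothetical.

```python
import math

def mean_coverage(num_reads, read_length, genome_size):
    """Mean number of times each base is sequenced: N * L / G."""
    return num_reads * read_length / genome_size

def expected_uncovered_fraction(coverage):
    """Fraction of bases expected to receive no read under a Poisson model."""
    return math.exp(-coverage)

def coverage_table(community, total_reads, read_length):
    """Per-species (coverage, uncovered fraction) for a share of the total reads."""
    table = {}
    for name, (read_share, genome_size) in community.items():
        cov = mean_coverage(total_reads * read_share, read_length, genome_size)
        table[name] = (cov, expected_uncovered_fraction(cov))
    return table

# Hypothetical community: (share of reads, genome size in bp) per species.
TOTAL_READS = 1_000_000
READ_LENGTH = 100
community = {
    "species_A": (0.900, 5_000_000),
    "species_B": (0.099, 5_000_000),
    "species_C": (0.001, 5_000_000),
}

for name, (cov, gap) in coverage_table(community, TOTAL_READS, READ_LENGTH).items():
    print(f"{name}: {cov:.2f}x coverage, about {gap:.1%} of bases uncovered")
```

With these assumed numbers the dominant species reaches roughly 18x coverage, while the rarest one stays near 0.02x, so almost none of its genome is sampled even though a million reads were produced.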
Other assembly problems are posed by the sequencing technologies themselves, as summarized in the following table.
Sequencing technology feature | Assembly challenge |
Short reads | Difficulty assembling repeats |
Mate-pairs absent or difficult/expensive to obtain | Difficulty assembling repeats; lack of scaffolding information |
New types of errors | Need to modify existing software and/or incorporate technology-specific features in assembly software |
Large amounts of data (number of reads and size of auxiliary information) | Efficiency issues; parallel implementations or specialized hardware are required when applied to large genomes |
New sequencing and assembly technologies are expected to address these issues in the future.
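Why short reads make repeats hard to assemble can be shown with a toy de Bruijn graph. In the sketch below, a hypothetical 16-base sequence contains the 5-base repeat ACGGC twice. When the reads (and therefore k) are too short to span the repeat, the graph contains a node with two possible continuations; with longer reads and a larger k the ambiguity disappears. All names and sequences are illustrative, and error-free reads starting at every position are assumed.

```python
from collections import defaultdict

def tile_reads(genome, read_length):
    """Error-free reads starting at every position (an idealized assumption)."""
    return [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]

def branching_nodes(reads, k):
    """Return the (k-1)-mers that have more than one successor in the de Bruijn graph."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

# Hypothetical sequence: the repeat ACGGC occurs twice with different flanks.
GENOME = "TTACGGCAAACGGCTT"

print(branching_nodes(tile_reads(GENOME, 6), k=4))  # ['GGC']
print(branching_nodes(tile_reads(GENOME, 8), k=7))  # []
```

At k=4 the assembler cannot tell whether GGC is followed by A or T without extra information such as mate-pairs, which is the scaffolding information the table refers to.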
References:
1. Vázquez-Castellanos, Jorge F., et al. "Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut." BMC Genomics 15.1 (2014): 37.
2. Prakash, Tulika, and Todd D. Taylor. "Functional assignment of metagenomic data: challenges and applications." Briefings in Bioinformatics 13.6 (2012): 711-727.
3. Thomas, Torsten, Jack Gilbert, and Folker Meyer. "Metagenomics - a guide from sampling to data analysis." Microb Inform Exp 2.3 (2012).
4. Pop, Mihai. "Genome assembly reborn: recent computational challenges." Briefings in Bioinformatics 10.4 (2009): 354-366.