Metagenomics

Sampling and Sequencing

1.1 Sampling:

The first step in a metagenomic study is to obtain the environmental sample. The extracted DNA should be representative of all cells present in the sample, and sufficient amounts of high-quality nucleic acids must be obtained for subsequent library production and sequencing.

If the target community is associated with a host, then either fractionation or selective lysis might be suitable to ensure that minimal host DNA is obtained. Physical separation and isolation of cells from the samples might also be important to maximize DNA yield or to avoid coextraction of enzymatic inhibitors (such as humic acids) that might interfere with subsequent processing. Certain types of samples often yield only very small amounts of DNA. Library production for most sequencing technologies requires high nanogram or microgram amounts of DNA, and hence amplification of the starting material might be required. However, it is necessary to consider whether amplification is permissible, as it carries potential problems of reagent contamination, chimera formation and sequence bias.

Metadata are the 'data about the data': they include detailed information about the three-dimensional geography (including depth or height) and environmental features of the sample, physical data about the sample site, and the methodology of the sampling. There is a great need to record metadata in a standard and comprehensive form that is amenable to analysis, as this may lead to biologically significant discoveries from statistically significant correlations between the metagenomic data and the habitat-associated metadata.
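The kinds of metadata listed above can be sketched as a small structured record. This is only an illustration: the class name, the field names and the example values are all hypothetical and are not taken from any published metadata standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SampleMetadata:
    """Illustrative metadata record for one environmental sample.

    All field names are hypothetical and do not follow a published standard.
    """
    sample_id: str
    latitude_deg: float
    longitude_deg: float
    habitat: str
    collection_date: str                   # ISO 8601 date string
    sampling_method: str
    depth_m: Optional[float] = None        # depth below the surface, if relevant
    elevation_m: Optional[float] = None    # height above sea level, if relevant
    temperature_c: Optional[float] = None  # one example of physical site data

    def validate(self) -> None:
        """Reject values that cannot describe a real sampling position."""
        if not -90.0 <= self.latitude_deg <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude_deg}")
        if not -180.0 <= self.longitude_deg <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude_deg}")
        if self.depth_m is not None and self.depth_m < 0:
            raise ValueError("depth must be non-negative")


# Made-up example values, for illustration only.
record = SampleMetadata(
    sample_id="sample-001",
    latitude_deg=-35.3,
    longitude_deg=142.8,
    habitat="hypersaline lake water",
    collection_date="2010-01-15",
    sampling_method="0.2 um filtration",
    depth_m=0.3,
    temperature_c=25.0,
)
record.validate()
print(json.dumps(asdict(record), indent=2))
```

Storing every sample in the same fixed set of fields is what makes later correlation analysis between sequence data and habitat data practical, because missing values are explicit rather than silently absent.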

1.2 Sequencing:

Over the past 10 years metagenomic shotgun sequencing has gradually shifted from classical Sanger sequencing technology to next-generation sequencing (NGS). There are two general sequencing strategies to obtain genome sequence data from microbiome samples: directed sequencing and shotgun sequencing of random clones. Directed sequencing is either (i) function-driven, whereby clone libraries from a microbiome sample are sequenced after being screened for a desired function; or (ii) driven by phylogenetic markers, whereby the DNA flanking taxonomic anchors, such as 16S rDNA, is sequenced in large-insert libraries. Shotgun sequencing of microbiome sample clone libraries follows a relatively unbiased approach, which provides a broad survey of the gene content and metabolic capabilities of a microbiome.

Both sequencing methods have their advantages and disadvantages, as shown below:

Method	Advantages	Disadvantages
Directed sequencing	Sequencing can be focused on any taxon of interest, regardless of its prevalence in the community; can determine linkage between large genome regions with confidence (within a single individual); good for microbial communities with high diversity	Does not reconstruct entire genomes; cannot identify novel types; sequence data are focused on a single group, not the entire community
Random shotgun sequencing	All genomes in the sample are sequenced; can identify novel types; can assemble full genomes of dominant types; good for communities with low diversity; good for communities with few dominant species	Only dominant genomes are well represented; linkage between genome regions (contigs) is only inferred; automated genome assembly is problematic and requires manual checking for some assemblies

A combination of shotgun and directed sequencing approaches may emerge in the future, combining the broad coverage provided by shotgun sequencing with the ability to sample specific genome areas in low-abundance organisms without over-sequencing more abundant members of the microbiome. Here our discussion pertains to metagenome data generated using shotgun sequencing. Shotgun metagenomics can be divided into the following categories:

Fosmid, cosmid, and bacterial artificial chromosome (BAC)-derived metagenomic studies
Sanger sequencing–derived shotgun metagenomic studies
Next generation sequencing–derived shotgun metagenomic studies

Of the NGS technologies, both the 454/Roche and the Illumina/Solexa systems have now been extensively applied to metagenomic samples. A few additional sequencing technologies are available that might prove useful for metagenomic applications, now or in the near future. The Applied Biosystems SOLiD sequencer has been extensively used, for example, in genome resequencing. While none of the emerging sequencing technologies has been thoroughly applied and tested with metagenomic samples, they offer promising alternatives and even further cost reduction.

The following table shows a comparison between the yield, fragment length, and run times of the different sequencers.

Read length, error rate and throughput/coverage of NGS technologies determine the resolution at which we can investigate gene inventories of natural microbial communities. In this respect, advances in sequencing technologies will continue to shape the field of metagenomics and extend our possibilities to address habitats of increasing complexity.
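How read length and throughput translate into coverage can be illustrated with the standard Lander-Waterman estimate, coverage = (number of reads × read length) / genome size. The sketch below extends it to a community member by scaling with the fraction of the sample's DNA that comes from that organism; this scaling, the function names and the numbers in the example are simplifying assumptions made for illustration, and real samples deviate from them (for example through extraction and sequencing bias).

```python
import math


def expected_coverage(num_reads: int, read_length_bp: int,
                      genome_size_bp: int, dna_fraction: float = 1.0) -> float:
    """Average per-base coverage of one genome in a shotgun data set.

    dna_fraction is the assumed share of the sample's DNA that belongs to
    this genome (1.0 for a pure culture).
    """
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    if not 0.0 <= dna_fraction <= 1.0:
        raise ValueError("dna_fraction must lie between 0 and 1")
    return dna_fraction * num_reads * read_length_bp / genome_size_bp


def fraction_of_genome_covered(coverage: float) -> float:
    """Lander-Waterman estimate of the share of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


# Hypothetical example: one million 100 bp reads, a 5 Mbp genome that
# contributes 1% of the DNA in the sample.
cov = expected_coverage(1_000_000, 100, 5_000_000, dna_fraction=0.01)
print(f"coverage: {cov:.2f}x, genome covered: {fraction_of_genome_covered(cov):.1%}")
```

Under these assumptions a rare organism receives well under 1x coverage, which is one way to see why only dominant genomes are well represented in shotgun data from complex habitats.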

1.3 Sequence Read Preprocessing:

Preprocessing comprises:

base calling of raw data coming off the sequencing machines: the procedure of identifying DNA bases from the readout of a sequencing machine (tools: phred, Paracel's TraceTuner, ABI's KB, etc.)
vector screening to remove cloning vector sequence: the process of removing cloning vector sequences from base-called sequence reads (tools: cross_match, LUCY, vector_clip, etc.)
quality trimming to remove low-quality bases (as determined by base calling)
contaminant screening to remove verifiable sequence contaminants

Errors in each of these steps can have greater downstream consequences in metagenomes than in genomes.
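The quality trimming step can be sketched as follows. Base callers report a Phred quality score Q for each base, related to the estimated error probability P by Q = -10 log10(P). The example assumes qualities are stored as ASCII characters with an offset of 33 (the common FASTQ encoding) and simply removes low-quality bases from both ends of a read. The function names and the threshold of 20 are illustrative assumptions, and this is not the algorithm of any of the tools named above.

```python
from typing import List, Tuple


def phred_scores(quality_string: str, offset: int = 33) -> List[int]:
    """Decode an ASCII-encoded quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(phred_score: float) -> float:
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


def trim_low_quality_ends(sequence: str, quality_string: str,
                          min_quality: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Remove bases with quality below min_quality from both ends of a read."""
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    scores = phred_scores(quality_string, offset)
    start, end = 0, len(scores)
    # Advance past low-quality bases at the 5' end.
    while start < end and scores[start] < min_quality:
        start += 1
    # Step back over low-quality bases at the 3' end.
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], quality_string[start:end]


# Hypothetical read: '#' encodes Q2 and 'I' encodes Q40 at offset 33.
trimmed_seq, trimmed_qual = trim_low_quality_ends("ACGTACGT", "##IIII##")
print(trimmed_seq, trimmed_qual)
```

A read whose bases all fall below the threshold is returned as an empty string, so the caller can discard it rather than pass unreliable sequence on to assembly or gene calling.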
