Shotgun Metagenomics

Microorganisms are omnipresent in nature. Diverse communities of microbes thrive in environments ranging from soil, water bodies, and the human gut to inhospitable habitats such as acid mines and hot springs. Multiple studies based on cultured microbial communities have highlighted the importance of these microbes and the roles they play in their ecosystems. Several microbes, however, cannot be cultured, and microbial diversity is often lost when attempts are made to establish a microbial community under laboratory conditions. 16S rRNA gene sequencing and shotgun metagenomics provide a window into microbial genetic diversity and help in understanding microbial interactions. Shotgun metagenomic sequencing makes it possible to sample the majority of genes across the organisms in a complex sample.

Next-generation sequencing (NGS) provides deep sequencing coverage, which enables the detection of low-abundance members of the microbial community that are missed by conventional methods. Metagenomics provides an opportunity to simultaneously explore two aspects of a microbial community: who is there, and what are they capable of doing?

Sequence library
Metagenomics Data Analysis

Marker Gene Analysis

Marker gene analysis is one of the most straightforward and computationally efficient ways of quantifying a metagenome’s taxonomic diversity. This procedure involves comparing metagenomic reads to a database of taxonomically informative gene families (i.e., marker genes), identifying those reads that are marker gene homologs, and using sequence or phylogenetic similarity to the marker gene database sequences to taxonomically annotate each metagenomic homolog. The most frequently used marker genes include rRNA genes or protein coding genes that tend to be single copy and common to microbial genomes. Because this approach involves comparing metagenomic reads to a relatively small database for the purpose of a similarity search (e.g., not all gene families are taxonomically informative), marker gene analysis can be a relatively rapid way to estimate the diversity of a metagenome.
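As a rough illustration of the last step, the sketch below turns marker gene hits into a taxonomic profile. It assumes the similarity search has already been run and that each hit is a (read, taxon, percent identity) tuple; the function name, the 90% identity cutoff, and the example hits are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def taxonomic_profile(hits, min_identity=90.0):
    """Summarize marker gene hits as relative abundances per taxon.

    hits: iterable of (read_id, taxon, percent_identity) tuples.
    min_identity: hypothetical cutoff below which a hit is ignored.
    """
    best = {}
    for read_id, taxon, identity in hits:
        if identity < min_identity:
            continue  # too dissimilar to count as a marker gene homolog
        # Keep only the most similar database sequence for each read.
        if read_id not in best or identity > best[read_id][1]:
            best[read_id] = (taxon, identity)
    counts = Counter(taxon for taxon, _ in best.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


# Made-up hits for four reads.
hits = [
    ("read1", "Bacteroides", 98.5),
    ("read1", "Prevotella", 91.0),
    ("read2", "Escherichia", 99.1),
    ("read3", "Bacteroides", 95.0),
    ("read4", "Clostridium", 72.0),
]
profile = taxonomic_profile(hits)
```

This sketch assigns each read to its single most similar reference sequence; phylogenetic placement, which the text also mentions, would need a different approach.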


A related strategy, known as binning, attempts to assign every metagenomic sequence to a taxonomic group. Generally, each sequence is either (1) classified into a taxonomic group (e.g., OTU, genus, family) through comparison to reference data or (2) clustered into groups of sequences that represent taxonomic groups based on shared characteristics (e.g., GC content). Binning plays an important role in the analysis of metagenomes. First, depending on the method used, binning may provide insight into the presence of novel genomes that are otherwise difficult to identify. Second, it provides insight into the number and types of distinct taxa in the community.
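The clustering flavor of binning can be sketched with GC content as the only shared characteristic. The example below is a deliberately simple, assumption-laden illustration: the function names, the `max_gap` threshold, and the toy contigs are hypothetical, and practical binning tools typically combine several signals rather than GC content alone.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


def cluster_by_gc(sequences, max_gap=0.15):
    """Group sequences whose GC content lies close together.

    sequences: dict mapping a sequence name to its nucleotide string.
    max_gap: hypothetical threshold; a larger jump in GC content between
             neighbors (after sorting) starts a new cluster.
    """
    ranked = sorted(sequences.items(), key=lambda item: gc_content(item[1]))
    clusters = []
    previous = None
    for name, seq in ranked:
        gc = gc_content(seq)
        if previous is None or gc - previous > max_gap:
            clusters.append([])  # start a new putative taxonomic group
        clusters[-1].append(name)
        previous = gc
    return clusters


# Toy contigs: two AT-rich and two GC-rich sequences.
contigs = {
    "c1": "ATATATTAGC",
    "c2": "ATTAAATACA",
    "c3": "GCGCGGCCAT",
    "c4": "GGCCGCGTCA",
}
clusters = cluster_by_gc(contigs)
```

Each resulting cluster is only a candidate group; whether it corresponds to a real taxon has to be checked by other means.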


Assembly merges collinear metagenomic reads from the same genome into a single contiguous sequence (i.e., contig) and is useful for generating longer sequences, which can simplify bioinformatic analysis relative to unassembled short metagenomic reads. In some instances, complete or nearly complete genomes can be assembled, which provides insight into the genomic composition of uncultured organisms found in a community.
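To make the idea of merging overlapping reads into a contig concrete, here is a toy greedy overlap merger. It is a sketch under strong assumptions: reads are error-free, all come from the same strand, and the minimum overlap of 3 bases is arbitrary. The function names are hypothetical, and this is not how production metagenome assemblers work.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three toy reads sampled from the sequence ATGGCGTACCTGA.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
```

In a real community, reads from different genomes can share similar sequence, so a greedy merge like this one can join sequences that do not belong together.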

What are they doing?


Metagenomes provide insight into a community’s physiology by clarifying the collective functions that are encoded in the genomes of the organisms that make up the community. The functional diversity of a community can be quantified by annotating metagenomic sequences with functions (Figure 3). This usually involves identifying metagenomic reads that contain protein coding sequences and comparing the coding sequence to a database of genes, proteins, protein families, or metabolic pathways for which some functional information is known. The function of the coding sequence is inferred based on its similarity to sequences in the database. Doing this for all metagenomic sequences produces a profile that describes the number of distinct types of functions and their relative abundance in the metagenome. This profile can be used to compare metagenomes to identify those communities that are metabolically similar (Human Microbiome Project Consortium, 2012b), ascertain how various treatments influence the functional composition of the community (Looft et al., 2012), and reveal those functions that associate with specific environmental or host-physiological variables (i.e., biomarkers) and may be useful for environmental or host diagnosis (Morgan et al., 2012). Metagenomes may also reveal the presence of novel genes (Nacke et al., 2012) or provide insight into the ecological conditions associated with those genes for which the function is currently unknown (Buttigieg et al., 2013). In general, metagenome functional annotation involves two non-mutually exclusive steps: gene prediction and gene annotation.
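The functional profile described above, and its use for comparing metagenomes, can be sketched as follows. The function labels (`func_A` and so on) are placeholders rather than real database identifiers, and Bray-Curtis dissimilarity is used here only as one common choice of comparison measure; the source does not prescribe a specific one.

```python
from collections import Counter


def functional_profile(annotations):
    """Relative abundance of each function among annotated sequences.

    annotations: list with one function label per annotated sequence.
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {function: n / total for function, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for disjoint."""
    total = sum(p.values()) + sum(q.values())
    if total == 0:
        return 0.0
    shared = sum(min(p.get(f, 0.0), q.get(f, 0.0)) for f in set(p) | set(q))
    return 1.0 - 2.0 * shared / total


# Placeholder annotations for two hypothetical metagenomes.
sample_a = functional_profile(["func_A", "func_A", "func_B", "func_C"])
sample_b = functional_profile(["func_A", "func_B", "func_B", "func_D"])
dissimilarity = bray_curtis(sample_a, sample_b)
```

Computing such a dissimilarity for every pair of samples gives a matrix that can be used to group communities with similar functional composition.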

Gene Prediction and Functional Annotation

Gene prediction determines which metagenomic reads contain coding sequences. Once identified, coding sequences can be functionally annotated. Gene prediction can be conducted on assembled or unassembled metagenomic sequences. For assembled metagenomes with full-length coding sequences, gene prediction is akin to the framework used during the analysis of whole genome sequences, with the caveat that some prediction algorithms require species-specific parameters that may not always be appropriate when the contigs have been sampled from diverse or novel lineages.
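A very reduced picture of gene prediction is an open reading frame (ORF) scan over all six reading frames, sketched below. This is an assumption-heavy toy: it only reports complete ORFs that start with ATG and end at a stop codon, it uses no statistical model of coding sequence, and the `min_codons` threshold and example sequence are made up. Reads that carry only a fragment of a gene would be missed by this approach.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan six reading frames for ATG...stop open reading frames.

    Returns (strand, start, end, orf_sequence) tuples. Coordinates are
    0-based, end-exclusive, and refer to the scanned strand.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # first start codon opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    # Count codons before the stop codon.
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, start, pos + 3, s[start:pos + 3]))
                    start = None
    return orfs


# Made-up sequence containing one short ORF on the forward strand.
orfs = find_orfs("CCATGAAACCCGGGTAGTT")
```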


Once coding sequences in a metagenome are predicted, they can be subjected to functional annotation. The most common way this is accomplished is by classifying the predicted metagenomic proteins into protein families. A protein family is a group of evolutionarily related protein sequences, or subsequences in the case of protein domain families (e.g., Pfam; Finn et al., 2014). Families are usually characterized by comparing full-length protein sequences that have been identified through genome sequencing projects. Because the proteins in a family share a common ancestor, they are thought to encode similar biological functions. If a metagenomic sequence is determined to be a homolog of this family (i.e., it is classified as being a member of the family), then it is inferred that the sequence encodes the family’s function. Classification of an assembled or unassembled metagenomic protein sequence into a protein family usually requires comparing the metagenomic protein either to a database of protein sequences, each of which is designated as being a member of a family, or to a probabilistic model that describes the diversity of proteins in the family (e.g., a hidden Markov model, HMM). Once the metagenomic sequence has been compared to all proteins or all models, it can be classified into (1) a single family (e.g., the family with the best hit), (2) a series of families (e.g., all families that exhibit a significant classification score), or (3) no family, which suggests that the protein may be novel, highly diverged, or spurious. There are exceptions to this annotation framework, such as the gene recruitment procedure mentioned in the Gene Prediction section, though they are less commonly used.
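The three classification outcomes listed above can be expressed as a small decision rule. The sketch assumes each comparison has already produced an E-value per family; the function name, the `1e-5` significance cutoff, and the family labels are hypothetical.

```python
def classify_protein(hits, max_evalue=1e-5, mode="best"):
    """Assign a predicted protein to zero, one, or several families.

    hits: dict mapping a family name to the E-value of its best match.
    max_evalue: hypothetical significance cutoff.
    mode: "best" returns only the top family, "all" returns every
          significant family, ordered from most to least significant.
    """
    if mode not in ("best", "all"):
        raise ValueError("mode must be 'best' or 'all'")
    significant = {fam: e for fam, e in hits.items() if e <= max_evalue}
    if not significant:
        return []  # no family: possibly novel, highly diverged, or spurious
    ranked = sorted(significant, key=significant.get)
    return ranked[:1] if mode == "best" else ranked


# Made-up E-values for one predicted protein.
hits = {"family_A": 1e-30, "family_B": 1e-8, "family_C": 0.5}
single = classify_protein(hits, mode="best")
several = classify_protein(hits, mode="all")
unassigned = classify_protein({"family_C": 0.5})
```

Returning every significant family is useful for multi-domain proteins, while the best-hit mode gives each sequence at most one label.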