De Novo Genome Assembly

De novo genome sequencing and assembly is the method of choice for resolving the genetic makeup of an uncharacterized genome for which no prior reference or nucleotide sequence exists. With its high throughput, efficiency and speed, next-generation sequencing makes it possible to sequence a whole genome at high coverage. Sophisticated assembly algorithms are then applied to resolve the genomic sequence, which reveals gene structure and positioning.

There are several advantages that a resolved genome could provide:

• Reference-based assembly:
  – Assess quality of sequencing (re-sequencing)
  – Identify and annotate novel features, etc.

• De novo assembly:
  – Generate reference
  – Identify novel features
  – Annotate existing but un-annotated features

A typical genome assembly workflow is shown below. Its steps use various bioinformatics tools and algorithms to generate the final genome assembly and annotation.

Genome Assembly and Annotation

Several genome sequencing techniques are available, including:

– Short-read next-generation sequencing: Illumina and Ion Torrent
– Long-read next-generation sequencing: Pacific Biosciences and Oxford Nanopore

Each of these sequencing techniques has its pros and cons for genome assembly. Short reads are high quality and cost effective, and they provide deep sequencing coverage. However, they tend to show coverage bias in regions of high AT or GC content, most of which are repeats and low-complexity regions. Short read lengths and biased coverage in these regions result in fragmented genome assemblies that provide a partial yet critical overview of the genetic makeup of an organism. Most short-read assemblers adopt de Bruijn graph-based assembly. The figure below, adapted from Namiki et al., Nucleic Acids Research (2012), shows a typical de Bruijn graph assembly protocol.

De Bruijn Graph and Genome Assembly
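
The core idea of de Bruijn graph assembly can be illustrated with a minimal sketch. It assumes error-free reads and a single strand, and it ignores reverse complements, coverage weighting and graph simplification, all of which production assemblers handle. The function names, the toy reads and the choice of k = 4 are illustrative assumptions, not part of any particular assembler.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a de Bruijn graph as {(k-1)-mer: [successor (k-1)-mers]}.

    Each distinct k-mer in the reads contributes one edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # count each distinct k-mer once
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Spell a contig by following edges while each node has one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break  # stop on a cycle
        visited.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads drawn from the sequence ATGGCGTGCAAT
reads = ["ATGGCG", "GGCGTG", "CGTGCA", "TGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)  # ATGGCGTGCAAT
```

In this toy case every node has exactly one successor, so the walk recovers the full sequence. Repeats longer than k-1 bases create branching nodes, which is why short-read assemblies fragment in repetitive regions.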
Pacbio Genome Assembly

Long reads average more than 10 kb in length but are of lower quality, with randomly distributed errors. Long-read sequencing requires high-molecular-weight starting DNA, which at times requires expertise in sample extraction. In general, long-read assemblies have better contiguity, larger N50 values and higher genomic coverage than short-read assemblies. They do, however, require polishing with short reads to correct random base-calling errors.
Long-read assemblers use an OLC (Overlap-Layout-Consensus) approach to assemble the genome. Shown here is PacBio's assembly process for bacterial genomes, called the hierarchical genome assembly process (HGAP).
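
The overlap and layout steps of OLC can be sketched with a greedy merge of exact suffix-prefix overlaps. This is a deliberate simplification and not HGAP itself: it assumes error-free reads, so the consensus step is omitted, whereas real long-read assemblers compute approximate overlaps and build a consensus to correct errors. The function names and toy reads are illustrative assumptions.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def greedy_layout(reads, min_overlap=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        # Overlap step: score every ordered pair of reads
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no remaining overlaps; leave separate contigs
        # Layout step: join the best pair, keeping the overlap once
        merged = reads[i] + reads[j][olen:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads


# Hypothetical reads from the sequence GATTACAGGTTCATGC
toy_reads = ["GATTACAGG", "ACAGGTTCA", "TTCATGC"]
print(greedy_layout(toy_reads))  # ['GATTACAGGTTCATGC']
```

The all-against-all comparison is quadratic in the number of reads, which is acceptable for a sketch but is one reason real overlap-based assemblers rely on indexing to find candidate overlaps.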

The table below compares Illumina and PacBio bacterial assemblies. Clearly, long reads generate finished bacterial genomes that are ready to annotate.

Comparison of Illumina and PacBio bacterial assemblies

Utturkar et al. A Case Study into Microbial Genome Assembly Gap Sequences and Finishing Strategies. Frontiers in Microbiology. 2017;8:1272. doi:10.3389/fmicb.2017.01272.

A number of recent studies have used PacBio long reads and various assemblers for genome assembly. Some of the key studies include:

Organism | Common name | Technology | Assembly tool | Genome size | Contig N50 (Mb) | Scaffold N50 (Mb)
Taeniopygia guttata | Zebra Finch | PB | FALCON | 1.1 Gb | 5.8 | NA
Utricularia gibba | Carnivorous Plant | PB | HGAP3 | 82 Mb | 3.42 | NA
Vitis vinifera | Grapevine | PB | FALCON | 500 Mb | 2.39 | NA
Oreochromis niloticus | Nile Tilapia | PB+RH+RAD map | Canu | 815 Mb | 3.1 | NA
Gorilla gorilla | Gorilla | PB+BAC+Fosmids | FALCON | 3 Gb | 9.56 | 23.14
Lates calcarifer | Sea Bass | PB+OM+LM | PB+OM+LM | 700 Mb | 1.72 | 25.85
Capra hircus | Goat | PB+OM+HiC | PBcR | 2.9 Gb | 18.7 | 87.28
Euclidium syriacum | Mustard Family | PB+OM | FALCON | 262 Mb | 3.3 | 17.5
Zea mays | Maize | PB+OM | FALCON | 2.1 Gb | 1.19 | 9.56
Homo sapiens HX1 | Human | PB+OM | FALCON | 3.5 Gb | 8.3 | 22

PB, PacBio SMRT data; OM, Optical mapping data; LR, Linked reads; LM, Linkage maps.
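
The N50 values reported above follow a standard definition: sort the contigs (or scaffolds) from longest to shortest and find the length at which the running total first reaches half of the total assembly length. A minimal sketch of that calculation is shown here; the contig lengths are hypothetical values chosen for illustration, not data from any of the studies listed.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (any consistent unit, e.g. kb)
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 70
```

Here the total is 300, and the two longest contigs (80 + 70 = 150) already reach half of it, so the N50 is 70. A larger N50 indicates a more contiguous assembly, though it says nothing about correctness.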

More recently, Oxford Nanopore Technologies (ONT) sequencing has emerged as another long-read technology that is now actively used for genome assembly. ONT reads are similar to PacBio reads in average length, with slightly higher error rates. Illumina reads are used to error-correct ONT reads and assemblies, improving final base-call quality. Here are a few recent studies that used ONT for genome sequencing.

  • Nanopore sequencing and assembly of a human genome with ultra-long reads. 29 January 2018, Nature Biotechnology
  • High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. 14 June 2017, Nature Communications
  • Community-led comparative genomic and phenotypic analysis of the aquaculture pathogen Pseudomonas baetica a390T sequenced by Ion semiconductor and Nanopore technologies. 22 March 2018, FEMS Microbiology Letters
  • De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing. 3 May 2017, F1000
  • De novo Assembly of a New Solanum pennellii Accession Using Nanopore Sequencing. 21 April 2017, Plant Cell

1010Genome has developed a robust pipeline and the scientific expertise to handle any single-platform or hybrid approach to de novo genome assembly.

Organism | Genome size | NGS coverage | Assembly size | # of contigs | Contig N50
Bacteria | 2 Mb | Illumina (80x) | 1.96 Mb | 8 scaffolds |
Bacteria | 4.3 Mb | Illumina (100x) | 3.9 Mb | 21 scaffolds |
Fungal | 60 Mb | Illumina (250x) | 56.2 Mb | 6654 | 80.9
Bacteria | 4.3 Mb | PacBio (70x) | 4.3 Mb | Single closed plus plasmid |
Yeast | 12 Mb | PacBio (60x) | 12.3 Mb | 27 | 499 kb
Multi-nucleated Fungi | 42 Mb | PacBio (60x) | 42.5 Mb | 298 | 419 kb
Rice | 400 Mb | Illumina (80x) | 340 Mb | 630392 | 19.3 kb
Rice | 400 Mb | Illumina (80x) and PacBio (20x) | 390 Mb | 3800 | 1.2 Mb
Bean | 440 Mb | and PacBio (20x) | 435 Mb | 3007 | 0.9 Mb
Fish | 700 Mb | PacBio (90x) | 680 Mb | 3800 | 1.2 Mb
Insect | 900 Mb | PacBio (63x) | 860 Mb | 602 |
Fish | 1 Gb | PacBio (70x) | 955 Mb | 6230 | 1.4 Mb
Bacteria | 2 Mb | ONT (50x), Illumina (30x) | 1.99 Mb | Single closed plus plasmid | 1.4 Mb
Yeast | 12.5 Mb | ONT (60x), Illumina (30x) | 13 Mb | 75 | 401 kb
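
Coverage values such as 80x describe how many times, on average, each base of the genome is sequenced: total sequenced bases divided by genome size. The short sketch below shows that arithmetic and its inverse for planning a run. The function names and the 150 bp read length are assumptions chosen for illustration, not details of the projects listed above.

```python
import math


def expected_coverage(num_reads, mean_read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * mean_read_length / genome_size


def reads_needed(target_coverage, mean_read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / mean_read_length)


# Assumed example: a 2 Mb bacterial genome and 150 bp reads
genome_size = 2_000_000
read_length = 150

print(expected_coverage(1_000_000, read_length, genome_size))  # 75.0
print(reads_needed(80, read_length, genome_size))              # 1066667
```

This is an average only. As noted for short reads, actual depth can vary across the genome, for example in regions of extreme AT or GC content.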

Convinced that we can handle genome assemblies?