DNA trace assemblyDNA sequence assembly
Welcome to DNA BASER’s official web siteFeatures and performancesScreen shots-DNA Baser Sequence AssemblerOrder your copy nowDNA sequencing info and newsDownload a full working version right nowContact us if you need more information
sequence assembly
sequence assembly

Introduction to genome sequence assembly

 

 

 

Table of Contents

 

 

Introduction in DNA sequencing

Deoxyribonucleic acid (DNA) is sequence of molecules that contains the genetic information used in the development and functioning of all known living organisms. The main purpose of DNA is the long-term storage of information.
All information that defines a living organism is encoded this string of DNA molecules. There are four basic types of molecules, represented by four letters in our alphabet: A, C, T, G. These molecules are called nucleotides or bases. The A molecule is attracted by the C molecule and T molecule is attracted by the G molecule. They are called complementary molecules.
In most organisms DNA is structured as two strings twisted around each other (also called strands) into a double-helix and held together by bonds between complementary nucleotides A-T and C-G.  
The two strands have a natural orientation defined by their chemical structure and are twisted together in opposite direction.  Both strands, thus, contain the same information and the sequence of one strand can be obtained from the sequence of the other strand by reverse complementation, namely by reversing it's sequence and then replacing each nucleotide with its complement (replacing each A with a T, each G with a C and so on).  This characteristic of DNA is essential to life as it allows the replication of the DNA molecule - an important element of our ability to pass genetic information to our offspring. 

 

DNA sequencing and assembly

The process of decoding (reading) a fragment of DNA is called sequencing. It was invented in 1975 by Frederick Sanger which developed the sequencing technology that is still used today: the shotgun DNA sequencing.

Even if the shotgun DNA sequencing has been continuously improved over the time it has a major drawback: it can cannot read more than 1000-2000 base-pairs (nucleotides) of DNA at a time. This can be a serious limitation if we consider that even the simplest viruses contain tens of thousands of base-pairs, bacteria contain millions, and mammalian genomes contain billions of base-pairs. 

However, the biologists developed a technique to sheared the DNA sequence of an organism into a large number of small DNA fragments. This technique is called shotgun sequencing.

 

This technique has three steps:

Step 1 - Original DNA is broken into a collection of fragments

Step 2 - All DNA fragments are sequenced independently as in the figure bellow

Step 3 - With the aid of computer, using molecular biology software called DNA sequence assemblers, the resulting sequences are joined together

 

genome sequence assembly

Assembly statistics

The assembler relies on the basic assumption that two sequences reads (two strings of letters produced by the sequencing machine) that share a same string of bases originated from the same place in the genome (Figure 3).  Using such overlaps between the sequences, the DNA sequence assembler software can join the sequences back together. 
The shotgun sequencing process is only possible once enough sequences are generated to cover the genome 8 to 10 times. 
Intuitively, this phenomenon can be understood by thinking of a sidewalk as it begins to rain.  As raindrops fall randomly across the sidewalk, dry spots persist for quite a while, corresponding to regions of the genome that are not represented in the set of shotgun reads.  Mathematically, this phenomenon was modeled by Eric Lander and Michael Waterman in 1988.  They examined the correlation between the oversampling of the genome (also called coverage) and the number of contiguous pieces of DNA (commonly called contigs) that can be re-constructed by an idealized assembly program.  Figure 4 shows a plot of the Lander-Waterman equation for a genome of 1Mbp (mega base pairs = 1,000,000 base pairs).  Between 8 and 10-fold coverage the model predicts that most of the genome will be assembled into a small number of contigs (approx. 5 for a 1Mbp genome). 

Assembly statistics

Figure 4.  Lander-Waterman estimation of number of contigs w.r.t. genome coverage

 

 

Assembly challenges: non-random shearing and repeats

Ideally, an assembly program should produce one contig for every chromosome of the genome being sequenced.  In all but the simplest cases, however, many contigs are produced due to a combination of factors.  Even at 8-10 fold coverage, there is a non-zero probability that some portion of the genome remains unsequenced.  More importantly, however, the distribution of the sheared fragments along the genome cannot be modeled as a perfect Poisson process.  Sanger sequencing requires many copies of each fragment in order for the sequencing chemistry to be possible.  Each shotgun fragment must be cloned, a procedure usually performed by inserting the fragment into the cell of the Escherichia coli bacterium (called a vector) and allowing this bacterium to grow, thereby replicating the fragment as E. coli replicates its own genome.  In most genomes, certain regions are toxic to the E. coli bacterium, leading to the presence of gaps in the coverage.

The ability of an assembly program to produce a single contig is also limited by regions of the genome that occur in multiple near-identical copies throughout the genome (repeats).  The reads originating from different copies of a repeat appear identical to the assembler and cause assembly errors.  A simple example is shown in Figure 5, where the assembler incorrectly collapses the two copies of repeat A leading to the creation of two contigs instead of one (Figure 6).

Assembly challenges: non-random shearing and repeat
Figure 5. Two copies of a repeat along a genome.  The reads colored in red and those colored in yellow appear identical to the assembly program.

DNA assembly challenges
Figure 6.  Genome mis-assembled due to a repeat.  The assembly program incorrectly combined the reads from the two copies of the repeat leading to the creation of two separate contigs

 

Scaffolding

The contigs produced by an assembly program can be ordered and oriented along a chromosome using additional information contained in the shotgun data.  In most sequencing projects, the sizes of the fragments generated through the shotgun process are carefully controlled, thus providing a link between the sequence reads generated from the ends of a same fragment (called paired ends or mate pairs).  In a typical shotgun project, multiple libraries -- collections of fragments of similar sizes -- are usually generated, providing the assembler with additional constraints: within the assembly the paired end reads must be placed at a distance consistent with the size of the library from which they originate and must be oriented towards each other.  Within an assembly each read is assigned an orientation corresponding to the DNA strand from which the read was generated.  The constraints provided by mate pairs lead to constraints on the relative order and orientation of the contigs (Figure 7).  The process through which the read pairing information is used to order and orient the contigs along a chromosome is called scaffolding. 


DNA scaffolding
Figure 7.  A scaffold of 3 contigs (the thick arrows) held together by mate pairs.  Thin lines connect the paired ends.

 

Finishing

The ultimate goal of any sequencing project is to determine every single base-pair of the original set of chromosomes.  As we described above, rarely is an assembly program able to reconstruct a single piece of DNA per chromosome, leading to gaps in the reconstruction of the genome.  These gaps are filled in through directed sequencing experiments in a process called finishing or gap closure.  At this stage in the sequencing project, additional laboratory experiments and extensive manual curation are performed to validate the correctness of the final assembly, leading to a high-quality reconstruction of the original genome.

 

Assembly algorithms

The many assembly programs available to researchers differ in the details of their implementation and of the algorithms employed, however they all primarily fall into four general categories:

Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other.  An example is shown in Figure 8, where the assembler joins, in order,  reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap  = 150 bp), then reads 2 and 3 (overlap = 50 bp) thereby creating a single contig from the four reads provided in the input.  One disadvantage of the simple greedy approach is that because local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.

suquence assembly algorithms
Figure 8.  Greedy assembly of four reads.

Overlap-layout-consensus - The relationships between the reads provided to an assembler can be represented as a graph, where the nodes represent each of the reads and an edge connects two nodes if the corresponding reads overlap.  The assembly problem thus becomes the problem of identifying a path through the graph that contains all the nodes - a Hamiltonian path (Figure 9).  This formulation allows researchers to use techniques developed in the field of graph theory in order to solve the assembly problem.  An assembler following this paradigm starts with an overlap stage during which all overlaps between the reads are computed and the graph structure is computed.  In a layout stage, the graph is simplified by removing redundant information.  Graph algorithms are then used to determine a layout (relative placement) of the reads along the genome.  In a final consensus stage, the assembler builds an alignment of all the reads covering the genome and infers, as a consensus of the aligned reads, the original sequence of the genome being assembled.

Figure 9.  Overlap graph for a bacterial genome.  The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right).  The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)

Eulerian path  -  Eulerian path approaches are based on early attempts to sequence genomes through a technique called sequencing by hybridization.  In this technique, instead of generating a set of reads, scientists identified all strings of length k (k-mers) contained in the original genome.   While this experimental method did not produce a viable alternative to Sanger sequencing, it led to the development of an elegant approach to sequence assembly.  This approach, also based on a graph-theoretic model, breaks up each read into a collection of overlapping k-mers.  Each k-mer is represented in a graph as an edge connecting two nodes corresponding to its k-1 bp prefix and suffix respectively.  It is easy to see that, in the graph containing the information obtain from all the reads, a solution to the assembly problem corresponds to a path in the graph that uses all the edges - an Eulerian path.  One advantage of the Eulerian approach is that repeats are immediately recognizable while in an overlap graph they are more difficult to identify. (figures to be added)

Align-layout-consensus -  As more and more genomes become available in public databases, it is increasingly the case that a completed genome exists that is closely related to the genome being assembled.  The assembly problem thus becomes easier as the relative placement of reads can be inferred from their alignment to the related genome (or reference), in a process called comparative assembly.  Thus, the overlap stage of assembly (often one of the most computationally intensive assembly tasks) is replaced by an alignment step.  The layout stage is also greatly simplified due to the additional constraints provided by the alignment to the reference.

BAC-by-BAC (hierarchical) sequencing - In order to avoid some of the complexity involved in assembling large genomes, scientists developed a hierarchical approach.  First, the genome is broken up into a collection of large fragments (between 40 and 200 kbp) called Bacterial Artificial Chromosomes or BACs.  The BACs location along the genome is then mapped using specialized laboratory experiments.  A minimal tiling path of BACs is chosen such that each base in the genome is covered by at least one BAC, and the overlap between BACs is minimized.  Each BAC is then sequenced through the standard shotgun method, the resulting assemblies being combined into an assembly for each chromosome using the information provided by the tiling paths (Figure 10).


BAC-by-BAC approach
Figure 10. BAC-by-BAC approach.  The long lines represent individual BACs.  The minimal tiling path is represented by thick lines.  Each BAC in the tiling path is then sequenced through the shotgun method.

Sequence alignment/Representing assemblies

A description of how assemblies are represented in some of the most common assembly format is presented here

 

 

Recommended assembly programs

DNA Baser Sequence Assembler

DNA sequence assembly and phylogenetic tree software

 

 

 

sequence assembly
contig assembly software