Settings window. General settings

Assembly settings

Here you can instruct DNA Baser how to assemble your samples and how to clean their untrusted regions. If the quality of your samples is average or good, the default assembly parameters will give good results, so you usually don't have to change them.

Automatic setup

We have tried to make our product as easy to use as possible. In most cases you won't have to manually tweak the assembly parameters. Just use the "Optimize for samples with" function below and DNA Baser will choose the appropriate parameters for the Trimming Engine and Assembler Engine.

assembly settings - preset parameters

Manual setup

If your samples are of low quality, DNA Baser may clean them too much, and the overlapping region between two sequences will be shorter than 25 bases (this is the default value for the 'Minimum overlap' parameter). Also, for low quality samples, sequencing errors may result in low identity between the sequences. In both cases the sequences will either not assemble correctly (you will see lots of mismatches) or not assemble at all. When this happens you will need to manually adjust the Trimming Engine and Assembler Engine parameters. Relax these parameters and try again.

ASSEMBLER ENGINE

The parameters of the ASSEMBLER ENGINE are word size, identity percent and minimum overlap (see figure above). These settings are very important and changing them will greatly affect the accuracy of the assembly process.

WORD SIZE (window size) - for two sequences to enter the assembly process, they need to have in common a region of perfect identity. WORD SIZE represents the minimum length of this region.

IDENTITY PERCENT - for two sequences to form a contig, they need to have an overlapping region. The IDENTITY PERCENT represents the minimum percentage of identity that this region must have.

MINIMUM OVERLAP - for two sequences to form a contig, they need to have an overlapping region. The MINIMUM OVERLAP represents the minimum length that this overlapping region must have. The value of the MINIMUM OVERLAP cannot be lower than that of the WORD SIZE, so the software automatically adjusts it whenever the WORD SIZE value changes.
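To make the interplay of the three parameters more concrete, here is a minimal Python sketch. It is not DNA Baser's actual code: the names `AssemblerSettings`, `shares_word` and `overlap_accepted` are hypothetical, the default values for word size and identity percent are placeholders chosen for illustration (only the minimum overlap of 25 comes from this manual), and the overlap is assumed to be already aligned without gaps.

```python
from dataclasses import dataclass


@dataclass
class AssemblerSettings:
    """Hypothetical container for the three ASSEMBLER ENGINE parameters."""
    word_size: int = 12             # placeholder value, not the real default
    identity_percent: float = 80.0  # placeholder value, not the real default
    minimum_overlap: int = 25       # default value stated in this manual

    def set_word_size(self, value: int) -> None:
        # MINIMUM OVERLAP cannot be lower than WORD SIZE, so it is raised
        # whenever the new word size exceeds it (this sketch never lowers it).
        self.word_size = value
        if self.minimum_overlap < value:
            self.minimum_overlap = value


def shares_word(a: str, b: str, word_size: int) -> bool:
    """True if a and b share at least one identical stretch of word_size bases."""
    words = {a[i:i + word_size] for i in range(len(a) - word_size + 1)}
    return any(b[j:j + word_size] in words
               for j in range(len(b) - word_size + 1))


def overlap_accepted(overlap_a: str, overlap_b: str,
                     settings: AssemblerSettings) -> bool:
    """Check a gap-free, equal-length overlap against the settings."""
    if len(overlap_a) != len(overlap_b):
        raise ValueError("overlap regions must have the same length")
    length = len(overlap_a)
    if length < settings.minimum_overlap:
        return False  # overlap too short
    matches = sum(x == y for x, y in zip(overlap_a, overlap_b))
    return 100.0 * matches / length >= settings.identity_percent
```

In this sketch, lowering `identity_percent` or `minimum_overlap` corresponds to relaxing the parameters: more overlaps pass the check, at the price of accepting more mismatches.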

TRIMMING ENGINE

DNA Sequence Assembler automatically removes the untrusted regions from sample files whenever it imports the samples from disk. A video tutorial is available showing how the automatic trimming of the untrusted chromatogram regions works.

Setting the parameters of the TRIMMING ENGINE

There are three parameters that determine how much of the ends of the chromatograms will be recognized as untrusted regions (see figure below). The first is the percentage of good bases, the second is the window size and the third is the threshold confidence score.

The threshold confidence score sets the confidence value from which a base is considered correctly recognized by the base caller (a "good" base). The bad end recognition algorithm moves along the sequence in fixed-size units of bases, called windows. The size of these units can be changed by the user, as indicated in the figure above. When the percentage of good bases in such a window is lower than the configured percentage of good bases, the window is marked as untrusted. The window is moved along the sequence until the percentage of good bases in it is at least equal to the configured percentage. When this happens, the software checks the first bases in the window and marks them as untrusted until the first good base is found.

DNA Sequence Assembler creates an imaginary window (see the light-blue rectangle in the picture below) and places it at the beginning of the sequence. This window is 18 bases wide. If at least 75% of the bases inside this window are good, DNA Baser has found the first high quality region in your sample and stops the trimming process. If the condition is not met, it moves the window one base to the right and repeats the process until the condition is met.

The figure below shows an example of how the trimming algorithm works. In this case, the window size is 10 bases, the percentage of good bases is 75 and the confidence score threshold is 25. Trusted bases are marked in green, while untrusted bases are marked in red. All bases to the left of the vertical line will be marked as untrusted.

automatic trimming engine for DNA sequences
Fig 1. The trimming algorithm in action.
The window is marked in light blue, bad bases are marked in red; good (trusted) bases are marked in green.
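
The sliding-window procedure described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual Trimming Engine: the function name `trim_start` is hypothetical, a base is assumed to be "good" when its confidence score is greater than or equal to the threshold, and a read shorter than one window is assumed to be entirely untrusted. The default arguments use the 18-base window and 75% values mentioned above; the threshold of 25 is taken from the figure example.

```python
def trim_start(scores, window_size=18, good_percent=75.0, threshold=25):
    """Return how many leading bases would be marked as untrusted.

    scores: list of per-base confidence scores, in read order.
    """
    n = len(scores)
    if n < window_size:
        return n  # assumption: too short to evaluate, treat as untrusted

    for start in range(n - window_size + 1):
        window = scores[start:start + window_size]
        good = sum(q >= threshold for q in window)
        if 100.0 * good / window_size >= good_percent:
            # First high quality window found. Bad bases at its start
            # are still untrusted, so skip them up to the first good base.
            while start < n and scores[start] < threshold:
                start += 1
            return start

    return n  # no window met the condition: the whole read is untrusted


# Example with the figure's settings: window 10, 75% good, threshold 25.
example = [10, 12, 30, 8, 9, 40, 11, 35, 50, 45,
           48, 52, 60, 55, 58, 61, 59, 57, 62, 60]
untrusted = trim_start(example, window_size=10, good_percent=75.0, threshold=25)
```

The same function could be applied to the other end of the read by passing the reversed score list, for example `trim_start(scores[::-1])`. Lowering `good_percent` or `threshold` makes the trimming less aggressive, which is what relaxing the Trimming Engine parameters means in practice.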

Error correction

DNA Baser Assembler has an internal algorithm that allows it to automatically make decisions based on the confidence scores of the peaks in your chromatogram files. If the confidence score information is missing, DNA Baser will consider all peaks as having maximum quality (100). The confidence score information is important for both the trimming engine and error correction. If your sequencing machine is able to produce both SCF and ABI files, we strongly recommend that you use the SCF format instead of ABI.
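
The practical consequence of missing confidence scores can be illustrated with a short Python sketch. The helper `effective_scores` is hypothetical and only mirrors the rule stated above (missing scores are treated as the maximum quality, 100); it is not part of DNA Baser.

```python
MAX_QUALITY = 100  # maximum quality, as stated in this manual


def effective_scores(bases, scores=None):
    """Return one confidence score per base.

    If the sample carries no confidence scores (scores is None),
    every base is treated as having maximum quality.
    """
    if scores is None:
        return [MAX_QUALITY] * len(bases)
    if len(scores) != len(bases):
        raise ValueError("expected exactly one score per base")
    return list(scores)


# A sample without score information: every base looks perfect,
# so no quality-based trimming or correction can take place.
no_info = effective_scores("ACGTNNACGT")
```

Because every base then passes any confidence threshold, the low quality ends of such a sample are not recognized as untrusted, which is why samples that include confidence scores are preferable.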

More assembly parameters

Other parameters important for sequence assembly are the ASSEMBLY SCORES. These parameters can be accessed from the SETTINGS WINDOW, ASSEMBLER tab.

SciVance Technologies
