How to dereplicate sequences in a Fasta file? Sequence dereplication of FASTA file via clustering

Fasta Sequence Dereplicator

The way bioinformatics programs should be...

Fasta Sequence Dereplicator is a Windows tool that allows you to dereplicate your sequences via sequence clustering.

Fasta Sequence Dereplicator is a graphic interface on top of the CD-HIT-EST program.

About CD-Hit Sequence Dereplicator

CD-HIT is a bioinformatics tool for clustering and comparing protein or nucleotide sequences (FASTA). CD-HIT was originally developed by Dr. Weizhong Li.

CD-HIT uses a fast clustering algorithm and can handle extremely large databases. It significantly reduces the effort required in many sequence analysis tasks, and it helps in understanding the structure of a dataset and correcting the bias within it.

The CD-HIT package includes several tools (listed below), but for the moment we offer a graphic interface only for CD-HIT-EST:

CD-HIT-EST clusters similar DNAs into clusters that meet a user-defined similarity threshold.
CD-HIT clusters similar proteins into clusters that meet a user-defined similarity threshold.
CD-HIT-2D (CD-HIT-EST-2D) compares 2 datasets and identifies the sequences in db2 that are similar to db1 above a threshold.
CD-HIT-454 identifies natural and artificial duplicates from pyrosequencing reads.
CD-HIT-OTU clusters rRNA tags into OTUs
CD-HIT-DUP identifies duplicates from single or paired Illumina reads
CD-HIT-LAP identifies overlapping reads
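
For readers who want to see what a graphic front end has to hand over to the underlying program, here is a minimal sketch in Python that assembles a cd-hit-est command line. The options -i (input file), -o (output file), -c (sequence identity threshold) and -n (word size) are cd-hit-est options; the function name, the file names and the chosen values are illustrative assumptions. This is not a description of how Fasta Sequence Dereplicator itself calls the program.

```python
import shlex


def build_cd_hit_est_command(input_fasta, output_fasta, identity=0.95, word_size=10):
    """Assemble a cd-hit-est argument list. This only builds the command; it does not run it.

    The function name and defaults are hypothetical, chosen for this example.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a fraction in (0, 1]")
    return [
        "cd-hit-est",
        "-i", input_fasta,         # input FASTA file
        "-o", output_fasta,        # output FASTA file of representative sequences
        "-c", f"{identity:.2f}",   # user-defined similarity threshold
        "-n", str(word_size),      # word size; check the CD-HIT guide for values suited to the threshold
    ]


cmd = build_cd_hit_est_command("library.fasta", "library_derep.fasta", identity=0.97)
print(shlex.join(cmd))
# prints: cd-hit-est -i library.fasta -o library_derep.fasta -c 0.97 -n 10
```

If cd-hit-est is installed and on the PATH, the list could be passed to subprocess.run(cmd, check=True). Suitable word sizes depend on the identity threshold, so the value above should be checked against the CD-HIT user guide for your version.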

How does sequence dereplication work?

A sequence dereplication tool will:

compare all the sequences in a data set (Fasta file) to each other
group similar sequences together
output a representative sequence from each group.

This way, duplicate sequences are removed from a library. Our Fasta Sequence Dereplicator program simplifies the dereplication of 16S rDNA sequence libraries and prepares the raw sequences for subsequent analyses.
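
The three steps listed above (compare, group, output a representative) can be illustrated with a small Python sketch. It is a toy, not the CD-HIT algorithm: it processes sequences from longest to shortest and uses difflib's SequenceMatcher ratio as a stand-in for a real alignment identity, which is an assumption made purely to keep the example self-contained. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def similarity(a, b):
    """Toy similarity in [0, 1]; a stand-in for a real alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def dereplicate(records, threshold=0.95):
    """Greedy grouping: each sequence joins the first representative it matches
    at or above the threshold; otherwise it becomes a new representative."""
    clusters = []  # list of ((rep_header, rep_sequence), [member headers])
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        for rep, members in clusters:
            if similarity(rep[1], seq) >= threshold:
                members.append(header)
                break
        else:
            clusters.append(((header, seq), [header]))
    return clusters


fasta = """>seq1
ACGTACGTACGTACGTACGT
>seq2
ACGTACGTACGTACGTACGT
>seq3
TTTTGGGGCCCCAAAATTTT
"""

clusters = dereplicate(parse_fasta(fasta), threshold=0.95)
for rep, members in clusters:
    print(rep[0], members)
```

In this example seq2 is an exact copy of seq1, so it is absorbed into the cluster represented by seq1, while seq3 starts a cluster of its own. Comparing every sequence against every representative in this way is slow on large libraries, which is why a dedicated tool such as CD-HIT-EST is used in practice.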

The input is a DNA sequence dataset in FASTA format, and the output consists of two files: a FASTA file of representative sequences and a text file listing the clusters.
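
The cluster list can be post-processed with a few lines of Python. The sketch below assumes the usual CD-HIT cluster file layout: a ">Cluster N" line opens each cluster, each member line contains ">name...", and the representative sequence is marked with a trailing "*". The sample text is made up for illustration, and the layout should be checked against the file your version actually produces.

```python
import re

# Matches the ">name..." part of a member line and captures whatever follows it.
MEMBER_RE = re.compile(r">(?P<name>.+?)\.\.\.\s*(?P<rest>.*)$")


def parse_clstr(text):
    """Return {cluster_id: {"representative": name or None, "members": [names]}}."""
    clusters, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = {"representative": None, "members": []}
            continue
        match = MEMBER_RE.search(line)
        if match is None or current is None:
            continue  # skip lines that do not look like member entries
        name = match.group("name")
        clusters[current]["members"].append(name)
        if match.group("rest").strip() == "*":
            clusters[current]["representative"] = name
    return clusters


# Hypothetical sample in the assumed layout (fields separated by a tab).
sample = """>Cluster 0
0\t20nt, >seq1... *
1\t20nt, >seq2... at +/100.00%
>Cluster 1
0\t20nt, >seq3... *
"""

parsed = parse_clstr(sample)
for cluster_id, info in parsed.items():
    print(cluster_id, info["representative"], len(info["members"]))
```

A summary like this makes it easy to see how many sequences each representative stands for.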

Features

Free
Light, monolithic & portable (requires no installation, works under guest/limited accounts)
Requires no additional add-ons (Java, .Net, etc)
Easy to use/configurable GUI
Save/remember GUI state
Supported files: (multi) Fasta
Works on Windows XP, Windows Vista, Windows 7, Windows 8, Windows 10

Related tools

NCBI BLAST DB Downloader is a freeware biology software tool that automates the NCBI BLAST DB download process.
NCBI Blaster (aka BLAST Robot) is a software tool that automates the NCBI BLAST search processes.
SFF/FastQ Sequence Workbench is an efficient and easy to use FastQ/SFF file viewer, editor, filter and converter.

DNA Fasta sequence dereplication via clustering

Download

Fasta Sequence Dereplicator
Version	1.2
Date	February 2016
Package size	~ 1.8 MB
Download time	less than 10 seconds
Fasta Sequence Dereplicator is now part of the Avalanche NextGen package. Please download this package.

Requirements

A few MB of disk space
Less than 30 MB of free RAM
No Java, no .Net

Portability

Fasta Sequence Dereplicator is really small, so you can easily copy it to a floppy disk or USB flash stick and take it with you, or send it to your colleagues via email.

Plans for next version

Job summary (graphic reports)
Multi-threading

Feedback/News list

Please let us know how to make this tool better.

User manual