Welcome to DNA BASER’s official web site

Fasta Sequence Dereplicator

The way bioinformatics programs should be...





Fasta Sequence Dereplicator is a Windows tool that allows you to dereplicate your sequences via sequence clustering.

Fasta Sequence Dereplicator is a graphical interface on top of the CD-HIT-EST program.


About CD-HIT


CD-HIT is a bioinformatics tool for clustering and comparing protein or nucleotide sequences (FASTA). CD-HIT was originally developed by Dr. Weizhong Li.

CD-HIT uses a fast clustering algorithm and can handle extremely large databases. It significantly reduces the effort required in many sequence analysis tasks, and it helps in understanding the structure of a dataset and correcting bias within it.

The CD-HIT package includes many other tools, but for the moment we offer a graphical interface only for CD-HIT-EST:

  • CD-HIT-EST clusters similar DNA sequences into clusters that meet a user-defined similarity threshold.
  • CD-HIT clusters similar proteins into clusters that meet a user-defined similarity threshold.
  • CD-HIT-2D (CD-HIT-EST-2D for DNA) compares two datasets and identifies the sequences in db2 that are similar to db1 above a threshold.
  • CD-HIT-454 identifies natural and artificial duplicates from pyrosequencing reads.
  • CD-HIT-OTU clusters rRNA tags into OTUs
  • CD-HIT-DUP identifies duplicates from single or paired Illumina reads
  • CD-HIT-LAP identifies overlapping reads


How does sequence dereplication work?


A sequence dereplication tool will:

  1. compare all the sequences in a data set (Fasta file) to each other
  2. group similar sequences together
  3. output a representative sequence from each group.
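The three steps above can be illustrated with a minimal Python sketch. It is not how CD-HIT-EST is implemented: the function names (`parse_fasta`, `similarity`, `dereplicate`) are hypothetical, the similarity score comes from Python's `difflib` rather than a real sequence alignment, and each sequence is compared only against the current cluster representatives instead of against every other sequence. It is meant only to show the idea of grouping similar sequences and keeping one representative per group.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Parse (multi) FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def similarity(a, b):
    """Crude similarity in [0, 1]; an illustrative stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def dereplicate(records, threshold=0.95):
    """Greedy clustering sketch (assumed strategy, for illustration only).

    Sequences are visited from longest to shortest. Each one joins the first
    cluster whose representative it matches at or above `threshold`;
    otherwise it becomes the representative of a new cluster.
    Returns a list of (representative_record, member_records) pairs.
    """
    clusters = []
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        for rep, members in clusters:
            if similarity(rep[1], rec[1]) >= threshold:
                members.append(rec)
                break
        else:
            clusters.append((rec, [rec]))
    return clusters


# Made-up example data: seqB is an exact duplicate of seqA.
example = """>seqA
ACGTACGTACGTACGTACGT
>seqB
ACGTACGTACGTACGTACGT
>seqC
TTTTTGGGGGCCCCCAAAAA
"""

clusters = dereplicate(parse_fasta(example), threshold=0.95)
for rep, members in clusters:
    print(rep[0], "represents", [m[0] for m in members])
```

A higher threshold keeps more sequences as separate representatives; a lower one merges more aggressively.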

This way, duplicate sequences are removed from a library. Our Fasta Sequence Dereplicator program simplifies the dereplication of 16S rDNA sequence libraries and prepares the raw sequences for subsequent analyses.

The input is a DNA dataset in FASTA format, and the output consists of two files: a FASTA file of representative sequences and a text file listing the clusters.
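As a rough illustration of working with the cluster list, the sketch below reads a cluster file and collects the members of each cluster. It assumes the text file follows the usual CD-HIT `.clstr` layout (a `>Cluster N` line followed by one line per member, with `*` marking the representative); if the file written by this tool differs, the pattern would need adjusting. The function name `parse_clusters` and the sample content are made up for the example.

```python
import re

# One member line, e.g. "0<TAB>120nt, >seq1... *" or "1<TAB>118nt, >seq2... at +/98.31%"
# (layout assumed from the CD-HIT .clstr format).
MEMBER = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clusters(text):
    """Return {cluster_name: {"representative": str or None, "members": [str, ...]}}."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = {"representative": None, "members": []}
            continue
        match = MEMBER.match(line)
        if match is None or current is None:
            continue  # skip lines that do not fit the assumed layout
        _length, name, status = match.groups()
        clusters[current]["members"].append(name)
        if status == "*":
            clusters[current]["representative"] = name
    return clusters


# Made-up sample in the assumed layout.
sample = (
    ">Cluster 0\n"
    "0\t120nt, >seq1... *\n"
    "1\t118nt, >seq2... at +/98.31%\n"
    ">Cluster 1\n"
    "0\t95nt, >seq3... *\n"
)

parsed = parse_clusters(sample)
for name, info in parsed.items():
    print(name, info["representative"], info["members"])
```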



  • Free
  • Light, monolithic & portable (requires no installation, works under guest/limited accounts)
  • Requires no additional add-ons (Java, .Net, etc)
  • Easy to use/configurable GUI
  • Save/remember GUI state
  • Supported files: (multi) Fasta
  • Works on Windows XP, Windows Vista, Windows 7, Windows 8, Windows 10

Related tools

  • NCBI BLAST DB Downloader is a freeware biology software tool that automates the NCBI BLAST DB download process.
  • NCBI Blaster (aka BLAST Robot) is a software tool that automates the NCBI BLAST search processes.
  • SFF/FastQ Sequence Workbench is an efficient and easy to use FastQ/SFF file viewer, editor, filter and converter.




Fasta Sequence Dereplicator
Version: 1.2
Date: February 2016
Package size: ~1.8 MB
Download time: less than 10 seconds
Fasta Sequence Dereplicator is now part of the Avalanche NextGen package. Please download this package.



  • Few MB of disk space
  • < 30MB free RAM
  • No Java, no .Net



Fasta Sequence Dereplicator is really small, so you can easily copy it to a floppy disk or USB flash stick and take it with you, or send it to your colleagues via email.


Plans for next version

  • Job summary (graphic reports)
  • Multi-threading


Feedback/News list


This tool is part of our NextGen SFF/FastQ analysis software. Please let us know how to make it better.


