Hi all,
Which tool is better for clustering Illumina (FASTQ) sequences: CD-HIT or UCLUST? I have 2 billion sequences.
Thanks, Deeps
Since you said in the comments that your goal is to remove duplicate reads, I'm going to address that issue. There is no need for clustering. If you are working with a reference genome, you should map your reads to the reference and treat reads that map to the same genomic coordinate as duplicates. For example, you could use the "samtools rmdup" command. If you do not have a reference genome, you will most likely want to use a de novo assembly program based on de Bruijn graphs, in which case you would correct sequencing errors at the k-mer level, using something like khmer.
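To make the coordinate-based idea concrete, here is a minimal Python sketch. It is not how "samtools rmdup" is implemented; it only illustrates the principle of treating reads that share a mapping position as duplicates. The `Alignment` record, the function name, and the rule of keeping the read with the highest mapping quality are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, simplified record for one mapped read."""
    name: str
    chrom: str
    pos: int      # leftmost mapped coordinate
    strand: str   # "+" or "-"
    mapq: int     # mapping quality


def drop_coordinate_duplicates(alignments: Iterable[Alignment]) -> list[Alignment]:
    """Keep one read per (chrom, pos, strand), preferring the highest mapq.

    Assumption: reads sharing chromosome, start position, and strand are
    duplicates. Real tools also consider mate positions for paired reads.
    """
    best: dict[tuple[str, int, str], Alignment] = {}
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand)
        kept = best.get(key)
        if kept is None or aln.mapq > kept.mapq:
            best[key] = aln
    # Sort so the result does not depend on input order.
    return sorted(best.values(), key=lambda a: (a.chrom, a.pos, a.strand))


reads = [
    Alignment("r1", "chr1", 100, "+", 30),
    Alignment("r2", "chr1", 100, "+", 60),  # same coordinate as r1, better mapq
    Alignment("r3", "chr1", 100, "-", 40),  # opposite strand, so kept separately
    Alignment("r4", "chr2", 5, "+", 20),
]
deduped = drop_coordinate_duplicates(reads)
print([a.name for a in deduped])
```

In practice you would run an established tool on a sorted BAM file rather than roll your own, especially with 2 billion reads.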
For the whole-genome or whole-transcriptome applications that Illumina sequencing is typically used for, clustering reads by sequence identity is not likely to produce a useful result. In most cases the reads are expected to be tiled along much longer fragments of DNA, so a clustering approach would somewhat arbitrarily segment that longer stretch into poorly separated "clusters" that each represent roughly one read length of the larger fragment.
"Clustering is not likely to produce a useful result for short reads"
This is highly dependent on what type of data someone has. What you explained here assumes "whole genome shotgun" sequencing, and you are absolutely correct that clustering raw data would be worthless and misleading. But if someone has amplicon data from a metagenome or a population study, it's absolutely essential to cluster short reads, or to do some sort of analogous data comparison.
I meant the biological nature, not the technical specifications. It matters whether it's RNA-seq, genomic DNA-seq, exome-seq, ChIP-seq, RIP-seq, metagenome sequencing, etc. Clustering might produce a useful result for some of those, but definitely not others. It doesn't really matter so much what the read length is or how many reads there are. What biological question are you trying to answer, and why do you think that clustering by sequence identity is the way to answer it?
What do you hope to accomplish by clustering?
Removing duplicate reads.
My sequences are nucleotide sequences. I have past experience using UCLUST to cluster 454 sequences. I believe UCLUST won't take FASTQ files as input. Which CD-HIT program should I use? Can FASTQ files be given directly to CD-HIT?
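If a tool turns out to accept only FASTA, one option is to convert the reads first. Below is a minimal Python sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); the function name is made up for illustration. Whether a given version of CD-HIT or UCLUST reads FASTQ directly should be checked against that tool's documentation.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq: TextIO, fasta: TextIO) -> int:
    """Convert four-line FASTQ records to FASTA; return the record count.

    Assumption: each record is exactly four lines (header, sequence,
    '+' separator, qualities). Quality scores are discarded.
    """
    count = 0
    while True:
        header = fastq.readline()
        if not header:          # end of file
            break
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = fastq.readline()
        plus = fastq.readline()
        fastq.readline()        # quality line, not needed for FASTA
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header.strip()!r}")
        fasta.write(">" + header[1:].rstrip("\n") + "\n")
        fasta.write(seq.rstrip("\n") + "\n")
        count += 1
    return count


# Tiny in-memory example; real files would be opened with open(...).
fastq_text = "@read1\nACGT\n+\nIIII\n@read2 desc\nTTGA\n+\n!!!!\n"
out = io.StringIO()
n_records = fastq_to_fasta(io.StringIO(fastq_text), out)
fasta_text = out.getvalue()
print(n_records)
print(fasta_text, end="")
```

The function streams one record at a time, so memory use stays flat regardless of how many reads the file holds.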
What is the source of your data? Nucleotide, yes, but from a single organism or many organisms (metagenome)? Whole genome shotgun or amplicon marker amplification? All of these are important to know when deciding to either remove duplicate reads or cluster reads on similarity. It doesn't sound like you're ready to decide between two (UCLUST and CD-HIT) equally good (in my opinion) sequence clustering programs because you don't know why you want to cluster your reads.