Question

What Is A Better Choice For Clustering Illumina Sequences: CD-HIT Or UCLUST?

0

12.5 years ago

deepthithomaskannan ▴ 420

Hi all,

Which tool is better for clustering Illumina (FASTQ) sequences: CD-HIT or UCLUST? I have 2 billion sequences.

Thanks, Deeps

illumina clustering • 6.9k views

ADD COMMENT • link updated 12.5 years ago by Ryan Thompson ★ 3.7k • written 12.5 years ago by deepthithomaskannan ▴ 420

1

What do you hope to accomplish by clustering?

ADD REPLY • link 12.5 years ago by Ryan Thompson ★ 3.7k

0

Removing duplicate reads.

ADD REPLY • link 12.5 years ago by deepthithomaskannan ▴ 420

0

My sequences are nucleotide sequences. I have past experience using UCLUST to cluster 454 sequences, but I believe UCLUST won't take FASTQ files as input. Which CD-HIT program should I use, and how can I give FASTQ files directly to CD-HIT?

ADD REPLY • link 12.5 years ago by deepthithomaskannan ▴ 420

1

What is the source of your data? Nucleotide, yes, but from a single organism or many organisms (metagenome)? Whole genome shotgun or amplicon marker amplification? All of these are important to know when deciding whether to remove duplicate reads or to cluster reads by similarity. It doesn't sound like you're ready to choose between UCLUST and CD-HIT, two sequence clustering programs that are equally good in my opinion, because you don't yet know why you want to cluster your reads.

ADD REPLY • link 12.5 years ago by Josh Herr 5.8k

score 3 · Answer 1 · 2013-03-10

3

12.5 years ago

Ryan Thompson ★ 3.7k

Since you said in the comments that your goal is to remove duplicate reads, I'm going to address that issue. There is no need for clustering. If you are working with a reference genome, then you should map your reads to the reference and treat reads that map to the same genomic coordinate as duplicates. For example, you could use the "samtools rmdup" command. If you do not have a reference genome, then you will most likely want to use some kind of de novo assembly program based on de Bruijn graphs, in which case you would correct sequencing errors at the k-mer level, using something like khmer.
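To make the "same genomic coordinate" idea concrete, here is a minimal Python sketch. It is not how samtools is implemented; the `Alignment` record and the `drop_coordinate_duplicates` function are hypothetical names made up for illustration, and the sketch simply keeps the first single-end read seen at each (chromosome, position, strand). Real duplicate-removal tools work on BAM files and handle details this toy ignores, such as paired ends, clipping, and which copy to keep.

```python
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal record for a mapped single-end read."""
    name: str
    chrom: str
    pos: int      # leftmost mapped coordinate
    strand: str   # "+" or "-"


def drop_coordinate_duplicates(alignments: Iterable[Alignment]) -> Iterator[Alignment]:
    """Yield the first read seen at each (chrom, pos, strand); skip later ones."""
    seen = set()
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand)
        if key in seen:
            continue  # another read already occupies this coordinate
        seen.add(key)
        yield aln


reads = [
    Alignment("r1", "chr1", 100, "+"),
    Alignment("r2", "chr1", 100, "+"),  # same coordinate and strand as r1
    Alignment("r3", "chr1", 100, "-"),  # same coordinate, opposite strand
    Alignment("r4", "chr2", 100, "+"),
]
kept = [aln.name for aln in drop_coordinate_duplicates(reads)]
print(kept)  # ['r1', 'r3', 'r4']
```

Note that no sequence comparison happens at all: once reads are mapped, duplicates are identified by coordinate alone, which is why clustering is unnecessary for this purpose.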

For the whole-genome or whole-transcriptome applications that Illumina sequencing is typically used for, clustering reads by sequence identity is unlikely to produce a useful result. In most cases the reads are tiled along much longer fragments of DNA, so a clustering approach would segment each longer stretch more or less arbitrarily into poorly separated "clusters", each representing roughly one read length of the larger fragment.
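A toy simulation can illustrate the tiling problem. This is only a sketch under strong assumptions: reads are error-free, so two reads from the same fragment are identical wherever they overlap, and overlap length stands in for sequence similarity. The greedy centroid scheme, the 1000 bp fragment, the 5 bp tiling step, and the 80 bp overlap threshold are all hypothetical choices for illustration, not the actual algorithm or defaults of CD-HIT or UCLUST.

```python
import random

READ_LEN = 100        # read length mentioned in the thread
FRAGMENT_LEN = 1000   # hypothetical length of the longer DNA fragment
MIN_OVERLAP_BP = 80   # hypothetical threshold: bases a read must share with a centroid


def greedy_cluster(starts, read_len=READ_LEN, min_overlap_bp=MIN_OVERLAP_BP):
    """Greedy centroid clustering of error-free reads given by start position.

    Each read joins the first existing centroid it overlaps by at least
    min_overlap_bp bases; otherwise it becomes a new centroid.
    """
    clusters = {}  # centroid start -> list of member starts
    for s in starts:
        for c in clusters:
            overlap = read_len - abs(s - c)
            if overlap >= min_overlap_bp:
                clusters[c].append(s)
                break
        else:
            clusters[s] = [s]  # no centroid was close enough
    return clusters


def spans(clusters, read_len=READ_LEN):
    """Return the sorted (start, end) interval of the fragment covered by each cluster."""
    return sorted((min(m), max(m) + read_len) for m in clusters.values())


# Reads tiled every 5 bp along the fragment.
starts = list(range(0, FRAGMENT_LEN - READ_LEN + 1, 5))

# Cluster the same reads in two different input orders and compare.
for seed in (1, 2):
    order = starts[:]
    random.Random(seed).shuffle(order)
    result = greedy_cluster(order)
    print(f"input order {seed}: {len(result)} clusters, spans {spans(result)}")
```

Every cluster covers a window only slightly longer than one read, and the window boundaries depend on the order in which the reads happened to be processed rather than on anything biological.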

ADD COMMENT • link 12.5 years ago by Ryan Thompson ★ 3.7k

0

"Clustering is not likely to produce a useful result for short reads"

This is highly dependent on what type of data someone has. What you explained here assumes "whole genome shotgun" sequencing, and you are absolutely correct that clustering raw data of that kind would be worthless and misleading. But if someone has amplicon data from a metagenome or a population study, it's absolutely essential to cluster short reads, or to do some sort of analogous data comparison.

ADD REPLY • link 12.5 years ago by Josh Herr 5.8k

0

I am dealing with sequences that are not aligned to a reference genome.

ADD REPLY • link 12.5 years ago by deepthithomaskannan ▴ 420

0

Could you elaborate on the nature of your data, then?

ADD REPLY • link 12.5 years ago by Ryan Thompson ★ 3.7k

0

2 billion single-read sequences, 100 bp in length, Illumina nucleotide sequences.

ADD REPLY • link 12.5 years ago by deepthithomaskannan ▴ 420

0

I meant the biological nature, not the technical specifications. It matters whether it's RNA-seq, genomic DNA-seq, exome-seq, ChIP-seq, RIP-seq, metagenome sequencing, etc. Clustering might produce a useful result for some of those, but definitely not for others. The read length and the number of reads don't matter nearly as much. What biological question are you trying to answer, and why do you think that clustering by sequence identity is the way to answer it?

ADD REPLY • link 12.5 years ago by Ryan Thompson ★ 3.7k