Hello, can anyone help me with the command to run CD-HIT for clustering assembled metagenomic data? I also need to know how I can remove redundant sequences from the assembly using CD-HIT.
To cluster (put similar reads "together") you can start with this:
cd-hit-est -i reads.fa -o output.fa -c 0.95 -n 10 -d 999 -M 0 -T 0
For more info see https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST
The option -c sets the global sequence identity threshold, so in this example all reads that are at least 95% similar will be put together. For redundancy removal I guess you need to set this to -c 1
BUT! Keep in mind that this is a global alignment so for example the following reads:
>read1
AAAA
>read2
AAAAA
are not 100% the same. So what does redundancy mean in your case?
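To make that question concrete, here is a minimal sketch in plain Python (not CD-HIT itself) of two possible meanings of "redundant": exact duplicates versus reads fully contained in a longer read. The function names are made up for illustration, and the reads are the toy example from above.

```python
def exact_duplicates_removed(records):
    """Keep the first record for each distinct sequence string."""
    seen = set()
    kept = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq))
    return kept


def contained_removed(records):
    """Drop any sequence that is a substring of an already kept sequence."""
    kept = []
    # Longest first, so a shorter read is always checked against longer ones.
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(seq in kept_seq for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


reads = [("read1", "AAAA"), ("read2", "AAAAA")]
print(exact_duplicates_removed(reads))  # both reads survive
print(contained_removed(reads))         # only read2 survives
```

Under the first definition nothing is removed here; under the second, read1 goes away. Which one you want decides how you should set up the clustering.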
The output (output.fa) will contain the representative sequences. In practice (sort of), cd-hit first sorts your input FASTA by read length, longest first. It then goes through the sorted reads from top to bottom. At the very first read there are no clusters yet, so it becomes the representative read of the first cluster. If the second read is at least 95% similar, it becomes part of that first cluster; if it is not 95% similar, it starts a new cluster. Let's say those two reads are similar: then your output file will contain only one sequence, so the redundancy is removed.
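The procedure described above can be sketched roughly like this. It is only an illustration of the greedy idea, not CD-HIT's actual code: `toy_identity` is a stand-in similarity I made up using Python's `difflib`, whereas the real tool uses its own word filtering and alignment. The read names and sequences are hypothetical.

```python
from difflib import SequenceMatcher


def toy_identity(a, b):
    """Stand-in similarity: matched characters divided by the shorter length.

    This is NOT CD-HIT's alignment; it only makes the sketch runnable.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(records, threshold=0.95):
    """Greedy incremental clustering: longest read first, first match wins."""
    clusters = []  # each entry: ((rep_name, rep_seq), [member names])
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        for rep, members in clusters:
            if toy_identity(seq, rep[1]) >= threshold:
                members.append(name)  # similar enough: join this cluster
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append(((name, seq), [name]))
    return clusters


reads = [
    ("a", "ACGTACGTACGTACGTACGT"),
    ("b", "ACGTACGTACGTACGTACG"),
    ("c", "TTTTTTTTTTGGGGGGGGGG"),
]
clusters = greedy_cluster(reads)
# Only the representatives would end up in the output FASTA.
print([rep[0] for rep, _ in clusters])
```

Here "b" is absorbed into the cluster represented by "a", while "c" starts its own cluster, so the representative list is shorter than the input.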
What have you tried so far (e.g. reading the manual or the paper)?
On the redundancy part: CD-HIT will automatically merge (and thus remove) redundant sequences, so you don't need to do anything special for that.
Ah, and do follow up on your earlier questions as well (quite similar to this one, apparently): Removing Contigs and Redundant Sequences.
I have read the User's Guide, but there are so many options that it confuses me, thanks. And sorry, I will follow up on my previous question.