CD-HIT for removing sequence redundancy
Hi all
I want to use CD-HIT to remove redundancy from a file of sequences collected from miRBase. They are miRNA sequences. What command line should I use for that?
Thanks
sequence
cat all_mapped.fastq | paste - - - - | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n' > file_out
time ./cd-hit -i file_out -o otput_cd_hit -M 8000 -T 3
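The conversion above starts from a FASTQ file, whereas a miRBase download is already FASTA. Note also that cd-hit itself targets protein sequences, and cd-hit-est is the nucleotide counterpart in the same package. Below is a rough sketch for a miRNA FASTA file, not a tested recipe. The file names and the three records are made up for illustration, the U-to-T conversion is a precaution on the assumption that the tool expects a DNA alphabet, and -c 1.0 (collapse identical sequences only) with -n 10 is just one possible choice of threshold and word size.

```shell
# Hypothetical stand-in for a miRBase FASTA download (RNA alphabet).
cat > mirna_example.fa <<'EOF'
>seq1 example
UGAGGUAGUAGGUUGUAUAGUU
>seq2 example
UGAGGUAGUAGGUUGUAUAGUU
>seq3 example
UAGCAGCACGUAAAUAUUGGCG
EOF

# Convert U to T on sequence lines only; header lines (starting with ">") are untouched.
sed '/^>/!y/Uu/Tt/' mirna_example.fa > mirna_example.dna.fa

# Run the clustering only if cd-hit-est is on the PATH (assumption: it may not be installed).
# -c 1.0 : identity threshold; -n 10 : word size; -d 0 : use the header up to the first space
# -M / -T : memory limit (MB) and threads, as in the command above
if command -v cd-hit-est >/dev/null 2>&1; then
    cd-hit-est -i mirna_example.dna.fa -o mirna_example.nr.fa \
        -c 1.0 -n 10 -d 0 -M 8000 -T 3
fi
```

If it runs, mirna_example.nr.fa should hold one representative per cluster and mirna_example.nr.fa.clstr the cluster membership. Lowering -c would merge near-identical miRNAs as well, which may or may not be what you want for family members.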
Then you can examine each cluster with something like:
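for i in *.clstr; do
    # Label each output line with the third "_"-separated field of the file name
    echo -n "$(echo "$i" | cut -f 3 -d '_') "
    # Tally the values in the first tab-separated column, then summarise the tallies
    cut -f 1 "$i" | sort | uniq -c | awk '
        { val += $1; count += 1; if ($1 == 1) sing += 1 }
        END { printf("cov: %.2f\tsingletons: %d\tuniq: %d\ttotal: %d\n", val/count, sing, count, val) }'
done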
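If you just want the number of sequences in each cluster, the .clstr file can also be read directly. This is a minimal sketch and assumes the usual layout, where every cluster starts with a ">Cluster N" line followed by one line per member. The example file and its contents are invented so the snippet can be run on its own.

```shell
# Hypothetical .clstr content: cluster 0 has two members, cluster 1 has one.
printf '>Cluster 0\n0\t22nt, >seq1... *\n1\t22nt, >seq2... at +/100.00%%\n>Cluster 1\n0\t22nt, >seq3... *\n' > example.clstr

# Print "cluster_id <TAB> member_count" for every cluster.
awk '
    /^>Cluster/ {                      # a new cluster starts
        if (seen) print id "\t" n      # flush the previous cluster, if any
        id = $2; n = 0; seen = 1
        next
    }
    { n++ }                            # any other line is a cluster member
    END { if (seen) print id "\t" n }  # flush the last cluster
' example.clstr > cluster_sizes.tsv

cat cluster_sizes.tsv
```

Clusters with a count of 1 are singletons, so piping the table through something like awk '$2 == 1' would list them.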
What have you tried?
If you only want to deduplicate the sequences, then dedupe.sh from BBMap may be much simpler to use.
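A rough sketch of what that could look like is below. The input file and its records are made up, and the call is only attempted if dedupe.sh is on the PATH. The ac=f setting is my assumption about what you want: as far as I know it turns off absorbing sequences that are fully contained in longer ones, so only exact duplicates are dropped.

```shell
# Hypothetical input with one exact duplicate (seq1 and seq2).
cat > dedupe_example.fa <<'EOF'
>seq1
TGAGGTAGTAGGTTGTATAGTT
>seq2
TGAGGTAGTAGGTTGTATAGTT
>seq3
TAGCAGCACGTAAATATTGGCG
EOF

# Deduplicate only if BBMap's dedupe.sh is available (assumption: it may not be installed).
# in= / out= : input and output files; ac=f : do not absorb contained sequences
# overwrite=t : allow re-running without deleting the previous output first
if command -v dedupe.sh >/dev/null 2>&1; then
    dedupe.sh in=dedupe_example.fa out=dedupe_example.dedup.fa ac=f overwrite=t
fi
```

With short miRNA sequences it is probably worth checking the output count against the input before trusting the default settings.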