4.7 years ago
usr2
Hi,
I have a set of fastq reads that I would like to cluster, independent of read length.
Given this input:
AAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAA
AAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBB
BBBBBB
I would like the output of my data to be:
AAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBB
Do you know what the best way to implement this would be?
Thanks
You can try clumpify.sh from the BBMap suite with the containment=t option. Read more about clumpify here: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates. There is also a guide available.
If this does not keep the longest representation of identical reads, you could filter your reads with reformat.sh or bbduk.sh (both from the BBMap suite) with a minlength= option.
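To make the expected result concrete, here is a minimal pure-Python sketch of the containment idea on the example from the question. It is not how clumpify works internally; the function name collapse_contained is made up for illustration, and it assumes exact substring matches only (no mismatches, no reverse complements, no quality scores). It compares every read against all kept reads, so it is only practical for small inputs.

```python
def collapse_contained(reads):
    """Drop every read that is an exact substring of a longer (or identical) read.

    Hypothetical helper for illustration; assumes exact matches only.
    """
    kept = []
    # Process longest reads first, so each read is only checked against
    # reads that are at least as long as itself.
    for read in sorted(reads, key=len, reverse=True):
        if not any(read in longer for longer in kept):
            kept.append(read)
    return kept


reads = [
    "AAAAAAAAAAAAAAAAAAAAAAAAA",
    "AAAAAAAAAAAAAAA",
    "AAAAAAA",
    "BBBBBBBBBBBBBBBBBBBBBBBBB",
    "BBBBBBBBBBBBB",
    "BBBBBB",
]

for seq in collapse_contained(reads):
    print(seq)
```

On the six example reads this prints only the two 25-character sequences. For a real fastq file you would still need to parse the records and carry the headers and qualities along, which the tools above already handle.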