Clustering Illumina reads with different lengths

4.7 years ago

usr2 ▴ 10

Hi,

I have a set of FASTQ reads that I would like to cluster independently of read length, keeping only the longest read of each cluster.

Given this input:

AAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAA

AAAAAAA

BBBBBBBBBBBBBBBBBBBBBBBBB

BBBBBBBBBBBBB

BBBBBB

I would like the output to be:

AAAAAAAAAAAAAAAAAAAAAAAAA

BBBBBBBBBBBBBBBBBBBBBBBBB

Do you know what would be the best way to implement this?

Thanks

clustering sequencing • 460 views


You can try clumpify.sh from the BBMap suite with the containment=t option. You can read more about Clumpify in the post "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates." There is also a guide available.
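To make the containment idea concrete, here is a minimal Python sketch run on the toy data from the question. It is not how clumpify.sh works internally; it only illustrates the expected result: a read is dropped when it is an exact substring of a longer read that has already been kept. The function name `collapse_contained` is made up for this example, and the approach is quadratic, so it only suits small read sets. Also note that, as far as I recall, containment= is listed among Clumpify's deduplication options, so it probably needs to be combined with dedupe=t; check the script's built-in help to confirm.

```python
def collapse_contained(reads):
    """Return reads that are not exact substrings of a longer kept read."""
    kept = []
    # Remove exact duplicates while preserving input order, then visit the
    # longest reads first so each container is seen before its fragments.
    unique_reads = list(dict.fromkeys(reads))
    for read in sorted(unique_reads, key=len, reverse=True):
        if not any(read in longer for longer in kept):
            kept.append(read)
    return kept


# Toy data from the question (placeholder letters, not real bases).
reads = [
    "A" * 25,
    "A" * 15,
    "A" * 7,
    "B" * 25,
    "B" * 13,
    "B" * 6,
]

for read in collapse_contained(reads):
    print(read)
```

With real reads you would also have to decide whether mismatches or reverse complements should count as contained; this sketch handles exact, same-strand matches only.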

If this does not keep the longest representative of each group of matching reads, you could filter your reads by length with reformat.sh or bbduk.sh (both from the BBMap suite) using the minlength= option.
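For reference, a length filter of this kind simply discards records whose sequence is shorter than a threshold. The sketch below shows the idea in Python; it is not BBTools code, the name `filter_min_length` is hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import io


def filter_min_length(fastq_lines, min_length):
    """Yield 4-line FASTQ records whose sequence has at least min_length bases."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record = [header, sequence, separator, qualities]
            if len(record[1]) >= min_length:
                yield tuple(record)
            record = []


# Two made-up records: one of 10 bases, one of 4 bases.
example = io.StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGT\n+\nIIII\n"
)

for header, seq, plus, qual in filter_min_length(example, min_length=8):
    print(header, seq)
```

A fixed threshold only works if the full-length reads are clearly longer than the fragments, as in the example from the question.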

4.7 years ago by GenoMax 147k
