Question

Collapse Reads That Are Substrings Of Other Reads In Same Library

0

11.2 years ago

Click downvote ▴ 720

We have a file of millions of reads and would like to remove all reads that are substrings of other reads in the same library.

Are there any programs that do this, or do we have to code it from scratch?

reads library • 2.7k views

ADD COMMENT • link updated 11.2 years ago by Josh Herr 5.8k • written 11.2 years ago by Click downvote ▴ 720

0

This is interesting. I'm curious what you hope to achieve by this.

ADD REPLY • link 11.2 years ago by Devon Ryan 104k

0

Let's say you have a library containing the reads ACTG and ACTGA. When you try to build clusters from this library, both reads will hit the same spot, even though it is unlikely that an RNA cluster contains two almost identical reads at the same position.

ADD REPLY • link 11.2 years ago by Click downvote ▴ 720

score 6 · Answer 1 · 2013-09-11

6

11.2 years ago

Ido Tamir 5.2k

SGA's rmdup command can do this:

>cat reads.fa
@r1
AGATTTTTAGGG
+
BBBBBBBBBBBB
@r2
TTTTTA
+
BBBBBB
@r3
TTTTTT
+
BBBBBB
@r4
AGATTTTTAGGG
+
BBBBBBBBBBBB

>sga index reads.fa
>sga rmdup reads.fa

==> reads.rmdup.dups.fa <==
>r2,seqrank=1 r2 NumDuplicates=1
TTTTTA
>r4,seqrank=3 r4 NumDuplicates=2
AGATTTTTAGGG

==> reads.rmdup.fa <==
>r1 r1 NumDuplicates=2
AGATTTTTAGGG
>r3 r3 NumDuplicates=1
TTTTTT
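
For readers who want to see what "collapsing substrings" means on the four example reads, here is a minimal Python sketch. It is not how SGA works internally (SGA operates on the index built by `sga index`); it is a naive pairwise check that only looks at the forward strand, and the function name `collapse_substring_reads` is made up for illustration. It is fine for a toy input but far too slow for millions of reads.

```python
def collapse_substring_reads(reads):
    """Split (name, sequence) pairs into (kept, removed).

    A read is removed if its sequence is identical to, or a substring of,
    the sequence of a read that has already been kept.
    """
    # Longest first, so any read that could contain another is seen before it.
    # sorted() is stable, so equal-length reads stay in their input order.
    ordered = sorted(reads, key=lambda read: len(read[1]), reverse=True)

    kept, removed = [], []
    for name, seq in ordered:
        if any(seq in kept_seq for _, kept_seq in kept):
            removed.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, removed


# The four reads from reads.fa above (quality strings omitted).
reads = [
    ("r1", "AGATTTTTAGGG"),
    ("r2", "TTTTTA"),
    ("r3", "TTTTTT"),
    ("r4", "AGATTTTTAGGG"),
]

kept, removed = collapse_substring_reads(reads)
print("kept:   ", [name for name, _ in kept])
print("removed:", [name for name, _ in removed])
```

On this input the sketch keeps r1 and r3 and removes r2 and r4, the same split as the two output files shown above.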

ADD COMMENT • link 11.2 years ago by Ido Tamir 5.2k

0

Thank you. Appreciated. P.S. For me, reads.rmdup.dups.fa only contains TTTTTA for some reason.

ADD REPLY • link 11.2 years ago by Click downvote ▴ 720

score 0 · Answer 2 · 2013-09-11

0

11.2 years ago

Josh Herr 5.8k

See also the following posts, which may be useful for your strategy: Duplicate Paired-End Illumina Reads, Remove Duplicate Reads From Fasta File, and Removing Duplicate Reads Post Alignment. There are also numerous posts on removing RNA-Seq reads that may be valuable...

ADD COMMENT • link 11.2 years ago by Josh Herr 5.8k

0

These are all for more-or-less identical reads, not for substrings.

ADD REPLY • link 11.2 years ago by Ido Tamir 5.2k

0

I am aware of that -- that's why I wrote "the following posts which may be useful in your strategy" -- am I wrong in assuming that there might be some similarity in context between identifying exact strings and matching substrings? Plus 1 on your answer above.

ADD REPLY • link 11.2 years ago by Josh Herr 5.8k
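
One way the two problems discussed in this exchange relate: exact-duplicate removal can serve as a cheap first pass before the more expensive substring check. The sketch below is only an illustration under that assumption. The function names are hypothetical, the reads ACTG and ACTGA come from the example earlier in the thread, and GGCC is an invented extra read.

```python
def drop_exact_duplicates(seqs):
    """Keep the first occurrence of each sequence (set lookup per read)."""
    seen = set()
    unique = []
    for seq in seqs:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique


def drop_substrings(seqs):
    """Drop every sequence contained in a longer (or equal) kept sequence.

    Naive pairwise comparison: acceptable for a toy example only.
    """
    kept = []
    for seq in sorted(seqs, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


library = ["ACTG", "ACTGA", "ACTGA", "GGCC"]

unique = drop_exact_duplicates(library)   # exact duplicates: hashing is enough
collapsed = drop_substrings(unique)       # substrings: needs containment checks

print(unique)
print(collapsed)
```

Exact duplicates can be found with a hash lookup per read, whereas substring containment needs some form of text index to scale, which is why a duplicate-only tool does not solve the substring case by itself.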