Collapse Reads That Are Substrings Of Other Reads In Same Library
11.2 years ago

We have a file of millions of reads and would like to remove all reads that are substrings of other reads in the same library.

Are there any programs that do this, or do we have to code it from scratch?

reads library • 2.7k views

This is interesting, I'm curious what you hope to achieve by this.


Let's say you have a library of ACTG and ACTGA. When you try to build clusters from this library, both reads will hit the same spot, even though it is unlikely that an RNA cluster contains two almost identical reads at the same position.

11.2 years ago
Ido Tamir 5.2k

SGA rmdup can do this

>cat reads.fa
@r1
AGATTTTTAGGG
+
BBBBBBBBBBBB
@r2
TTTTTA
+
BBBBBB
@r3
TTTTTT
+
BBBBBB
@r4
AGATTTTTAGGG
+
BBBBBBBBBBBB

 >sga index reads.fa
 >sga rmdup reads.fa

==> reads.rmdup.dups.fa <==
>r2,seqrank=1 r2 NumDuplicates=1
TTTTTA
>r4,seqrank=3 r4 NumDuplicates=2
AGATTTTTAGGG

==> reads.rmdup.fa <==
>r1 r1 NumDuplicates=2
AGATTTTTAGGG
>r3 r3 NumDuplicates=1
TTTTTT
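
For illustration only, the same collapse can be sketched in plain Python on the four reads above. This is not how SGA works internally (SGA builds an FM-index with `sga index`); it is a naive quadratic scan, so it is only practical for small inputs, not for millions of reads. The function name `collapse_substrings` and the `(name, sequence)` tuples are made up for this example, and reverse complements are not considered.

```python
def collapse_substrings(reads):
    """Split reads into (kept, dropped).

    A read is dropped if it is identical to, or a substring of,
    a read that has already been kept. Hypothetical helper, naive O(n^2).
    """
    kept, dropped = [], []
    # Longest first, so a read can only be contained in one examined earlier.
    # sorted() is stable, so equal-length reads keep their input order.
    for name, seq in sorted(reads, key=lambda r: len(r[1]), reverse=True):
        if any(seq in kept_seq for _, kept_seq in kept):
            dropped.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, dropped


reads = [
    ("r1", "AGATTTTTAGGG"),
    ("r2", "TTTTTA"),
    ("r3", "TTTTTT"),
    ("r4", "AGATTTTTAGGG"),
]
kept, dropped = collapse_substrings(reads)
print([name for name, _ in kept])     # ['r1', 'r3']
print([name for name, _ in dropped])  # ['r4', 'r2']
```

On this input the kept and dropped sets agree with the two rmdup output files shown above: r2 is contained in r1, r4 is an exact copy of r1, and r3 survives because the longer reads contain only five consecutive Ts.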

Thank you. Appreciated. P.S. For me, reads.rmdup.dups.fa only contains TTTTTA for some reason.

11.2 years ago
Josh Herr 5.8k

See also the following posts which may be useful in your strategy: Duplicate Paired-End Illumina Reads, Remove Duplicate Reads From Fasta File, and Removing Duplicate Reads Post Alignment. There are also numerous posts on removing RNA-Seq reads, which may be valuable...


These are all for more-or-less identical reads, not for substrings.


I am aware of that -- that's why I wrote "the following posts which may be useful in your strategy" -- am I wrong in assuming that there might be some similarity in context between identifying exact strings and matching substrings? Plus 1 on your answer above.
