We have a file of millions of reads and would like to remove all reads that are substrings of other reads in the same library.
Are there any programs that do this, or do we have to code it from scratch?
SGA rmdup can do this. For example:
>cat reads.fa
@r1
AGATTTTTAGGG
+
BBBBBBBBBBBB
@r2
TTTTTA
+
BBBBBB
@r3
TTTTTT
+
BBBBBB
@r4
AGATTTTTAGGG
+
BBBBBBBBBBBB
.
>sga index reads.fa
>sga rmdup reads.fa
.
==> reads.rmdup.dups.fa <==
>r2,seqrank=1 r2 NumDuplicates=1
TTTTTA
>r4,seqrank=3 r4 NumDuplicates=2
AGATTTTTAGGG
==> reads.rmdup.fa <==
>r1 r1 NumDuplicates=2
AGATTTTTAGGG
>r3 r3 NumDuplicates=1
TTTTTT
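If you do end up coding it yourself, here is a minimal Python sketch of the same idea, run on the four reads from the example above. It is not how SGA works internally. It is a naive all-against-all comparison, so it only illustrates the logic and will be far too slow for millions of reads, where an index-based tool is the practical choice. The function name `remove_contained_reads` is made up for illustration, and the sketch assumes the forward strand only (no reverse-complement matching) and reads already loaded as (name, sequence) pairs.

```python
def remove_contained_reads(reads):
    """Split reads into (kept, removed).

    reads: list of (name, sequence) tuples.
    A read is removed if its sequence is a substring of (or identical to)
    a read that has already been kept.
    """
    # Process longest reads first, so that any read which could contain
    # the current one has already been seen. sorted() is stable, so among
    # equal-length reads the first one in the input is the one kept.
    ordered = sorted(reads, key=lambda r: len(r[1]), reverse=True)

    kept, removed = [], []
    for name, seq in ordered:
        # Naive check against every kept read: quadratic in the worst case.
        if any(seq in other for _, other in kept):
            removed.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, removed


reads = [
    ("r1", "AGATTTTTAGGG"),
    ("r2", "TTTTTA"),
    ("r3", "TTTTTT"),
    ("r4", "AGATTTTTAGGG"),
]
kept, removed = remove_contained_reads(reads)
print("kept:   ", [name for name, _ in kept])
print("removed:", [name for name, _ in removed])
```

On this input r1 and r3 are kept, while r2 (a substring of r1) and r4 (an exact copy of r1) are removed, the same split as in the rmdup output above.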
See also the following posts, which may be useful for your strategy: Duplicate Paired-End Illumina Reads, Remove Duplicate Reads From Fasta File, and Removing Duplicate Reads Post Alignment. There are also numerous posts on removing RNA-Seq reads that may be valuable.
This is interesting. I'm curious what you hope to achieve by this.
Let's say you have a library containing the reads ACTG and ACTGA. When you try to make clusters from this library, both of these reads will hit the same spot, even though it is unlikely that an RNA cluster contains two almost identical reads at the same position.