I'm in the need to filter our RNA-seq data for reads that contain a polyA stretch (e.g. with more than 6 A's). I need to recover those reads, not discard them.
The data is paired end and stranded. So, when dealing with paired reads in 2 files they should always be processed together, not one at a time.
Then, I would like to trim the polyA stretch and align to the reference genome only those reads that originally cointained a polyA tail.
I am aware that conventional RNA-seq is not suitable for identifying polyA sites precisely, but I am willing to try because we don't have any other data.
Does anybody know a tool that does this? I have tried bbduk.sh from BBMap suite for filtering and trimming, as suggested in this post. I used this command:
bbduk.sh in1=A1_R1.fastq.gz in2=A1_R2.fastq.gz outm1=A1_R1_polyA.fastq.gz outm2=A1_R2_polyA.fastq.gz literal=AAAAAA
However, I encountered some problems: 1) the output cointains 0 reads while it was supposed to contain the polyA reads with AAAAAA, and I can't figure it out why 2) how could I indicate that I want reads with a polyA having 6 to let's say 20 A's?
The purpose of the analysis is to identify alternative polyAdenylation events. I am working with a parasite genome that is not well annotated.
Maybe I'm drowning in a cup of water and there is a more straightforward solution, so any help will be appreciated.
Try adding
k=3
to allow initial matches.