Question

Identifying RNA-seq reads containing polyA stretch

0

3.1 years ago

MACRODER ▴ 10

I need to filter our RNA-seq data for reads that contain a polyA stretch (e.g. more than 6 A's). I want to keep those reads, not discard them.

The data is paired-end and stranded, so the two files of paired reads should always be processed together, not one at a time.

Then I would like to trim the polyA stretch and align to the reference genome only those reads that originally contained a polyA tail.

I am aware that conventional RNA-seq is not suitable for identifying polyA sites precisely, but I am willing to try because we don't have any other data.

Does anybody know a tool that does this? I have tried bbduk.sh from BBMap suite for filtering and trimming, as suggested in this post. I used this command:

bbduk.sh in1=A1_R1.fastq.gz in2=A1_R2.fastq.gz outm1=A1_R1_polyA.fastq.gz outm2=A1_R2_polyA.fastq.gz literal=AAAAAA

However, I encountered two problems: 1) the output contains 0 reads, although it was supposed to contain the polyA reads with AAAAAA, and I can't figure out why; 2) how can I specify that I want reads with a polyA stretch of 6 to, say, 20 A's?

The purpose of the analysis is to identify alternative polyadenylation events. I am working with a parasite genome that is not well annotated.

Maybe I'm drowning in a cup of water and there is a more straightforward solution, so any help will be appreciated.

RNA-seq polyA read • 1.5k views

ADD COMMENT • link updated 3.1 years ago by Istvan Albert 101k • written 3.1 years ago by MACRODER ▴ 10

1

Try adding k=3 to allow initial matches.

ADD REPLY • link 3.1 years ago by GenoMax 147k
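
A possible explanation for the empty output, assuming BBDuk's documented default of k=27: a reference sequence shorter than k yields no k-mers, so the 6-base literal has nothing to match with. The toy function below is not BBDuk code; `kmers` is a hypothetical helper that only illustrates the counting.

```python
def kmers(seq, k):
    """Return every k-length substring of seq; empty when len(seq) < k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


literal = "AAAAAA"

# Assumed default k=27: the literal is shorter than k, so no k-mers exist.
print(len(kmers(literal, 27)))

# k equal to the literal length: exactly one k-mer, the literal itself.
print(kmers(literal, 6))

# k=3: several k-mers, all "AAA", so any read with three A's in a row can match.
print(kmers(literal, 3))
```

Under that assumption, any k no larger than the literal length gives matches; k=6 would correspond to the "6 A's" threshold in the question, while k=3 is more permissive.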

score 1 · Answer 1 · 2021-10-18

1

3.1 years ago

Istvan Albert 101k

I would use a tool designed to trim FASTQ data, such as cutadapt or fastp.

With those tools you can not only trim the data but also select the reads that contain the adapter; then you can align them and check whether these events occur at the same locations.

https://cutadapt.readthedocs.io/en/stable/

https://github.com/OpenGene/fastp

ADD COMMENT • link 3.1 years ago by Istvan Albert 101k
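
For anyone who wants to see the select-then-trim logic spelled out independently of a particular tool, here is a minimal pure-Python sketch. It is not how cutadapt, fastp or bbduk.sh work internally. The assumptions are mine: a polyA stretch means a run of at least 6 A's (the threshold from the question), the mate may show the tail as a run of T's because the library is stranded, and everything from the first A-run to the read end is treated as tail. All function names are hypothetical.

```python
import gzip
import re
from itertools import islice

MIN_A = 6  # assumption: threshold taken from the question
POLYA = re.compile(r"A{%d,}" % MIN_A)
POLYT = re.compile(r"T{%d,}" % MIN_A)


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from an iterator of lines."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def has_polya(seq):
    """True if the read has a run of A's, or of T's (tail seen from the other strand)."""
    return bool(POLYA.search(seq) or POLYT.search(seq))


def trim_tail(record):
    """Cut a read at the start of its first A-run; T-runs are left untouched."""
    header, seq, plus, qual = record
    match = POLYA.search(seq)
    if match is None:
        return record
    cut = match.start()
    return (header, seq[:cut], plus, qual[:cut])


def filter_pairs(r1_records, r2_records):
    """Yield both mates together whenever either mate has a polyA/polyT run."""
    for r1, r2 in zip(r1_records, r2_records):
        if has_polya(r1[1]) or has_polya(r2[1]):
            yield trim_tail(r1), trim_tail(r2)


def filter_files(in1, in2, out1, out2):
    """Read two gzipped FASTQ files in lockstep and write the selected pairs."""
    with gzip.open(in1, "rt") as f1, gzip.open(in2, "rt") as f2, \
            gzip.open(out1, "wt") as o1, gzip.open(out2, "wt") as o2:
        for r1, r2 in filter_pairs(read_fastq(f1), read_fastq(f2)):
            o1.write("\n".join(r1) + "\n")
            o2.write("\n".join(r2) + "\n")


# Example call, using the file names from the question:
# filter_files("A1_R1.fastq.gz", "A1_R2.fastq.gz",
#              "A1_R1_polyA.fastq.gz", "A1_R2_polyA.fastq.gz")
```

Because both files are walked in lockstep, mates stay in sync in the output. Trimming can leave very short or even empty reads, so a minimum-length filter before alignment would be sensible, and a dedicated trimmer will be much faster on real data.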

0

That is what OP is doing with bbduk.sh: using it in filter mode to select reads, just not with the right parameters.

ADD REPLY • link 3.1 years ago by GenoMax 147k

1

The problem with bbduk.sh is that it does so many different things, and none of them is all that well documented. It is exceedingly easy to misuse (as the question shows). The OP would be better served to keep up with the times.

ADD REPLY • link 3.1 years ago by Istvan Albert 101k