This may seem like a weird question, but we need to filter our RNA-seq data for reads that contain polyA. The data is stranded RNA-seq with 50 bp reads. Would it be easier to find these reads before or after alignment? To be clear, the 50 bp read itself needs to contain a stretch of polyA, not just come from a transcript containing polyA. Has anyone done this type of analysis?
Thanks! I have some additional questions then. Since it is stranded RNA-seq, the polyA will actually show up as stretches of TTTTTT, right? Also, I used grep to look for reads containing this, and many of the TTTTTT stretches are in the middle of a read. It doesn't seem possible that the polyA could be surrounded by other sequence on both ends.
If you are capturing the second strand, then yes. Past the TTTTT, the sequence may be running into adapters. You can easily check that by trimming the reads you filter and select.
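If grep gets unwieldy, here is a minimal Python sketch of the same pre-alignment filter that also reports where in the read the run sits, which might help with the "middle of the read" question. The function names and the minimum run length of 6 (taken from the TTTTTT pattern above) are assumptions for illustration, not a standard tool or threshold.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple


def find_homopolymer(seq: str, base: str = "T", min_len: int = 6) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first run of `base` at least `min_len` long, or None."""
    # min_len=6 is an assumed threshold; adjust it for your data.
    match = re.search(f"{re.escape(base)}{{{min_len},}}", seq.upper())
    return (match.start(), match.end()) if match else None


def classify_run(seq: str, base: str = "T", min_len: int = 6) -> Optional[str]:
    """Label where the first qualifying run sits: 'start', 'end', 'internal', or None."""
    hit = find_homopolymer(seq, base, min_len)
    if hit is None:
        return None
    start, end = hit
    if start == 0:
        return "start"
    if end == len(seq):
        return "end"
    return "internal"


def iter_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from plain 4-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups lines into records;
    # a truncated final record is silently dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def filter_reads(lines: Iterable[str], base: str = "T", min_len: int = 6):
    """Yield (header, sequence, quality, label) for reads containing a qualifying run."""
    for header, seq, qual in iter_fastq(lines):
        label = classify_run(seq, base, min_len)
        if label is not None:
            yield header, seq, qual, label
```

To use it, pass an open FASTQ file handle to filter_reads, with base="A" if your library reads the same strand as the transcript. For reads labelled "internal", the sequence on the far side of the run is what you would compare against your adapter sequence, or trim and re-inspect, as suggested above.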