I am pretty new to the RNAseq analysis environment and have been running human tumour experiments on Illumina Novoseq X (paired end 150bp).
I have found significant polyA presence after removal with Cutadapt of adapter sequences seen in my FASTQC report. The FASTQC, after adapter removal, suggests polyA sequences in ~50% of my reads. These persist despite activating the polyA trimming function within Cutadapt.
I believe these are causing my mapping rate during alignment with STAR to fail (unique mapping rates ~15%, multi-mapping ~35% and unmapped ~50%).
Looking at some of my sequences, it appears the polyA runs are within my transcripts rather than at the ends. I have pulled out the transcripts with runs of >17 As (17 chosen as an arbitrary number). Code used to do so:
zgrep -E -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore2.txt
An example of the following output sequences from read 1 of one of my samples:
@LH00235:309:22GF2WLT4:6:1101:26887:1042 1:N:0:TNAGGCGA+GNGATCTA
GTNCAGTAGTTTTCGGTGTTTGTTAGTAAAGTTAGGAGGAGGAGGANGTTGCAGTGAGCCAAGATCGCGCCACTGCACTCCAGCCTAGATGACAGAGCAAGACGCTGTCTCAAAAAAAAAAAAAAAAAAAAAACGTCT>
+
II#-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9I9IIIIIIIIIIII-III9>
--
@LH00235:309:22GF2WLT4:6:1101:35699:1056 1:N:0:TNAGGCGA+GCGATCTA
GGNGTTGGGAGGATCATCTGAGCCTGGGAGGTCAAGGCTAAAGTGAGCCATGACGGTGCAGAAAAAAAAAAAAAAAAAACCAAAGTTCCAACTCTCTGGTCCCCCCAAGCCTTTGACTTCTTTTTTTCCCTCCCCAAA>
+
II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIII-I-II-999-99-------9-9-9----9--9-9-9--9---99---I-----9--9-9>
--
@LH00235:309:22GF2WLT4:6:1101:45831:1056 1:N:0:TNAGGCGA+GCGATCTA
GGNTCCGGAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAACCCCCCACCCGCCCCTCCCCCCCCCCCCCCTCAACCCCCCCCCCCCCCCC>
+
I9#IIIIIIIIIIIIIIIIIIIIIIII99III9II9IIIIIII9IIIIIII9IIIII9IIIIII9IIIIIIIIIIIIIIIIIIII-II-III99-9-I--II--II9I9-I-I-9-I9-------999999----99->
--
I want to try and remove these polyA sequences and then try to re-align to troubleshoot and see if this is why my mapping rate is so low. To isolate the remaining sequences with <17 runs of PolyAs (again arbitrary number of 17), I have tried adding the -v option to my grep search in various ways:
zgrep -v -E -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt
zgrep -E -v -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt
zgrep -vE -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt
zgrep -v -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt
Unfortunately all of these attempts don't seem to isolate anything and return only the original fastq file, but in .txt format. Is there something I am missing in the code? Apologies if this is a basic question, I have tried searching this forum and others but can't seem to find an answer.