Question

Use of "grep" in RNAseq trimming

0

Entering edit mode

14 days ago

Luna_1215 • 0

I am pretty new to the RNAseq analysis environment and have been running human tumour experiments on Illumina Novoseq X (paired end 150bp).

I have found significant polyA presence after removal with Cutadapt of adapter sequences seen in my FASTQC report. The FASTQC, after adapter removal, suggests polyA sequences in ~50% of my reads. These persist despite activating the polyA trimming function within Cutadapt.

Screen shot from post adapter removal FASTQC report

I believe these are causing my mapping rate during alignment with STAR to fail (unique mapping rates ~15%, multi-mapping ~35% and unmapped ~50%).

Looking at some of my sequences, it appears the polyA runs are within my transcripts rather than at the ends. I have pulled out the transcripts with runs of >17 As (17 chosen as an arbitrary number). Code used to do so:

zgrep -E -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore2.txt

An example of the following output sequences from read 1 of one of my samples:

@LH00235:309:22GF2WLT4:6:1101:26887:1042 1:N:0:TNAGGCGA+GNGATCTA
GTNCAGTAGTTTTCGGTGTTTGTTAGTAAAGTTAGGAGGAGGAGGANGTTGCAGTGAGCCAAGATCGCGCCACTGCACTCCAGCCTAGATGACAGAGCAAGACGCTGTCTCAAAAAAAAAAAAAAAAAAAAAACGTCT>
+
II#-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9I9IIIIIIIIIIII-III9>
--
@LH00235:309:22GF2WLT4:6:1101:35699:1056 1:N:0:TNAGGCGA+GCGATCTA
GGNGTTGGGAGGATCATCTGAGCCTGGGAGGTCAAGGCTAAAGTGAGCCATGACGGTGCAGAAAAAAAAAAAAAAAAAACCAAAGTTCCAACTCTCTGGTCCCCCCAAGCCTTTGACTTCTTTTTTTCCCTCCCCAAA>
+
II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIII-I-II-999-99-------9-9-9----9--9-9-9--9---99---I-----9--9-9>
--
@LH00235:309:22GF2WLT4:6:1101:45831:1056 1:N:0:TNAGGCGA+GCGATCTA
GGNTCCGGAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAACCCCCCACCCGCCCCTCCCCCCCCCCCCCCTCAACCCCCCCCCCCCCCCC>
+
I9#IIIIIIIIIIIIIIIIIIIIIIII99III9II9IIIIIII9IIIIIII9IIIII9IIIIII9IIIIIIIIIIIIIIIIIIII-II-III99-9-I--II--II9I9-I-I-9-I9-------999999----99->
--

I want to try and remove these polyA sequences and then try to re-align to troubleshoot and see if this is why my mapping rate is so low. To isolate the remaining sequences with <17 runs of PolyAs (again arbitrary number of 17), I have tried adding the -v option to my grep search in various ways:

    zgrep -v -E -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt

    zgrep -E -v -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt

    zgrep -vE -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt

    zgrep -v -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt

Unfortunately all of these attempts don't seem to isolate anything and return only the original fastq file, but in .txt format. Is there something I am missing in the code? Apologies if this is a basic question, I have tried searching this forum and others but can't seem to find an answer.

RNAseq PolyA • 292 views

ADD COMMENT • link updated 14 days ago by GenoMax 148k • written 14 days ago by Luna_1215 • 0

score 1 · Answer 1 · 2024-12-12

1

Entering edit mode

14 days ago

Pierre Lindenbaum 164k

looks like grep -v is ignored when using -A or -B .eg. see :

 echo "A.B.C.D.X.E.F.G" | tr "." "\n" | grep X -B1 -A1  -v

you could linearize with paste , and exclude the lines with awk

gunzip -c input.fastq.gz | paste - - - - | awk -F '\t' '(!($2 ~ /A{17,}/ ))' |tr "\t" "\n" | gzip > out.fq.gz

ADD COMMENT • link 14 days ago by Pierre Lindenbaum 164k

score 0 · Answer 2 · 2024-12-12

I would use cutadapt, not grep, and play with the parameters.

There are plenty of recipes to try out here to remove poly A https://cutadapt.readthedocs.io/en/v4.9/recipes.html

Also, the As are probably going to have low quality scores, so should be removed if you set the right filters ( I would hope the defaults would be ok to remove these!).

Other trimmers are trimmomatic and bbmaps' bbduk if you don't have much success.

score 0 · Answer 3 · 2024-12-12

0

Entering edit mode

14 days ago

GenoMax 148k

Please use a proper program that is aware of the fastq format. BBMap suite includes polyfilter.sh which will be the best option if you simply want to remove poly-something sequences. See this thread for more: New Illumina error mode, new BBTools release (39.09) to deal with it

ADD COMMENT • link 14 days ago by GenoMax 148k