Use of "grep" in RNAseq trimming
3
0
Entering edit mode
14 days ago
Luna_1215 • 0

I am pretty new to the RNAseq analysis environment and have been running human tumour experiments on Illumina Novoseq X (paired end 150bp).

I have found significant polyA presence after removal with Cutadapt of adapter sequences seen in my FASTQC report. The FASTQC, after adapter removal, suggests polyA sequences in ~50% of my reads. These persist despite activating the polyA trimming function within Cutadapt.

Screen shot from post adapter removal FASTQC report

I believe these are causing my mapping rate during alignment with STAR to fail (unique mapping rates ~15%, multi-mapping ~35% and unmapped ~50%).

Looking at some of my sequences, it appears the polyA runs are within my transcripts rather than at the ends. I have pulled out the transcripts with runs of >17 As (17 chosen as an arbitrary number). Code used to do so:

zgrep -E -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore2.txt

An example of the following output sequences from read 1 of one of my samples:

@LH00235:309:22GF2WLT4:6:1101:26887:1042 1:N:0:TNAGGCGA+GNGATCTA
GTNCAGTAGTTTTCGGTGTTTGTTAGTAAAGTTAGGAGGAGGAGGANGTTGCAGTGAGCCAAGATCGCGCCACTGCACTCCAGCCTAGATGACAGAGCAAGACGCTGTCTCAAAAAAAAAAAAAAAAAAAAAACGTCT>
+
II#-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9I9IIIIIIIIIIII-III9>
--
@LH00235:309:22GF2WLT4:6:1101:35699:1056 1:N:0:TNAGGCGA+GCGATCTA
GGNGTTGGGAGGATCATCTGAGCCTGGGAGGTCAAGGCTAAAGTGAGCCATGACGGTGCAGAAAAAAAAAAAAAAAAAACCAAAGTTCCAACTCTCTGGTCCCCCCAAGCCTTTGACTTCTTTTTTTCCCTCCCCAAA>
+
II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIII-I-II-999-99-------9-9-9----9--9-9-9--9---99---I-----9--9-9>
--
@LH00235:309:22GF2WLT4:6:1101:45831:1056 1:N:0:TNAGGCGA+GCGATCTA
GGNTCCGGAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAACCCCCCACCCGCCCCTCCCCCCCCCCCCCCTCAACCCCCCCCCCCCCCCC>
+
I9#IIIIIIIIIIIIIIIIIIIIIIII99III9II9IIIIIII9IIIIIII9IIIII9IIIIII9IIIIIIIIIIIIIIIIIIII-II-III99-9-I--II--II9I9-I-I-9-I9-------999999----99->
--

I want to try and remove these polyA sequences and then try to re-align to troubleshoot and see if this is why my mapping rate is so low. To isolate the remaining sequences with <17 runs of PolyAs (again arbitrary number of 17), I have tried adding the -v option to my grep search in various ways:

    zgrep -v -E -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt

    zgrep -E -v -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt

    zgrep -vE -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt

    zgrep -v -B 1 -A 2 'A{17,}' PILOT_Ar1_1012.fq.gz > PolyA_17ormore3.txt

Unfortunately all of these attempts don't seem to isolate anything and return only the original fastq file, but in .txt format. Is there something I am missing in the code? Apologies if this is a basic question, I have tried searching this forum and others but can't seem to find an answer.

RNAseq PolyA • 292 views
ADD COMMENT
1
Entering edit mode
14 days ago

looks like grep -v is ignored when using -A or -B .eg. see :

 echo "A.B.C.D.X.E.F.G" | tr "." "\n" | grep X -B1 -A1  -v

you could linearize with paste , and exclude the lines with awk

gunzip -c input.fastq.gz | paste - - - - | awk -F '\t' '(!($2 ~ /A{17,}/ ))' |tr "\t" "\n" | gzip > out.fq.gz
ADD COMMENT
0
Entering edit mode
14 days ago

I would use cutadapt, not grep, and play with the parameters.

There are plenty of recipes to try out here to remove poly A https://cutadapt.readthedocs.io/en/v4.9/recipes.html

Also, the As are probably going to have low quality scores, so should be removed if you set the right filters ( I would hope the defaults would be ok to remove these!).

Other trimmers are trimmomatic and bbmaps' bbduk if you don't have much success.

ADD COMMENT
0
Entering edit mode
14 days ago
GenoMax 148k

Please use a proper program that is aware of the fastq format. BBMap suite includes polyfilter.sh which will be the best option if you simply want to remove poly-something sequences. See this thread for more: New Illumina error mode, new BBTools release (39.09) to deal with it

ADD COMMENT

Login before adding your answer.

Traffic: 2034 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6