Question

Confused regarding the conflicting information on pre-processing RNA-seq

0

Entering edit mode

8.7 years ago

GLFR ▴ 10

Hello!

I'm new to RNA-seq analysis and I'm trying to make sure that I'm following the best pre-processing for my whole transcriptome sequences. I've been reading up on the pros/cons of trimming RNA-Seq data and I'm finding that some recommend trimming with parameters such as Q20-30, however others call this type of trimming "aggresive" and says this can significanty affect mapping and therefore the downstream analysis. Others recommend if aggresive trimming is carried out, then a minimum length should be used (e.g. L36) and others recommend gentle/no trimming for RNA seq data. I understand there is no "best recipe" for trimming data as it's dependant on the sequence itself, so below I've given a general outline of my FastQC output for my 9 samples and I was hoping I could have some recomendations/pointers?

My R_1 sequences have more warnings and fails than my R_2 sequences. R1 generally fails or has warnings on per base sequence content, sequence duplication levels, overrepresented sequences and Kmer Content. Theres the typical illumina 3' end read decline in quality and 5' end dip in quality due to random hexamer priming (I'm guessing). In one sample I do have some contamination that's come up as an overrepresented sequence which when blasted brings up "Synthetic construct RNA control ERCC-00004 that obviously needs to be removed, all other overrepresented sequences appear to be transcription-related. All of my samples fail on Kmer content and either fail or give warning per-base sequence content. R2 generally has the same issues as R1 except theres no overrepresented sequences and the per base sequence quality tends to always remain in the "green zone" with phred scores of 28+ despite the 3' and 5' dip. All sequences pass on "Adaptor content" but I can see on the chart that there is a small amount (about 2%) of illumina universal adaptor sequence present from about 50bp onwards.

Just to see what would happen without trimming I mapped my reads to a reference genome using TopHat and got some pretty poor results back ( 70.1% overall read mapping rate and 50.2% concordant pair alignment rate). I understand rRNA contamination can cause some problems, so I used SeqMonk to check the TopHat output and it didn't detect any rRNA sequences. I understand using "CollectRNASeqMetrics" from picard tools would also be a good idea but I'm struggling to find a way to convert the genome annotation file to the required refFlat format as I'm working with a non-UCSC genome (A.thaliana).

Sequence info: Illumina HiSeq PE reads 2X100bp

I was thinking of using Trimmomatic for any trimming. I know I need to remove that contaminant sequence and the small amount of adaptor sequence but beyond that I'm a little unsure of the most appropriate course of action and would very much appreciate any advice.

Thank you very much in advance,

Gill

RNA-Seq QC analysis • 3.2k views

ADD COMMENT • link 8.7 years ago by GLFR ▴ 10

0

Entering edit mode

Pre-processing recommendations will probably vary according to which downstream analysis you will perform, and this you didn't say.

For gene expression, I would recommend adapter trimming, lenient or no quality trimming, rRNA contamination estimation. What is the cause of your discordant alignment rate - bad orientation, mapping to different contigs? Are you mapping to genome or transcriptome?

For mapping evaluation, an easy to use tool is QualiMap.

ADD REPLY • link 8.7 years ago by h.mon 35k

0

Entering edit mode

Hi h.mon,

Thank you so much for your help. You guessed correctly, I am indeed doing comparative gene expression analysis, apologies for not specifying. I will take a look at QualiMap, thank you for the suggestion. I'm honestly not sure of the reason behind the discordant alignment rate and I don't know how to check. Would QualiMap help here? I simply mapped my reads to tophat and looked at the alignment stats. I then used SeqMonk to look for rRNA contamination and it reported there was none present in the alignment. I was mapping to the genome, specifically TAIR10.

Once again, I'm very grateful for your help. I'm currently in the learning process as an MRes student.

Gill

ADD REPLY • link 8.6 years ago by GLFR ▴ 10

0

Entering edit mode

Just remind if you add ERCC-spikein in your library preparation (guessing from your blast result), you need to map reads to both transcriptome and ERCC-spikein. Also, what's the deduplicated percentage of sequences reported by Fastqc?

ADD REPLY • link 8.7 years ago by GZ1995 ▴ 410

0

Entering edit mode

Hi eldronzhou,

Thank you very muh for your help. Unfortunately I wasn't involved in the library prep I was simply handed the RNA-Seq data to QC and map to a reference genome, so I assume from the BLAST results that's what happened. Would mapping to the genome and ERCC-spikein be O.K or would it have to be the transcriptome specifically?

The deduplicated percentage varies by sample, most of which areabout 30% but it ranges from 18.6-58.25%.

Once again, thank you very much I'm very grateful for your help,

Gill

ADD REPLY • link 8.6 years ago by GLFR ▴ 10

0

Entering edit mode

I usually map to transcriptome plus ERCC-spikein. But it is OK to map to the transcriptome if you don't need ERCC-spikein for normalization or other purpose. The deduplication percentage is a little bit low from my experience, which means either you have some highly expressed genes dominate your library or there is an issue of PCR over amplification. The latter one may due to too many PCR cycles for lowly input RNA. I'm not sure whether this is the cause of your problem. As suggested by h.mon, you can use samtools flagstat to see what happens to your disconcordant reads.

ADD REPLY • link 8.6 years ago by GZ1995 ▴ 410

0

Entering edit mode

Brilliant I'll give that a go. Thank you so much for your help and insight, I really appreciate it.

ADD REPLY • link 8.6 years ago by GLFR ▴ 10