Should I exclude unmapped reads and proceed with the analysis for low mapped samples, or omit them altogether?
10 months ago
Sib ▴ 60

I am running an RNA-seq analysis on raw reads from Solanum lycopersicum (tomato). I am aligning the reads of 24 samples with STAR to the reference genome obtained from Ensembl. Most samples reach mapping rates above 90%, and all are above 86% except three, which map at 25.7%, 7.5%, and 11.3%. STAR reports the unmapped reads in these samples as "too short", which, based on my research, seems to be related to rRNA contamination.

This is peculiar because, as far as I know, samples with rRNA contamination typically exhibit more than one peak in the Per Sequence GC content plot. However, my samples only show a single peak and pass this test!

Regardless, my primary goal is to conduct a differential expression analysis. It's not possible for me to redo sequencing. I am uncertain whether I can exclude unmapped reads from the BAM file and proceed with the analysis for these three samples, or if I should omit them from the analysis altogether.
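
For reference, per-sample rates like the ones quoted above can be collected from STAR's `Log.final.out` summary. The following is a minimal sketch, assuming the usual `metric | value` layout of that file; the helper names (`parse_star_log`, `percent`) and the numbers in the embedded excerpt are illustrative only and are not taken from these samples.

```python
def parse_star_log(text):
    """Return {metric name: value string} from the text of a STAR Log.final.out file."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '25.70%' to the float 25.7."""
    return float(stats[key].rstrip("%"))


# Illustrative excerpt only: these numbers are made up for the example.
example_log = """\
                          Number of input reads |\t20000000
                      UNIQUE READS:
                        Uniquely mapped reads % |\t25.70%
                      UNMAPPED READS:
                 % of reads unmapped: too short |\t70.10%
                     % of reads unmapped: other |\t0.40%
"""

stats = parse_star_log(example_log)
unique_pct = percent(stats, "Uniquely mapped reads %")
too_short_pct = percent(stats, "% of reads unmapped: too short")
print(f"uniquely mapped: {unique_pct}%  unmapped (too short): {too_short_pct}%")
```

In practice the text would come from reading each sample's own `Log.final.out`, so that all 24 samples can be tabulated side by side.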

STAR mapping RNAseq • 682 views

The samples with low mapping rates can be discarded as they are likely to be contaminated.

10 months ago
dthorbur ★ 2.6k

There are a few steps I normally take before deciding to remove a sample, but these three low-mapping samples are certainly good candidates.

I would look at the distribution of samples using some form of ordination analysis (e.g., PCA or NMDS), try to identify what the unmapped reads are using tools like Kraken2 or BLAST, and check how many reads still map compared with the other samples.

If the low-mapping samples cluster away from their replicates, it's pretty easy to justify removal. The same goes if they have only a small proportion of the mapped reads compared to the rest of your data, as normalisation steps would then distort your other samples. Identifying what the unmapped reads likely are is more to help understand what went wrong (failed ribo-depletion, contamination, etc.). All good things to know for future sample processing.
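
As a rough illustration of the replicate check described above, here is a minimal sketch that compares mapped-read totals and then correlates depth-scaled expression profiles between samples. It uses pairwise Pearson correlation on log2 counts-per-million as a simple stand-in for a full ordination such as PCA or NMDS. The sample names and counts are hypothetical toy values, and the functions `log_cpm` and `pearson` are helpers written for this example.

```python
import math


def log_cpm(counts):
    """log2 counts-per-million (pseudocount of 1) for one sample."""
    total = sum(counts)
    return [math.log2(c / total * 1e6 + 1) for c in counts]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical toy counts; genes are in the same order for every sample.
counts = {
    "rep1": [500, 20, 0, 1500, 80, 300],
    "rep2": [520, 25, 1, 1400, 90, 310],
    "low_map": [150, 6, 0, 450, 25, 95],  # roughly 30% of the depth
}

# Step 1: how many reads are still mapping in each sample?
mapped = {name: sum(c) for name, c in counts.items()}
print(mapped)

# Step 2: do the depth-scaled profiles agree between replicates?
profiles = {name: log_cpm(c) for name, c in counts.items()}
names = list(counts)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(pearson(profiles[a], profiles[b]), 3))
```

A low-mapping sample that correlates with its replicates about as well as they correlate with each other is a weaker candidate for removal than one that stands apart, though with real data the same comparison should be made on the full gene set.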


Thank you for your response. However, PCA is typically plotted using normalized data. The low-mapping sample has about 6 million uniquely mapped reads, while its replicates have about 20 million. Could normalizing across such different depths give a misleading result? And if I normalize, generate a PCA plot, and the low-mapping samples do not cluster apart from their replicates, can I rely on this result and retain them?
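
To make the depth question concrete, here is a minimal sketch of median-of-ratios size factors, the idea behind the normalization used by DESeq2. It is a simplified re-implementation for illustration, not the package's own code. The toy counts are hypothetical and were chosen so that one sample has exactly 0.3 times the depth of a replicate, mirroring the 6 million versus 20 million reads mentioned above.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors; counts maps sample name -> list of gene counts."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Keep only genes with non-zero counts everywhere so the geometric mean is defined.
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples) for g in usable
    }
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo_mean[g] for g in usable))
        for s in samples
    }


# Hypothetical toy counts: same expression profile, different sequencing depth.
counts = {
    "rep1": [600, 30, 1500, 90, 300],
    "rep2": [1200, 60, 3000, 180, 600],  # 2x the depth of rep1
    "low_map": [180, 9, 450, 27, 90],    # 0.3x the depth of rep1
}

sf = size_factors(counts)
normalized = {s: [c / sf[s] for c in counts[s]] for s in counts}
print({s: round(f, 3) for s, f in sf.items()})
print({s: [round(v, 1) for v in vals] for s, vals in normalized.items()})
```

In this idealized case the size factors absorb the depth difference completely. What the sketch does not show is the extra sampling noise that lowly expressed genes carry in a shallow library, which scaling cannot remove.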


You can just use presence/absence data on which transcripts have reads mapped, or even run PCA on raw data, though the first few PCs may then explain abundance differences more than anything else.
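
A minimal sketch of the presence/absence idea: each sample is reduced to the set of transcripts with at least one mapped read, and samples are compared by Jaccard distance. This is a simpler stand-in for running PCA on the binary matrix. The detection threshold `min_count`, the sample names, and the counts are assumptions made for the example.

```python
def detected(counts, min_count=1):
    """Indices of transcripts with at least min_count mapped reads."""
    return {g for g, c in enumerate(counts) if c >= min_count}


def jaccard_distance(a, b):
    """1 - |intersection| / |union|; 0 means identical sets of detected transcripts."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


# Hypothetical toy counts; transcripts are in the same order for every sample.
counts = {
    "rep1": [500, 20, 0, 1500, 3, 300],
    "rep2": [520, 25, 1, 1400, 2, 310],
    "low_map": [150, 6, 0, 450, 0, 95],
}

present = {name: detected(c) for name, c in counts.items()}
names = list(counts)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(jaccard_distance(present[a], present[b]), 3))
```

Presence/absence is not fully independent of depth, since a shallow library tends to miss weakly expressed transcripts, so a modest distance from replicates may reflect coverage as much as biology.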
