Question

Low mapping percentage

0

2.0 years ago

Sib ▴ 70

Hello, Biostars. What steps do you take when STAR reports a mapping percentage below 70% against a human reference genome? I am looking for a general procedure that works for most human samples. Based on my searches, I propose the following procedure, but I know it is incomplete. I would be grateful if you could complete it:

1- If the "Per sequence GC content" plot from FastQC has two or more peaks, my data is probably contaminated. I should BLAST 10-15 unmapped reads to find the source of contamination.

2- If I also have overrepresented sequences, I should BLAST them to find the source of contamination. The source might be rRNA, genomic DNA, or other organisms.

3-.....

Regarding step 2, I don't know what I should do for each contamination source. Should I remove rRNA contamination? What about DNA contamination and contamination from other organisms?
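For step 1, picking the 10-15 unmapped reads could be sketched as below. This is only a sketch under assumptions: it assumes STAR was run with `--outReadsUnmapped Fastx` so that unmapped reads are in a FASTQ file such as `Unmapped.out.mate1`, and the function names (`read_fastq`, `sample_reads`, `to_fasta`) are made up for illustration. It draws a random sample and formats it as FASTA for pasting into BLAST.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read_id, sequence)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (read_id, sequence) from plain 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line
        next(it)  # quality line
        # Drop the leading '@' and anything after the first whitespace.
        yield header[1:].split()[0], seq


def sample_reads(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Reservoir-sample up to k records without loading the whole file."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in records)


# Hypothetical usage (the file name is an assumption):
# with open("Unmapped.out.mate1") as fh:
#     print(to_fasta(sample_reads(read_fastq(fh), k=15)), end="")
```

Sampling at random rather than taking the first reads in the file avoids a bias toward reads from one region of the flow cell.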

mapping STAR RNAseq • 1.4k views

ADD COMMENT • link 2.0 years ago by Sib ▴ 70

1

If you do have contamination (either rRNA from the same species or true contaminants), there is not much you can do about it at this stage. If you are curious about what the contaminants are, you can do all of the above, keeping in mind that it will not change your mapping % result.

Unless the alignment % varies wildly across your pool of samples, you can use the alignment counts you have and move on with differential expression (DE) analysis.
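One way to check whether alignment % varies wildly is to pull the uniquely mapped percentage from each sample's STAR `Log.final.out` and compare it with the pool median. The sketch below assumes the usual `key |<tab>value` layout of that file. The 15-point cutoff and the function names are arbitrary choices for illustration, not a recommended threshold.

```python
import statistics
from typing import Dict, List


def unique_mapped_pct(log_text: str) -> float:
    """Extract 'Uniquely mapped reads %' from the text of a STAR Log.final.out."""
    for line in log_text.splitlines():
        key, sep, value = line.partition("|")
        if sep and key.strip() == "Uniquely mapped reads %":
            return float(value.strip().rstrip("%"))
    raise ValueError("no 'Uniquely mapped reads %' line found")


def flag_outliers(pcts: Dict[str, float], max_dev: float = 15.0) -> List[str]:
    """Return samples whose mapping % is more than max_dev points from the median.

    max_dev is an illustrative assumption; choose it for your own data.
    """
    med = statistics.median(pcts.values())
    return sorted(s for s, p in pcts.items() if abs(p - med) > max_dev)


# Hypothetical usage (paths are assumptions):
# pcts = {name: unique_mapped_pct(open(f"{name}/Log.final.out").read())
#         for name in ["s1", "s2", "s3"]}
# print(flag_outliers(pcts))
```

If nothing is flagged, the samples are at least consistent with each other, which is the case where moving on to DE analysis is suggested above.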

ADD REPLY • link 2.0 years ago by GenoMax 151k

0

Thank you for the reply, GenoMax. If there is not much I can do about the contamination at this stage, what is the benefit of knowing its source?
I'm using RNA-seq data from public databases, so the data is not mine and I cannot redo the experiment to prevent contamination. Can I simply remove the samples with low mapping rates? Or should I continue the analysis, calculate counts for all genes, and then remove the genes related to the contamination?

ADD REPLY • link 2.0 years ago by Sib ▴ 70

1

I've faced similar issues in the past. Both options 1 and 2 seem useful. You can also treat your data as accidental metagenomic data and use a metagenomics classification tool like Kraken2 to query your reads against a database. Having a reasonable idea of which species are likely present in your reads might help you figure out what went wrong.
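Once Kraken2 has produced a report, summarizing it could look like the sketch below. It assumes the standard six-column, tab-separated report (percentage, clade reads, direct reads, rank code, NCBI taxid, indented name) and would need adjusting for reports with extra minimizer columns. The function name `top_taxa` is hypothetical.

```python
from typing import List, Tuple


def top_taxa(report_text: str, rank: str = "S", n: int = 5) -> List[Tuple[str, float]]:
    """Return the n taxa at the given rank code with the highest read percentage."""
    rows: List[Tuple[str, float]] = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # not a standard six-column report line
        pct, _clade_reads, _direct_reads, rank_code, _taxid, name = fields
        if rank_code == rank:
            # Names are indented to show the taxonomy tree; strip that padding.
            rows.append((name.strip(), float(pct)))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n]


# Hypothetical usage (the file name is an assumption):
# with open("sample.kreport") as fh:
#     for name, pct in top_taxa(fh.read()):
#         print(f"{pct:6.2f}%  {name}")
```

A large share of reads assigned to a non-human species, or left unclassified, gives a quick hint about where the unmapped reads came from.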

ADD REPLY • link 2.0 years ago by Dave Carlson ★ 2.1k