high % of reads mapped to multiple loci after STAR mapping
1
4
Entering edit mode
7.1 years ago
Lila M ★ 1.3k

Hi everybody, I have a set of RNA-seq (single end files). After mapping with STAR (himan genome)

STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file1.fastq.gz --runThreadN 16 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph

I have a huge % of reads mapped to multiple loci

                          Number of input reads |   49078760
                      Average input read length |   74
                                    UNIQUE READS:
                   Uniquely mapped reads number |   9806908
                        Uniquely mapped reads % |   19.98%
                          Average mapped length |   73.42
                       Number of splices: Total |   1022832
            Number of splices: Annotated (sjdb) |   985106
                       Number of splices: GT/AG |   1005417
                       Number of splices: GC/AG |   6850
                       Number of splices: AT/AC |   496
               Number of splices: Non-canonical |   10069
                      Mismatch rate per base, % |   0.94%
                         Deletion rate per base |   0.02%
                        Deletion average length |   2.33
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.22
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   36919747
             % of reads mapped to multiple loci |   75.23%
        Number of reads mapped to too many loci |   186914
             % of reads mapped to too many loci |   0.38%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   3.97%
                     % of reads unmapped: other |   0.45%

I would be grateful for any suggestions.

RNA-Seq STAR single-end • 19k views
ADD COMMENT
3
Entering edit mode

First thing to check is if those are rRNA reads.

ADD REPLY
0
Entering edit mode

How can I check it?

ADD REPLY
2
Entering edit mode

You can find entire sequence of human rDNA repeat here. Use it with bbduk.sh.

ADD REPLY
1
Entering edit mode

Using BBDuk.

ADD REPLY
1
Entering edit mode

Was anything suspicious in the FastQC report? Have you checked for rRNA or other contamination?

For RNA-Seq, I'd use the ENCODE settings described in the STAR manual. This parameter set worked well for me in various experiments.

ADD REPLY
4
Entering edit mode
7.1 years ago
Hussain Ather ▴ 990

Reads that map to more than 10 loci are counted as mapping to too many loci. You can change --outFilterMultimapNmax to increase the threshold. Make sure to also increase winAnchorMultimapNmax if you increase it to greater than 50.

ADD COMMENT
0
Entering edit mode

And that may be too many to begin with. OP is not trying to make the multi-mapping go away :)

ADD REPLY
0
Entering edit mode

I've added the options:

STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file.fastq.gz  --runThreadN 16 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --alignIntronMax 500000 --alignMatesGapMax 1000000 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph

But the results are the same (any other parameter to include?) ... so I will try with BBDuk...what options do I have if those are rRNA?

ADD REPLY
2
Entering edit mode

You can to ignore those reads (they won't be counted by featureCounts/STAR, if an entry is not in your GTF and/or since they are multi-mapped). That said, if you have 75+% rRNA then it may be time to start thinking about doing this experiment over.

ADD REPLY
1
Entering edit mode

There are more than 20 (though I don't know how many more) rRNA positions on the assembled human genome. I got rRNA mappings when I set --outFilterMultimapNmax 200, though it is possible you will get them with less than 200. Did you check for rRNA contamination with BBDuk, anyway?

P.S.: I completely agree with genomax.

That said, if you have 75+% rRNA then it may be time to start thinking about doing this experiment over.

ADD REPLY
0
Entering edit mode

yes, I did

bbduk.sh in1=file.fastq.gz outm=ribo.fa outu=nonribo.fa ref=human_ribosomal.fa

been human_ribosomal.fa

and I get:

Input:                      49078760 reads      3667028636 bases.
Contaminants:               38486994 reads (78.42%)     2895991336 bases (78.97%)
Total Removed:              38486994 reads (78.42%)     2895991336 bases (78.97%)
Result:                     10591766 reads (21.58%)     771037300 bases (21.03%)

So for sure they are rRNA right?

ADD REPLY
0
Entering edit mode

Ouch! Are all samples like this or just this one?

ADD REPLY
0
Entering edit mode

All of them I think :( How bad is to work with ~10.000.000 reads?

ADD REPLY
1
Entering edit mode

No harm in giving it a try. Don't expect to find anything significant if the difference is expected to be subtle.

BTW: Were these samples not ribo-depleted or that process did not work well? Do all samples have similar level of rRNA?

ADD REPLY
0
Entering edit mode

Yes, all of them have a similar contamination level because of the kit (I think). One more question, regarding the command line

bbduk.sh in1=file.fastq.gz outm=ribo.fa outu=nonribo.fa ref=human_ribosomal.fa

which is better, working with the file.fastq.gz (for STAR/salmon) or use directly nonribo.fa to be sure that all the contamination has been removed?

ADD REPLY
1
Entering edit mode

Just to make sure. This isn't an experiment about rRNA correct? Otherwise we would be wrong.

If you have already done STAR alignments you could ignore rRNA counts since the alignment % more or less appears to match bbduk's evaluation. You could use nonribo.fa, if you were going to redo the alignments.

ADD REPLY
0
Entering edit mode

Nop, is an experiment about nascent RNA :) . Thank you very much for all the information and all the tips that have been very very useful!!!

ADD REPLY
0
Entering edit mode

That may need to include rRNA. If you are just analyzing the data you may want to check with the folks who generated the data.

ADD REPLY
0
Entering edit mode

Yes, they agreed about rRNA contamination as well....

ADD REPLY
1
Entering edit mode

Then you are all set for the next step.

ADD REPLY
0
Entering edit mode

Regarding this answer, apparently STAR has not ignore the rRNA, because after generating the bam file and run samtools view Aligned.sortedByCoord.out.bam | cut -f1 |sort | uniq |wc -l I get a total of 44795631 that I assume are the mapped reads. So STAR is not ignoring at all the rRNA, rigth?

ADD REPLY
0
Entering edit mode

STAR outputs unmapped reads to the bam file, and your samtools view is showing all reads from the bam.

ADD REPLY
0
Entering edit mode

so how can I get only the mapped reads?

ADD REPLY
1
Entering edit mode

STAR appears to include unmapped reads in the BAM (unless --outReadsUnmapped option is used). So follow @h.mon's suggestions below to filter your files, if you need to separate the mapped reads.

ADD REPLY
0
Entering edit mode

I think so, because I've used the Genome sequence (GRCh38.p10) and Comprehensive gene annotation (gencode.v27.annotation.gtf)

ADD REPLY
1
Entering edit mode

See my edits to the post above.

You could remove the lines for rRNA from the GTF file when you do the counting with featureCounts or if you used STAR to get counts directly then the relevant lines for rRNA counts could be deleted from the count matrix.

ADD REPLY
0
Entering edit mode

I think that is easier if I use the non_ribosomal.fastq.gz output from bbduk to do the mapping and perform the downstream analysis.

ADD REPLY
1
Entering edit mode

That would work. Decontaminated datasets are much smaller so the alignments would complete faster. Consider using the option for unmapped reads if you don't want them in your alignments but they are not really doing any harm, as far as counting goes.

ADD REPLY
0
Entering edit mode

I've tried, but as I am using the whole genome, the rRNA should be included in the gtf because the size is exactly the same for both bam files.

ADD REPLY
1
Entering edit mode

If you use decontaminated files (non_ribo.fq.gz) then you should have only 25% of the original reads in them so the resulting BAM's should be smaller, even when unaligned reads are included. rRNA should not matter now since most reads would not align to it. Am I missing something?

ADD REPLY
0
Entering edit mode

sorry, my mistake. I've run STAR with --outReadsUnmapped in the original fastq.gz file, and the result is the same I get when run STAR without that parameter, so this is why I'm thinking that the genome that I am using for mapping should include rRNA. I hope a much smaller bam file for non_ribo.fq.gz

ADD REPLY
1
Entering edit mode

That is not making sense in terms of size of the BAM. They should be smaller with decontaminated data. Did you get nothing in the unmapped files? That means most (all?) of your decontaminated data maps.

ADD REPLY
1
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 1964 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6