Hi everybody, I have a set of single-end RNA-seq files. After mapping to the human genome with STAR:
STAR --genomeDir /index --readFilesCommand zcat --readFilesIn file1.fastq.gz --runThreadN 16 --outSAMtype BAM SortedByCoordinate --outWigType bedGraph
I get a very high percentage of reads mapped to multiple loci (75.23%):
Number of input reads | 49078760
Average input read length | 74
UNIQUE READS:
Uniquely mapped reads number | 9806908
Uniquely mapped reads % | 19.98%
Average mapped length | 73.42
Number of splices: Total | 1022832
Number of splices: Annotated (sjdb) | 985106
Number of splices: GT/AG | 1005417
Number of splices: GC/AG | 6850
Number of splices: AT/AC | 496
Number of splices: Non-canonical | 10069
Mismatch rate per base, % | 0.94%
Deletion rate per base | 0.02%
Deletion average length | 2.33
Insertion rate per base | 0.01%
Insertion average length | 1.22
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 36919747
% of reads mapped to multiple loci | 75.23%
Number of reads mapped to too many loci | 186914
% of reads mapped to too many loci | 0.38%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 3.97%
% of reads unmapped: other | 0.45%
I would be grateful for any suggestions.
The first thing to check is whether those are rRNA reads.
How can I check it?
You can find the entire sequence of the human rDNA repeat here. Use it as the reference with bbduk.sh.
Using BBDuk.
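In case it helps, here is a rough sketch of what such a BBDuk screen could look like. The file names (human_rDNA.fa for the rDNA repeat saved as FASTA, the output prefix) and the screen_rrna helper are my own placeholders, not something from this thread, and k=31 is just a commonly used k-mer length rather than a tuned value. Reads sharing a k-mer with the reference go to outm, everything else to outu.

```shell
# Sketch only: file names and the helper name are placeholders (assumptions).
BBDUK="${BBDUK:-bbduk.sh}"   # override if bbduk.sh is not on your PATH

screen_rrna() {
    # $1 = reads (FASTQ, may be gzipped)
    # $2 = rDNA reference in FASTA format
    # $3 = prefix for the output files
    "$BBDUK" in="$1" ref="$2" k=31 \
        outm="$3.rRNA.fastq.gz" \
        outu="$3.clean.fastq.gz" \
        stats="$3.rRNA_stats.txt"
}

# Example call with the file from the original post:
# screen_rrna file1.fastq.gz human_rDNA.fa file1
```

The summary BBDuk prints at the end, together with the stats file, should show what fraction of reads matched the rDNA reference. If that fraction is close to the 75% multi-mapping rate, rRNA is the likely explanation.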
Was there anything suspicious in the FastQC report? Have you checked for rRNA or other contamination?
For RNA-Seq, I'd use the ENCODE settings described in the STAR manual. This parameter set worked well for me in various experiments.
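For reference, this is roughly how the command from the question might look with the ENCODE options added, as I remember them from the STAR manual. Please check the values against the manual for your STAR version. The star_encode wrapper and the STAR_BIN variable are just placeholders of mine. --alignMatesGapMax only matters for paired-end data, but it is part of the set so I left it in.

```shell
# Sketch only: wrapper and variable names are placeholders (assumptions).
STAR_BIN="${STAR_BIN:-STAR}"   # override if STAR is not on your PATH

star_encode() {
    # $1 = gzipped single-end FASTQ file
    # The first three lines repeat the options from the original command;
    # the remaining lines are the ENCODE option set from the STAR manual.
    "$STAR_BIN" --genomeDir /index --readFilesCommand zcat \
        --readFilesIn "$1" --runThreadN 16 \
        --outSAMtype BAM SortedByCoordinate --outWigType bedGraph \
        --outFilterType BySJout \
        --outSAMattributes NH HI AS NM MD \
        --outFilterMultimapNmax 20 \
        --outFilterMismatchNmax 999 \
        --outFilterMismatchNoverReadLmax 0.04 \
        --alignIntronMin 20 \
        --alignIntronMax 1000000 \
        --alignMatesGapMax 1000000 \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 1
}

# Example call with the file from the original post:
# star_encode file1.fastq.gz
```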