Here is one of my result from STAR alignment. As you could see, Uniquely mapped reads: 32.37% and % of reads mapped to multiple loci: 60.74% This sounds not that bad because more than 90% of my reads were aligned well. However, when I process bam file with featureCounts, Successfully assigned alignments rate is 10.1%
Here are more details about commands and program that I used.
- fastq downloaded from SRA: SRP050036 using fasterq-dump (only used SRR1657054 for test)
ENSEMBL GRCh 38 reference file created with STAR
FASTA: http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
GTF: http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz
STAR --runThreadN 64 --runMode genomeGenerate --genomeDir [output folder name] \ --genomeFastaFiles [Homo_sapiens.GRCh38.dna.toplevel.fa path] \ --sjdbGTFfile [Homo_sapiens.GRCh38.104.gtf path]
STAR aligned (version: 2.7.9a)
STAR --runThreadN 32 --genomeDir [ENSEMBL reference file path] \ --readFilesIn [SRR1657054.fastq path] \ --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts \ --outFileNamePrefix [output file name]
Extract count matrix using featureCounts
featureCounts -T 10 -s 0 -a [Homo_sapiens.GRCh38.104.gtf path]\ -o [output file name] \ [SRR1657054Aligned.sortedByCoord.out.bam path]
(I tried -s 1 and -s 2 also, but didn't get better result)
Q1) % of Uniquely mapped reads is 32.37%, and I thought % of successfully assigned alignments will be close to 32.37%, but only 10.1%. Why is this happening? (I used same annotation GTF file)
Q2) How should I deal with high % of reads mapped to multiple loci? According to featureCounts manual,
Multi-mapping reads will also be counted. For a multi-mapping read, all its reported alignments will be counted. The 'NH' tag in BAM/SAM input is used to detect multi-mapping reads.
So if I use -M option then one read may counted more than once? What should I do about it?
The dataset looks like it's total RNA so those reads are probably rRNA... why don't you check where the multi-mappers are aligning to the genome? If so then I would just proceed with unique mappers.