I don't know how to get end-1 vs end-2 alignment counts from STAR logs, I hope someone out there does! In the meantime, the manual says this (emphasis mine):
The statistics are calculated for each read (single- or paired-end)
and then summed or averaged over all reads. Note that STAR counts a
paired-end read as one read, (unlike the samtools flagstat/idxstats,
which count each mate separately).
For splices I am also confused, here is an example I was just working on from Drosophila using 300bp paired-end MiSeq reads. Log.final.out shows this:
Number of splices: Total | 362050
Number of splices: Annotated (sjdb) | 347797
Number of splices: GT/AG | 350494
Number of splices: GC/AG | 1618
Number of splices: AT/AC | 92
Number of splices: Non-canonical | 9846
Lines 3-6 sum to line 1, but that is where the math ceases to work. The manual goes on to say (emphasis mine),
Most of the information
is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not
separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which
would correspond to summing the counts in SJ.out.tab. The mismatch/indel error rates are
calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers
divided by the total number of mapped bases.
I have tried various approaches to counting the values in columns 7 and 8 of SJ.out.tab -- they do not add up to 362050, or 350494, or anything like them. I have also tried a wide variety of approaches to regenerate the splicing numbers in the Log.final.out from Aligned.out.bam, and cannot do it. The very closest I could get was taking primary alignments from unique mappers and counting the total number of splices in the cigars:
samtools view -F 256 Aligned.out.bam | grep "NH:i:1" | cut -f6 | awk '{ while (sub(/N/,"")) j++ } END { print j }'
Which gets 362267 -- not 362050, but very close, closer than the other dozen approaches I tried. Mind you, this is the number of recorded splices, NOT the number of reads. For the reads,
samtools view -F 256 Aligned.out.bam | grep "NH:i:1" | cut -f6 | grep N | wc -l
Which is only 292903.
I hope this provides some insight -- and I hope someone else can give a better answer!
Thanks apa@stowers for the very detailed response, that helps.
Yes, I found the spliced alignment particularly confusing. And, I also tried samtools flagstat for stats on read1/read2 separately, seems it would not give you the number same as in STAR log output. I.e., the total number of reads in each of the two fastqs is 2,679,927, and what samtools flagstat shows is 2746705 for read1 and 2746088 for read2, which are bigger than the total read number :(, I guess that might be largely due to samtools does not differentiate uniquely mapped vs. multi-mapped reads? or something else?
Thanks again apa@stowers, and if someone else has such experience, please advise.