I have mapped RNA-seq data to the human genome concatenated with a viral genome (26 chromosomes in total) with bowtie and need to get some numbers to calculate FPKM values manually for one viral gene, to retrieve the "total number of reads" which is needed for the denominator in the "FPKM" definition, I tried some different samtools commands on my indexed bam files but I don't really get the difference between them:
Can anyone explain the difference between the output from
Retrieve and print stats in the index file. The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.
idxstats prints the chr/scaffold name, length and the number of mapped reads.
The awk command above prints the number of reads mapped to Chr1. But idxstats is ultrafast compared to awk. idxstats uses the index file (.bai). hence its fast. Awk reads your entire bam file line by line, hence slow.
Try:
time samtools idxstats input.bam | grep -w "chr1"
time samtools view input.bam | awk '{ if ($3=="chr1") print }' | wc -l
time samtools view wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam chr1 | wc -l
179687
real 0m0.352s
user 0m0.340s
sys 0m0.048s
time samtools idxstats wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam | grep -w "chr1"
chr1 249250621 179687 0
real 0m0.055s
user 0m0.052s
sys 0m0.000s
time samtools view wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam | awk '{ if ($3=="chr1") print }' | wc -l
179687
real 0m2.371s
user 0m3.352s
sys 0m0.292s
Thanks for that answer, when I ran the two different commands I got somewhat different results (less number of mapped reads with the idxstats command), should I trust the awk command more as it goes through the file systematically line by line?
I've run into a similar issue: quite different results (up to 2x the number of reads) using idxstats vs samtools view, with less number of mapped reads with idxstats, and similar to OP I don't really understand why there would be such a large difference and I'm not sure which command to trust.
Could you share more about:
how different your mapped reads were/some example data? (were they 2x different)?
if you figured out why this is the case?
which command would be more reliable or preferred? (is idxstats preferred just because of its speed?)
faster:
Hmm, that doesn't really answer my question..
that's why it's only a comment.