samtools idxstats versus samtools view command
1
0
Entering edit mode
9.2 years ago

Hi,

I have mapped RNA-seq data to the human genome concatenated with a viral genome (26 chromosomes in total) with bowtie and need to get some numbers to calculate FPKM values manually for one viral gene, to retrieve the "total number of reads" which is needed for the denominator in the "FPKM" definition, I tried some different samtools commands on my indexed bam files but I don't really get the difference between them:

Can anyone explain the difference between the output from

samtools view input.bam | awk '{ if ($3=="Chr1") print }' | wc -l

versus..

samtools idxstats input.bam | head -n 26
bam RNA-Seq samtools • 7.5k views
ADD COMMENT
0
Entering edit mode
samtools view input.bam | awk '{ if ($3=="Chr1") print }' | wc -l

faster:

samtools view -c input.bam Chr1
ADD REPLY
1
Entering edit mode

Hmm, that doesn't really answer my question..

ADD REPLY
1
Entering edit mode

that's why it's only a comment.

ADD REPLY
4
Entering edit mode
9.2 years ago

From Manual:

Retrieve and print stats in the index file. The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.

idxstats prints the chr/scaffold name, length and the number of mapped reads.

The awk command above prints the number of reads mapped to Chr1. But idxstats is ultrafast compared to awk. idxstats uses the index file (.bai). hence its fast. Awk reads your entire bam file line by line, hence slow.

Try:

time samtools idxstats input.bam | grep -w "chr1"
time samtools view  input.bam |  awk '{ if ($3=="chr1") print }' | wc -l
time samtools view wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam chr1 | wc -l
179687

real    0m0.352s
user    0m0.340s
sys    0m0.048s
time samtools idxstats wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam | grep -w "chr1"
chr1    249250621    179687    0

real    0m0.055s
user    0m0.052s
sys    0m0.000s
time samtools view  wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam |  awk '{ if ($3=="chr1") print }' | wc -l
179687

real    0m2.371s
user    0m3.352s
sys    0m0.292s
ADD COMMENT
0
Entering edit mode

Thanks for that answer, when I ran the two different commands I got somewhat different results (less number of mapped reads with the idxstats command), should I trust the awk command more as it goes through the file systematically line by line?

ADD REPLY
1
Entering edit mode

If you use "samtools view ..." to count records, make sure you set the flag -F4 to filter out unmapped reads.

ADD REPLY
0
Entering edit mode

I don't think I need that, I use flagstat to get all info..

ADD REPLY
0
Entering edit mode

hi there,

I've run into a similar issue: quite different results (up to 2x the number of reads) using idxstats vs samtools view, with less number of mapped reads with idxstats, and similar to OP I don't really understand why there would be such a large difference and I'm not sure which command to trust.

Could you share more about:

  • how different your mapped reads were/some example data? (were they 2x different)?
  • if you figured out why this is the case?
  • which command would be more reliable or preferred? (is idxstats preferred just because of its speed?)

Thank you so much!

ADD REPLY

Login before adding your answer.

Traffic: 807 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6