Question

samtools idxstats versus samtools view command

0

Entering edit mode

9.9 years ago

frida.danielsson ▴ 50

Hi,

I have mapped RNA-seq data to the human genome concatenated with a viral genome (26 chromosomes in total) with bowtie and need to get some numbers to calculate FPKM values manually for one viral gene, to retrieve the "total number of reads" which is needed for the denominator in the "FPKM" definition, I tried some different samtools commands on my indexed bam files but I don't really get the difference between them:

Can anyone explain the difference between the output from

samtools view input.bam | awk '{ if ($3=="Chr1") print }' | wc -l

versus..

samtools idxstats input.bam | head -n 26

bam RNA-Seq samtools • 8.1k views

ADD COMMENT • link updated 2.4 years ago by Samantha • 0 • written 9.9 years ago by frida.danielsson ▴ 50

0

Entering edit mode

samtools view input.bam | awk '{ if ($3=="Chr1") print }' | wc -l

faster:

samtools view -c input.bam Chr1

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

Hmm, that doesn't really answer my question..

ADD REPLY • link 9.9 years ago by frida.danielsson ▴ 50

1

Entering edit mode

that's why it's only a comment.

ADD REPLY • link 9.9 years ago by Pierre Lindenbaum 166k

Ram · Answer 1 · 2015-09-09

4

Entering edit mode

9.9 years ago

GouthamAtla 12k

From Manual:

Retrieve and print stats in the index file. The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.

idxstats prints the chr/scaffold name, length and the number of mapped reads.

The awk command above prints the number of reads mapped to Chr1. But idxstats is ultrafast compared to awk. idxstats uses the index file (.bai). hence its fast. Awk reads your entire bam file line by line, hence slow.

Try:

time samtools idxstats input.bam | grep -w "chr1"
time samtools view  input.bam |  awk '{ if ($3=="chr1") print }' | wc -l

time samtools view wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam chr1 | wc -l
179687

real    0m0.352s
user    0m0.340s
sys    0m0.048s

time samtools idxstats wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam | grep -w "chr1"
chr1    249250621    179687    0

real    0m0.055s
user    0m0.052s
sys    0m0.000s

time samtools view  wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam |  awk '{ if ($3=="chr1") print }' | wc -l
179687

real    0m2.371s
user    0m3.352s
sys    0m0.292s

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by GouthamAtla 12k

0

Entering edit mode

Thanks for that answer, when I ran the two different commands I got somewhat different results (less number of mapped reads with the idxstats command), should I trust the awk command more as it goes through the file systematically line by line?

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.9 years ago by frida.danielsson ▴ 50

1

Entering edit mode

If you use "samtools view ..." to count records, make sure you set the flag -F4 to filter out unmapped reads.

ADD REPLY • link 9.9 years ago by dariober 15k

0

Entering edit mode

I don't think I need that, I use flagstat to get all info..

ADD REPLY • link 9.9 years ago by frida.danielsson ▴ 50

0

Entering edit mode

hi there,

I've run into a similar issue: quite different results (up to 2x the number of reads) using idxstats vs samtools view, with less number of mapped reads with idxstats, and similar to OP I don't really understand why there would be such a large difference and I'm not sure which command to trust.

Could you share more about:

how different your mapped reads were/some example data? (were they 2x different)?
if you figured out why this is the case?
which command would be more reliable or preferred? (is idxstats preferred just because of its speed?)

Thank you so much!

ADD REPLY • link 2.4 years ago by Samantha • 0