Question

Rsamtools::pileup gives inconsistent number of reads

2

Entering edit mode

9.7 years ago

inesdesantiago ▴ 270

I have created a fasta file with 300 simulated reads at a given SNP position (reads are 75mer, and I simulate every position in the + and - strand, hence 75x4=300 reads). Then, I map them back to the ref genome using bowtie and then use pileup to see the alignment at the position in question:

bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges("chr10",IRanges(start=100189138, end=100189138)))

quickBamFlagSummary shows that there are 300 reads

> quickBamFlagSummary(bf, param=param)
                                group |    nb of |    nb of | mean / max
                                   of |  records |   unique | records per
                              records | in group |   QNAMEs | unique QNAME
All records........................ A |      300 |      300 |    1 / 1
  o template has single segment.... S |      300 |      300 |    1 / 1
  o template has multiple segments. M |        0 |        0 |   NA / NA
      - first segment.............. F |        0 |        0 |   NA / NA
      - last segment............... L |        0 |        0 |   NA / NA
      - other segment.............. O |        0 |        0 |   NA / NA

Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.

Details for group S:
  o record is mapped.............. S1 |      300 |      300 |    1 / 1
      - primary alignment......... S2 |      300 |      300 |    1 / 1
      - secondary alignment....... S3 |        0 |        0 |   NA / NA
  o record is unmapped............ S4 |        0 |        0 |   NA / NA

Pileup only shows 262 reads (63+62+75+62)... Any idea why? I don't use any threshold on min_base_quality and min_mapq...

> pileup(bf, scanBamParam=param, min_base_quality=0,min_mapq=0)
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    63 chr10:100189138-100189138
2    chr10 100189138      -          A    62 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    62 chr10:100189138-100189138

I looked the region up with IGV, and also, IGV reports 300 reads.

I actually simulate several different positions, and the 262 reads (instead of 300) appear every time.. Not sure why. I am sure pileup is filtering the reads somehow..

Also, the same exact script works fine if I simulate 36mer reads (instead of 75...)! Any ideas why?

pileup rsamtools • 3.5k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by inesdesantiago ▴ 270

1

Entering edit mode

Could you post a BAM file just containing this region somewhere for debugging? Also, what's the output of sessionInfo()?

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

1

Entering edit mode

Actually, note that pileup() takes not only a scanBamParam option, but also a pileupParam option. Given that the latter defaults to some filtering, I wouldn't be surprised if that's the source of the problem.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you! it is solved by changing max_depth parameter within PileupParam()

> pileup(bf, scanBamParam=param, pileupParam=PileupParam(max_depth=8000, min_mapq=0, min_base_quality=0))
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    75 chr10:100189138-100189138
2    chr10 100189138      -          A    75 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    75 chr10:100189138-100189138

I was adding my arguments without PileupParam constructor, this is wrong:

> pileup(bf, scanBamParam=param, min_base_quality=0,min_mapq=0, max_depth=8000)
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    63 chr10:100189138-100189138
2    chr10 100189138      -          A    62 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    62 chr10:100189138-100189138

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by inesdesantiago ▴ 270

0

Entering edit mode

This is what samtools pileup command (from samtools) gives:

samtools mpileup -B -sf $genome.fa -r chr10:100189138-100189138 test_aln75.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr10    100189138    A    300    G$.$g$,$G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,^~G^~.^~g^~,    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Counting in python shows 150 reads in each allele - the reference ("A") and alternative ("G") (75 reads in each strand):

>>> a.count("G")
75
>>> a.count("g")
75
>>> a.count(".")
75
>>> a.count(",")
75

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by inesdesantiago ▴ 270

Ram · Accepted Answer · 2015-03-09

Thank you! it is solved by changing max_depth parameter within PileupParam()

> pileup(bf, scanBamParam=param, pileupParam=PileupParam(max_depth=8000, min_mapq=0, min_base_quality=0))
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    75 chr10:100189138-100189138
2    chr10 100189138      -          A    75 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    75 chr10:100189138-100189138

I was adding my arguments without PileupParam constructor, this is wrong:

> pileup(bf, scanBamParam=param, min_base_quality=0,min_mapq=0, max_depth=8000)
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    63 chr10:100189138-100189138
2    chr10 100189138      -          A    62 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    62 chr10:100189138-100189138