Question

Generate percent of bases with quality > x from BAM file

0

Entering edit mode

9.7 years ago

Andrew ▴ 60

I am interesting in finding the percent of bases with quality > x from a BAM file. I want to do this multiple times for a BAM file. So for example I might want to find percent of bases with quality > 0 and also percent of bases with quality > 50.

I know I could parse data from the BAM file using samtools view, but that is really slow. I could also use samtools mpileup, but again it is slow and generates more information that I need.

Are there any tools out there that can achieve this?

bam • 2.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Andrew ▴ 60

0

Entering edit mode

Did you try FASTQC? I think it can reads BAM files http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Pierre Lindenbaum 166k

Ram · Answer 1 · 2015-07-24

Using my tool bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

The javascript file:

var qual2count={};
var total=0.0;
var maxqual=0;
while(iter.hasNext())
    {
    var rec = iter.next();

    var quals = rec.getBaseQualities();


    for(var i=0;i< quals.length;++i)
        {
        var qual = quals[i];
        if( maxqual <= qual) maxqual=qual+1;
        if(!(qual in qual2count)) qual2count[qual]=0;
        qual2count[qual]++;
        total++;
        }
    }
for(var i=0;i< maxqual;++i)
    {
    var n=0;
    for(var j=i;j< maxqual;++j)
        {
        if(!(j in qual2count)) continue;
        n += qual2count[j];
        }
    out.println("qual("+i+")\t"+(100.0* (n/total)));
    }

Invoke:

$ java -jar dist-1.133/bioalcidae.jar -f input.js samtools-0.1.19/examples/ex1.bam

qual(0)    100
qual(1)    99.77434771044436
qual(2)    99.77434771044436
qual(3)    99.77348971694795
qual(4)    99.75632984701977
qual(5)    99.6036070046589
qual(6)    99.10854475723075
qual(7)    98.96869181731603
qual(8)    98.80138308551622
qual(9)    98.55513895204673
qual(10)    98.30975281207368
qual(11)    97.57788435963656
qual(12)    97.32906624567785
qual(13)    97.01418263249565
qual(14)    96.65039339001811
qual(15)    96.35867560123894
qual(16)    96.01976817015728
qual(17)    95.50154009832606
qual(18)    94.98588600698406
qual(19)    94.39043852047601
qual(20)    93.61996036070046
qual(21)    92.94729345951558
qual(22)    91.90483136137829
qual(23)    89.93316230662971
qual(24)    88.37676210414325
qual(25)    86.75601238942609
qual(26)    83.89889404638313
qual(27)    74.77584919906307
qual(28)    5.675626978747501
qual(29)    0.14671688788598983
qual(30)    0.01801786342459524
qual(31)    0.0025739804892278917

score 2 · Answer 2 · 2015-07-26

If speed is an issue, I think the easiest way to deal with large bam files is to parse a read every n-th. After all once you have collected hundreds of thousands of nearly random reads you have a pretty good approximation of the quality distribution. This simple python script should do the trick:

Example usage, parse one every 100th reads:

samtools view aln.bam | ./baseQualityDistr.py 100

Code:

#!/usr/bin/env python

import sys
import collections

docstring= """
Generate percent of bases with quality > x from BAM file.
Read sam file from stdin. Optionally parse every n-th read.
USAGE
samtools view aln.bam | baseQualityDistr.py [parse-every-nth-read]
"""

if len(sys.argv) > 1 and sys.argv[1] in ['-h', '--help']:
    print(docstring)
    sys.exit()

if len(sys.argv) == 1:
    nth= 1
else:
    nth= int(sys.argv[1])

quals= collections.Counter()
n= 0
for line in sys.stdin:
    n += 1
    if n % nth == 0:
        qstr= line.split('\t')[10]
        for b in qstr:
            quals[b] += 1

q33= dict()
for k in quals:
    q33[ord(k) - 33]= quals[k]

q33sorted= collections.OrderedDict(sorted(q33.items(), reverse= True))
cdf= dict()
cumsum= 0
tot= 0
for k in q33sorted:
    cumsum += q33sorted[k]
    tot += q33sorted[k]
    cdf[k]= cumsum
cdfSorted= collections.OrderedDict(sorted(cdf.items(), reverse= True))

tot= float(tot)
print('Q33\tcumsum\tpct')
for k in cdfSorted:
    print('\t'.join([str(k), str(cdfSorted[k]), str(100 * cdfSorted[k] / tot)]))

sys.exit()