Question

How To Calculate The Coverage Of A Specific Interval Based On Data From A Bam File

1

Entering edit mode

13.0 years ago

rehma.ar ▴ 290

Dear All!

i want to calculate the coverage of a specific fragment from a bam file. i am using this script right now

samtools view -u infile.bam chromosome:76437781-76439092 | samtools pileup - -cf ref.fasta | awk '{if ($2 >= 76437781 || $2 <= 76439092) print }' | awk '{ count++ ; SUM += $8 } END { if (count==0)print SUM  "\t" count "\t" "0";else print "\t" SUM "\t" count "\t" SUM/count }' > out.file

out put is something like that:

coverage 16418 1501 10.938

"76437781-76439092" i am giving these coordinates manually in the shell. i know i can have variables in shell but i want to use a bed file to give the coordinates and put it in the loop to get the coverage of all my fragments in one file. i think i have to do this in perl,this looks quite difficult to me to use three files 1)bam file 2)refrence fasta 3)bed file with coordinates can i do this in perl. or any other easier way.

thanks in advance

samtools awk perl • 9.1k views

ADD COMMENT • link updated 13.0 years ago by Sukhi Singh 11k • written 13.0 years ago by rehma.ar ▴ 290

Obi Griffith · Answer 1 · 2012-08-28

7

Entering edit mode

13.0 years ago

Sukhi Singh 11k

I calculate coverage using genomeCoverageBed.

So, if your file has read positions of which coverage has to be calculated in a file like

cat reads
chr1:3000239-3001041
chr1:3001239-3003041

and you want to calculate coverage from a bam (mapped reads) file, you can use this pipe, I just made and tested.

while read line; 
do samtools view -b a.bam $line | \
bamToBed -i stdin | \
genomeCoverageBed -bg -i stdin -g ~/src/useFul/ucsc/genomeIndex/mm9.chrom ;done < reads > coverage

We are reading the

chromosome co-ordinates line by line,
pulling the respective data out of bam file,
converting them to a bed file and
calculating the coverage of that.

You need to have samtools, bedtools installed for it. You also need the organism's genome co-ordinates and the bam file needs to be indexed. Index using samtools index file.bam

Cheers

ADD COMMENT • link updated 13.0 years ago by Obi Griffith 20k • written 13.0 years ago by Sukhi Singh 11k

0

Entering edit mode

sorry if i did not conveyed myself properly. i got your point but you are getting the coverage while i want to calculate the average coverage of the whole fragment between the coordinates as you can see in the question i am using awk '{ count++ ; SUM += $8 to do this. that i think you can not do with genomeCoverageBed.

ADD REPLY • link 13.0 years ago by rehma.ar ▴ 290

0

Entering edit mode

Check this post, might help you, in that case your script is fine, just reformat a bit using loop or try in R. http://www.biostars.org/post/show/5165/tools-to-calculate-average-coverage-for-a-bam-file/

ADD REPLY • link 13.0 years ago by Sukhi Singh 11k

score 2 · Answer 2 · 2012-08-28

2

Entering edit mode

13.0 years ago

Johan ▴ 890

The DepthOfCoverage Walker in the gatk supports this. Here's a usage example:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R [reference fasta] -I [bam file] -L [bed file]

ADD COMMENT • link 13.0 years ago by Johan ▴ 890

0

Entering edit mode

Hi Johan, I am having trouble using this over a provided interval file. Although it runs fine if I omit the -L file. The error that I get is: A USER ERROR has occurred (version 1.3-20-g447e9bf):

<h5>ERROR The invalid arguments or inputs must be corrected before the GATK can proceed</h5> <h5>ERROR Please do not post this error to the GATK forum</h5> <h5>ERROR</h5> <h5>ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.</h5> <h5>ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki</h5> <h5>ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa</h5> <h5>ERROR</h5> <h5>ERROR MESSAGE: File associated with name /Users/Desktop//myinterval.bed is malformed: Interval file could not be parsed in any supported format. caused by BED files must be parsed through Tribble; parsing them as intervals through the GATK engine is no longer supported</h5>

The format of the file is: chr1:XXXX-XXXXXX also tried 1:XXXXX-XXXXXX

Do you know what the problem might be? Thanks!

ADD REPLY • link 12.5 years ago by datanerd ▴ 520

0

Entering edit mode

You seem to be using a very old version of the GATK - 1.3-20-g447e9bf. The current version is 2.3-9, try downloading and running with the latest version and see if it works. You can download it from here: http://www.broadinstitute.org/gatk/download

ADD REPLY • link 12.5 years ago by Johan ▴ 890