Question

Percentage Of Exons Covered

0

Entering edit mode

11.4 years ago

hellbio ▴ 520

Hi all,

I have a bed file with the start and end positions of all the exons. How can we find out, what percentage of these exons are covered in the sequenced sample?

Any help would be highly appreciated.

Thanks

exon coverage • 5.7k views

ADD COMMENT • link updated 11.4 years ago by Jeremy Leipzig 22k • written 11.4 years ago by hellbio ▴ 520

0

Entering edit mode

have you tried bedtools and bedops?

ADD REPLY • link 11.4 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2013-06-25

2

Entering edit mode

11.4 years ago

Irsan ★ 7.8k

It depends a lot on how you define "covered". My suggestions:

An exon is covered when all bases have at least 1 read mapped to it. Or in other words, uncovered exons are those were there are 1 or more bases that have no reads mapped to them. In this case, use bedtools coverageBed -abam file.bam -b exons.bed
An exon is covered when the bases have, on average, N reads (I use 5 or more reads) or more mapped to them, where you can choose the threshold N yourself (opposed to bedtools coverageBed ...). Use picard-tools CalculateHsMetrics for this approach (you have to change your bed file a bit, but I'm sure you will manage to do that.

ADD COMMENT • link 11.4 years ago by Irsan ★ 7.8k

0

Entering edit mode

Well I have used bedtools:

samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed

function and the output looks like,

chr1    60817388        60817507        0       0       119     0.0000000
chr1    91226064        91226183        0       0       119     0.0000000
chr1    101711776       101711895       0       0       119     0.0000000
chr1    108003221       108003340       0       0       119     0.0000000
....

samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed -hist

Few lines of output:

chr1    60817388        60817507        32      1       119     0.0084034
chr1    60817388        60817507        33      2       119     0.0168067
chr1    60817388        60817507        34      3       119     0.0252101

Which output should be used to get the percentage of exons covered?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 11.4 years ago by hellbio ▴ 520

0

Entering edit mode

take the first output. Each line in the output corresponds to an exon. Consider the 7th column (the % of bases in the exon with non-zero coverage). If column 7 = 0, all bases in the exon have non-zero coverage, meaning that all bases have at least 1 read mapped to it. Count the amount of lines with column 7 = 0 and divide that by the total amount of lines:

awk 'BEGIN{FS="\t";totalExons=0;goodExons=0}{totalExons++;if($7==0){goodExons++}}END{print goodExons,"out of ",totalExons," have all bases covered by at least 1 read."}'

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 11.4 years ago by Irsan ★ 7.8k

0

Entering edit mode

Thanks, It gives the % of exons with atleast one read mapped to it. I would also like to find % of exons that have atleast 5,10,20 reads mapped to it. Could you suggest a way to do it or some script?

ADD REPLY • link 11.4 years ago by hellbio ▴ 520

Ram · Answer 2 · 2013-06-25

There is also the GATK DepthOfCoverage tool, which is what I use to get coverage data for exons with Exome-Seq projects now. I prefer its output over Picards metrics. You can set what you want reported by the tool, from there you can parse the tab delimited output based on whatever thresholds for coverage you want. I tend to look at the percentage of bases covered at at least 5x and 10x to identify regions of low coverage where I may have missed calling variants for instance. And obviously the zero coverage exons are easy to identify in the output.

A typical command line call for me is:

java -Xmx4g -jar /usr/local/bin/GenomeAnalysisTK.jar -T DepthOfCoverage -omitBaseOutput -omitLocusTable -R genome.fa -I A.bam -L Regions.list -o A.coverage -ct 5 -ct 10 -ct 15 -ct 20 -ct 30 -ct 50

GATK DepthOfCoverage

score 1 · Answer 3 · 2013-06-25

1

Entering edit mode

11.4 years ago

Jeremy Leipzig 22k

You can try Bamchop: https://github.com/CBMi-BiG/bamchop

ADD COMMENT • link 11.4 years ago by Jeremy Leipzig 22k

Ram · Answer 4 · 2013-06-25

0

Entering edit mode

11.4 years ago

Giovanni M Dall'Olio 28k

Have a look at the bedtools coverage function, and its documentation: http://bedtools.readthedocs.org/en/latest/content/tools/coverage.html

samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 11.4 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Dear Friend, I have a sample whose read count is 51,578,482. After Duplicate removal, i have about 46,393,168 reads with me!! Mapped read count is 46,676,207(99.36%) & on Target read count is 38,221,722 (81.36). Do u think it is a good output? What should be the ideal on target mappability in term of %. I have used Agilent V6+UTR, 150 PE

ADD REPLY • link 7.6 years ago by alok.helix ▴ 120

0

Entering edit mode

Your mapped and on-target percentages look good. What people will consider "good" for these statistics really depends on your experiment so there isn't a hard and fast cut-off value to use. But those numbers look fine.

ADD REPLY • link 7.6 years ago by DG 7.3k

0

Entering edit mode

Thanks, Dan, I also would like to ask the permissible duplication allowed? I am Having ~8% duplication of reads in my samples!! I guess this should be just fine. Any valuable comments??

ADD REPLY • link 7.6 years ago by alok.helix ▴ 120

0

Entering edit mode

The duplication rate also depends heavily on your experiment but since you are doing exon capture 8% seems reasonable. The metrics you have posted all seem reasonable to me and in line with what I normally expect.

ADD REPLY • link 7.6 years ago by DG 7.3k