Hi all,
I have a bed file with the start and end positions of all the exons. How can we find out, what percentage of these exons are covered in the sequenced sample?
Any help would be highly appreciated.
Thanks
Hi all,
I have a bed file with the start and end positions of all the exons. How can we find out, what percentage of these exons are covered in the sequenced sample?
Any help would be highly appreciated.
Thanks
It depends a lot on how you define "covered". My suggestions:
Well I have used bedtools:
samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed
function and the output looks like,
chr1 60817388 60817507 0 0 119 0.0000000
chr1 91226064 91226183 0 0 119 0.0000000
chr1 101711776 101711895 0 0 119 0.0000000
chr1 108003221 108003340 0 0 119 0.0000000
....
samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed -hist
Few lines of output:
chr1 60817388 60817507 32 1 119 0.0084034
chr1 60817388 60817507 33 2 119 0.0168067
chr1 60817388 60817507 34 3 119 0.0252101
Which output should be used to get the percentage of exons covered?
take the first output. Each line in the output corresponds to an exon. Consider the 7th column (the % of bases in the exon with non-zero coverage). If column 7 = 0, all bases in the exon have non-zero coverage, meaning that all bases have at least 1 read mapped to it. Count the amount of lines with column 7 = 0 and divide that by the total amount of lines:
awk 'BEGIN{FS="\t";totalExons=0;goodExons=0}{totalExons++;if($7==0){goodExons++}}END{print goodExons,"out of ",totalExons," have all bases covered by at least 1 read."}'
There is also the GATK DepthOfCoverage tool, which is what I use to get coverage data for exons with Exome-Seq projects now. I prefer its output over Picards metrics. You can set what you want reported by the tool, from there you can parse the tab delimited output based on whatever thresholds for coverage you want. I tend to look at the percentage of bases covered at at least 5x and 10x to identify regions of low coverage where I may have missed calling variants for instance. And obviously the zero coverage exons are easy to identify in the output.
A typical command line call for me is:
java -Xmx4g -jar /usr/local/bin/GenomeAnalysisTK.jar -T DepthOfCoverage -omitBaseOutput -omitLocusTable -R genome.fa -I A.bam -L Regions.list -o A.coverage -ct 5 -ct 10 -ct 15 -ct 20 -ct 30 -ct 50
You can try Bamchop: https://github.com/CBMi-BiG/bamchop
Have a look at the bedtools coverage function, and its documentation: http://bedtools.readthedocs.org/en/latest/content/tools/coverage.html
samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed
Dear Friend, I have a sample whose read count is 51,578,482. After Duplicate removal, i have about 46,393,168 reads with me!! Mapped read count is 46,676,207(99.36%) & on Target read count is 38,221,722 (81.36). Do u think it is a good output? What should be the ideal on target mappability in term of %. I have used Agilent V6+UTR, 150 PE
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
have you tried bedtools and bedops?