Hi,
I have an exome BAM file and a BED file of CCDS. How can I find out what fraction of my CCDS is covered at a depth higher than a threshold? Eg., I want to know what fraction of CCDS is above 100X.
You can sort and index your bam file, then count the number of reads located in each region in bed file.
A simple example:
samtools sort exome.bam -o exome.sort.bam
samtools index exome.sort.bam
old_IFS=$IFS
IFS=$'\n'
for line in `cat CCDS.bed`; do
region=`echo $line | awk '{printf "%s:%d-%d", $1, $2, $3}'`
# count the number of reads located in each region
samtools view exome.sort.bam $region | wc -l
done
IFS=$old_IFS
You can use -@ to use more threads to speed up the iteration
Hi yztxwd, this will give me the number of reads in each region. I am looking for number of bases in each region where the coverage is more than a certain depth (say 100X). Can I somehow replace the last 'samtools view' command above to count the bases above 100X? Thanks!
Hi, I think you can just replace the $region with each base in region, like chr1:10000-10000, but iterate each base will dramatically increase the computation time.
May I ask why you want per base coverage? I think maybe a better choice is to use bamCoverage (I have never done it on whole genome, so I am not sure about running time), you can set the bin size to 1 to count per base coverage, then use the base coordinates in your region file to compute the fraction by other tools (like python or R)
More than the per base coverage, I am looking for the fraction of region in the bed file which is more than 100X to see how the experiment performed. Basically, I want to know how much fraction of the exome is covered with 100X or more. Do you know of a quick way to do that given the Bam file and the exome bed file? Thanks for your help.
Hi yztxwd, this will give me the number of reads in each region. I am looking for number of bases in each region where the coverage is more than a certain depth (say 100X). Can I somehow replace the last 'samtools view' command above to count the bases above 100X? Thanks!
Hi, I think you can just replace the $region with each base in region, like chr1:10000-10000, but iterate each base will dramatically increase the computation time.
May I ask why you want per base coverage? I think maybe a better choice is to use bamCoverage (I have never done it on whole genome, so I am not sure about running time), you can set the bin size to 1 to count per base coverage, then use the base coordinates in your region file to compute the fraction by other tools (like python or R)
More than the per base coverage, I am looking for the fraction of region in the bed file which is more than 100X to see how the experiment performed. Basically, I want to know how much fraction of the exome is covered with 100X or more. Do you know of a quick way to do that given the Bam file and the exome bed file? Thanks for your help.