Hi,
I am analyzing 96 samples sequenced using NextSeq PE 150 reads. The goal is to study variants and per exon coverage of 10 genes. I have obtained the bed file for all the exonic regions of the 10 genes from UCSC genome browser. I am using trim_galore -> bowtie2 -> samtools pipeline for calling variants. I am using hg38 human genome assembly for alignment.
I calculated per exon coverage using bedtools intersect command as follows:
bedtools coverage -abam Sample-1.sorted.bam -b sorted.intervals.bed -counts > PerExonCoverage.txt
This gives me an output file in following format:
Chromosome Start End Coverage
chr1 161677476 161677553 6561
chr20 1895448 1895526 2377
chr20 1915099 1915455 12081
I want to know the coverage of each nucleotide position for all exons for e.g as follows:
chr1 161677476 13
chr1 161677477 120
chr1 161677478 130
and so on..
I have an intermediate mpileup file which I have used to extract coverage of all positions like this:
cut -f1-4 Sample-1.mpileup > Sample-1.depth
This gives me coverage for all positions of the reference genome and not for the positions of selected exons.
The above command also results in a file size of almost 60GB per sample which is very huge for the resources available to me.
Is there a way in which I can get the coverage of all positions but only for exons listed in BED using either mpileup or BAM file.
Help will be appreciated. Thanks in advance!!
Thanks Pierre for quick reply. I performed the above command and got the output but in the output the first position for every exon is excluded. For example for the exon range:
The coverage starts from the
46372904
position instead of46372903
. The length of the exon is 1059 bp but coverage is reported for 1058 bp. Should I subtract 1 from every start position and calculate again? Will this be a correct approach?Also, I changed the range from
chr3 46372903 46373961
tochr3 46372895 46373961
and it gives me coverage for the region outside exon as well. What is the reason for obtaining coverage outside the exonic regions? Shouldn't the coverage of regions outside exon be 0 or very low? The exonic region was captured using IDT predesinged gene capture pool.it's a bed file. It's ZERO based and uses half-open intervals. The output of samtools is ONE-based.
see Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems