Whole exome length of exons from bam file
2
1
Entering edit mode
9.2 years ago

Hi,

I am trying to figure out the total length of all the exons covered in whole exome sequencing data which I downloaded from TCGA. In other words I am trying to figure out what is the total length of all the bp covered by the reads on the reference genome. Fastq file would probably not help since the number of reads would vary depending on the depth of sequencing. I tried to use the BAM files and samtools but could not find any tool in samtools that would give me total length of all the alignment coordinates. Is there a tool available to get this information or does someone know of a script that can help with this calculation. Basically I would need to extract the coordinates of all the unique alignments, calculate their length and add them up, unless there is some other subtlety that I am missing. Thanks for your help.

- Pankaj

whole-exome length • 3.6k views
ADD COMMENT
0
Entering edit mode
9.2 years ago
Darked89 4.7k

Maybe figuring out the number of bases with coverage > 0 will be good enough?

http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

Using samtools view you can filter out the zero quality reads, and digging deeper in SAM flags non-primary alignments.

Hope it helps

ADD COMMENT
0
Entering edit mode
Do you have a bed file of all the exons?
ADD REPLY
0
Entering edit mode

I don't have a bed file of the exons but I have sent a request to CGHub help desk to ask for one. Assuming I can get one, could you please advise how I could use it to get the information I need. Thanks.

ADD REPLY
0
Entering edit mode

To size a bed file

awk -F '\t' 'BEGIN{SUM=0}{SUM+=$3-$2}END{print SUM}' foo.bed
ADD REPLY
0
Entering edit mode

I used the following command to output [chr pos depth]:

bedtools genomecov -d -ibam bedtools_test_alignment_sorted.bam > bedtools_test_output.txt

but the out file is huge, almost 48 GB, for a BAM file of 8 GB. This file can be processed but will take a while. Is there some way that bedtools will only output positions with depth > 0. That would reduce the output file size a lot. Thanks.

ADD REPLY
0
Entering edit mode
9.2 years ago
bioguy24 ▴ 230

Will this awk pipe work?

bedtools genomecov -ibam aln.bam -bga | awk '$4==0' | head -n 2 

chr1 0 554304 0 
chr1 554314 554315 0
ADD COMMENT

Login before adding your answer.

Traffic: 2661 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6