Question

How to obtain the non-redundant total size of the GRCh38 genome?

0

Entering edit mode

8.7 years ago

salvatore.digiorgio ▴ 10

I am trying to obtain the overall size of the exome, i.e. every single position that has coverage. As a control I need to find the theoretical size from the reference genome. I have a few questions:

Is there an official size for the exome from the GRCh38 genome?
if not, which is there efficient way to calculate it? A script to merge/coalesce all the exon from the reference genome?

Thanks,
Salvo

exome • 3.0k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 8.7 years ago by salvatore.digiorgio ▴ 10

1

Entering edit mode

It should be possible to convert the gene annotation into a GRange object (like here). Once this is done, computing exome size is easy if you know R (see here for instance).

ADD REPLY • link 8.7 years ago by Carlo Yague 9.0k

0

Entering edit mode

Hi Salvo,

You'll get all covered bases of your alignment, whether these fall into exons or not. I think the best way to get your statistics is to generate/download a exon bed file (Biomart would be the easiest way). With this you can compute directly the sum of all features, and intersect it with your coverage file.

You have to take care of alternative splicing within an exon, alternative polyadenylation site usage, etc.. Also you should think of including/excluding the alternative scaffold/patches (e.g. GL000194.1 , KI270726.1 ,...).

Cheers, Michael

[Update] This comment was written as a response to a given awk-script. Which is removed from the original post.

ADD REPLY • link 8.7 years ago by michael.ante ★ 4.0k

score 1 · Answer 1 · 2016-12-12

1

Entering edit mode

8.7 years ago

harold.smith.tarheel ★ 5.0k

The easiest way is probably to use BEDtools 'genomecov' with the genome and exonic GFFs.

ADD COMMENT • link 8.7 years ago by harold.smith.tarheel ★ 5.0k

score 1 · Answer 2 · 2016-12-12

1

Entering edit mode

8.7 years ago

venu 7.1k

This has been answered a while ago under - How to obtain the length of coding regions for the list of genes?

All you need is

GTF file
BedTools

ADD COMMENT • link 8.7 years ago by venu 7.1k

0

Entering edit mode

Thanks Venu for your helps, I used your command line:

cat gencode.v22.annotation.gtf | awk '{if($3=="exon") print $10"\t"$5-$4}' | sed -e 's/"//g' -e 's/;//' | bedtools groupby -i - -g 1 -c 2 -o sum > Exon_lengths.txt

and I obtained 290,590,453 (GRCh38 used in GDC data common for WXS).

ADD REPLY • link 8.7 years ago by salvatore.digiorgio ▴ 10

2

Entering edit mode

Just a couple of comments on this script...

Shouldn't feature length be $5-$4+1? Plus one because GTF starts from 1 not 0.
print $10 prints the first attribute of the of the attribute field (actually the string between the 9th and 10th blank character). This may be ok but it may be a bit brittle.
To compute the total length you cannot just sum Exon_lengths.txt. I think you should merge the coordinates of the exons after you extract them from the GTF. If exons overlap or are shared among transcripts then you double count their length if you don't merge.

ADD REPLY • link 8.7 years ago by dariober 15k

score 0 · Answer 3 · 2016-12-15

To obtain the same thing from Bam coming from WXS, I used this strategy:

I made the pileup with using the following command line:

samtools mpileup 0017ba4c33a07ba807b29140b0662cb1_gdc_realn.bam > 0017ba4c33a07ba807b29140b0662cb1_gdc_realn.pileup

later I used this command line:

awk '{if($4 > 10)SUM+=1} END{print SUM}' 0017ba4c33a07ba807b29140b0662cb1_gdc_realn.pileup

the coverage is $4, and I obtained 95,972,925.