Question

Gene length database

2

Entering edit mode

11.0 years ago

chloe.p.oconnell ▴ 90

Hi,

I'm trying to do a pathways/enrichment analysis and in order to do so, I'm trying to divide the prevalence of a variant in a gene by the length of the gene. Is there a database online where I can easily download gene lengths? I'm working with gene symbols (i.e. KCNQ2) so anything that matched the gene symbol to a length would be preferable.

Thanks!

Database ncbi gene-length • 14k views

ADD COMMENT • link updated 3.1 years ago by Sahel • 0 • written 11.0 years ago by chloe.p.oconnell ▴ 90

Ram · Answer 1 · 2014-09-19

3

Entering edit mode

11.0 years ago

Pierre Lindenbaum 166k

Using mysql/ucsc:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select X.geneSymbol,K.name as transcript, K.chrom,K.txEnd-K.txStart as transcriptLen from kgXref as X,knownGene as K where X.geneSymbol="KCNQ2" and X.kgId=K.name '
+------------+------------+-------+---------------+
| geneSymbol | transcript | chrom | transcriptLen |
+------------+------------+-------+---------------+
| KCNQ2      | uc002yex.3 | chr20 |          1804 |
| KCNQ2      | uc002yey.1 | chr20 |         66452 |
| KCNQ2      | uc002yez.1 | chr20 |         66452 |
| KCNQ2      | uc002yfa.1 | chr20 |         66452 |
| KCNQ2      | uc002yfb.1 | chr20 |         66452 |
| KCNQ2      | uc002yfc.1 | chr20 |         38962 |
| KCNQ2      | uc011aax.1 | chr20 |         48707 |
+------------+------------+-------+---------------+

ADD COMMENT • link updated 3.7 years ago by Ram 45k • written 11.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Pierre, sorry for my stupid question. Here, you showed an example with single gene 'KCNQ2", right? Could you please tell me how to do this with a batch of gene, for example, if I have a file with the gene symbols? Can I put the mysql command in a loop to iterate over the gene symbols?

Thanks

ADD REPLY • link 9.2 years ago by taniasultana2004 ▴ 10

0

Entering edit mode

remove the statement ' X.geneSymbol="KCNQ2" and'

ADD REPLY • link 9.2 years ago by Pierre Lindenbaum 166k

Ram · Answer 2 · 2014-09-19

1

Entering edit mode

11.0 years ago

Alex Reynolds 36k

You might look up the RefSeq accession and version name:

$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/gene_RefSeqGene
...
$ grep 'KCNQ2' gene_RefSeqGene
9606    3785    KCNQ2   NG_009004.1

Then look up the alignment:

$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/GCF_000001405.26_refseqgene_alignments.gff3
...
$ grep 'NG_009004.1' GCF_000001405.26_refseqgene_alignments.gff3
NC_000020.11    RefSeq  match   63404189        63477640        73452   +       .       ID=aln1679;Target=NG_009004.1 1 73452 -;bit_score=136701;e_value=0;gap_count=0;num_ident=73452;num_mismatch=0;pct_coverage=100;pct_identity_gap=100;pct_identity_gapopen_only=100;pct_identity_ungap=100;score=73452

The accession NC_000020.11 tells you KCNQ2 is mapped to chromosome 20 at bases 63404189 to 63477640. Note that this alignment is to assembly hg38, which you may or may not be using:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr20%3A63404189-63477640

ADD COMMENT • link updated 3.7 years ago by Ram 45k • written 11.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Is it possible to do this in python?

ADD REPLY • link 11.0 years ago by chloe.p.oconnell ▴ 90

0

Entering edit mode

There is a library to wget in python. You can do regular expressions in python with re. Pierre's answer is better, probably.

ADD REPLY • link 11.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you for the code! I have a problem and that is when I get the file: wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/GCF_000001405.25_refseqgene_alignments.gff3 there are only 7750 GeneRef Accessions. I'm a bit confused about that because I need the length of all of the genes and not 7750 of them. Is there any way to find the length of all of the genes? Thanks!

ADD REPLY • link 3.1 years ago by Sahel • 0

Ram · Answer 3 · 2014-09-19

0

Entering edit mode

11.0 years ago

Devon Ryan 105k

I've not seen a convenient database of this (this isn't usually stored explicitly), but you can determine this given an annotation file easily enough. Here's an example of how to do that (and a bit more) in R: Normalization Of Rna Sequencing Counts (By Ercc / Gene Length)

ADD COMMENT • link updated 3.7 years ago by Ram 45k • written 11.0 years ago by Devon Ryan 105k

score 0 · Answer 4 · 2016-07-02

0

Entering edit mode

9.2 years ago

EagleEye 7.6k

A: Length of transcripts of mm9

ADD COMMENT • link 9.2 years ago by EagleEye 7.6k

score 0 · Answer 5 · 2016-07-03

0

Entering edit mode

9.2 years ago

anp375 ▴ 190

This was a long, complicated, and tiring process. Getting the variant density is annoying as well if you don't know how to use R very efficiently. I'm just about done with it. If I'm allowed to later, I will post the methods up. However, if you don't care about distinguishing regions that are intron/exon/UTR/coding/etc., then just look at the gtf. Extract the whole genes. If you want a specific region, you'll have to extract that type and concatenate the results from the gtf. If you're dealing with targeted sequencing, then the process is more complicated.