I'm trying to do a pathway/enrichment analysis, and to do so I want to divide the prevalence of a variant in a gene by the length of the gene. Is there an online database where I can easily download gene lengths? I'm working with gene symbols (e.g. KCNQ2), so anything that matches a gene symbol to a length would be preferable.
The accession NC_000020.11 tells you KCNQ2 is mapped to chromosome 20 at bases 63404189 to 63477640, which is a span of 73,452 bp (63477640 - 63404189 + 1). Note that this mapping is to assembly hg38, which you may or may not be using.
I've not seen a convenient database for this (gene length isn't usually stored explicitly), but you can compute it easily enough from an annotation file. Here's an example of how to do that (and a bit more) in R: Normalization Of Rna Sequencing Counts (By Ercc / Gene Length)
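For anyone who prefers not to use R, here is a minimal Python sketch of the same idea. It assumes a GTF that contains `gene` feature lines with a `gene_name` attribute (GENCODE and Ensembl GTFs are laid out this way, but check your own file). The function name and the example record are made up for illustration; only the coordinates come from this thread.

```python
import re

def gene_lengths_from_gtf(lines):
    """Return {gene symbol: genomic span in bp} from GTF 'gene' records."""
    lengths = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only whole-gene records are used here
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        if match is None:
            continue  # assumption: symbols are stored as gene_name
        start, end = int(fields[3]), int(fields[4])
        span = end - start + 1  # GTF coordinates are 1-based and inclusive
        symbol = match.group(1)
        # If a symbol occurs more than once, keep the longest span.
        lengths[symbol] = max(span, lengths.get(symbol, 0))
    return lengths

# Toy record using the KCNQ2 coordinates quoted in this thread;
# the source column and gene_id are placeholders.
example = [
    'chr20\texample\tgene\t63404189\t63477640\t.\t.\t.\t'
    'gene_id "G1"; gene_name "KCNQ2";\n'
]
print(gene_lengths_from_gtf(example))
```

With a real annotation you would pass an open file handle instead of the `example` list. Note that this measures the full genomic span of the gene, introns included.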
This was a long, complicated, and tiring process. Getting the variant density is annoying as well if you don't know how to use R very efficiently. I'm just about done with it, and if I'm allowed to later, I will post the methods. However, if you don't care about distinguishing intron/exon/UTR/coding regions, just use the GTF and extract the whole-gene records. If you want a specific region type, you'll have to extract features of that type from the GTF and concatenate the results. If you're dealing with targeted sequencing, the process is more complicated.
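As a rough illustration of the "extract one feature type and concatenate" route, here is a Python sketch. It merges overlapping intervals before summing, which is my own choice so that exons shared between transcripts are not counted twice. It assumes `gene_name` attributes in the GTF and that each symbol sits on a single chromosome; the function name is hypothetical.

```python
import re
from collections import defaultdict

def merged_feature_lengths(lines, feature="exon"):
    """Sum the non-overlapping length of one feature type per gene symbol."""
    intervals = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        if match:
            intervals[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, spans in intervals.items():
        spans.sort()
        total = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1  # 1-based, inclusive
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths
```

Passing `feature="CDS"` or `feature="UTR"` would work the same way, provided your annotation uses those feature names.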
Hi Pierre, sorry for my stupid question. Your example uses a single gene, "KCNQ2", right? Could you please tell me how to do this for a batch of genes, for example if I have a file of gene symbols? Can I put the mysql command in a loop to iterate over the gene symbols?
Thanks
Remove the statement ' X.geneSymbol="KCNQ2" and' from the query, so that it returns all genes rather than just one.
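Once you have lengths for all genes, there is no need to loop over mysql: you can filter the full table against your symbol list afterwards. A minimal Python sketch, assuming you have already loaded the table into a dict of symbol to length (the function names and the variant counts here are hypothetical):

```python
def lookup_lengths(symbols, length_table):
    """Split the requested symbols into those with a known length and those without."""
    found = {}
    missing = []
    for raw in symbols:
        symbol = raw.strip()
        if not symbol:
            continue  # ignore blank lines in the symbol file
        if symbol in length_table:
            found[symbol] = length_table[symbol]
        else:
            missing.append(symbol)
    return found, missing

def variants_per_kb(variant_counts, lengths):
    """Variant count divided by gene length, expressed per 1000 bp."""
    return {
        gene: variant_counts[gene] / lengths[gene] * 1000
        for gene in variant_counts
        if gene in lengths
    }

# Toy data: the KCNQ2 length comes from the coordinates quoted earlier,
# the variant count is invented for the example.
table = {"KCNQ2": 73452}
found, missing = lookup_lengths(["KCNQ2\n", "NOT_A_GENE\n"], table)
print(found, missing)
print(variants_per_kb({"KCNQ2": 12}, found))
```

In practice `symbols` can be an open file with one gene symbol per line. Symbols that end up in `missing` are often aliases or outdated names, so they are worth checking by hand.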