I have a list of genes and I'm looking for a programmatic way to retrieve the length of each gene. For instance, in UniProt the length is specified in the Sequences section. Can we query this using a script?
Many thanks for your time and attention.
Edit by @genomax: Since this question is about the length of a protein, the title has been edited to reflect that.
Hello deepamahm.iisc,
You've stated that you want to retrieve the length of a gene. What have you tried? Have you put any effort into thinking about this? The answer to these should be "yes", and you should elaborate a bit on:
If you don't demonstrate that you've put effort into the problem you're facing, people cannot be expected to help you solve it.
-Vijay
Please provide examples of IDs and the source from which you want to get this information.
Many thanks for the response. Sorry for not being specific: the gene IDs are UniProt identifiers, for example P35575 and its isoform P35575-1. The source I want to get this data from is UniProt.
Define "length of gene": is that all exonic regions only, or the introns as well? If your organism is annotated, try the annotation files from Ensembl or UCSC. From these GTF files you can extract the coordinates of each gene or exon.
The length that I am referring to is the one specified in UniProt. Please find the snapshot attached.
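To make the exonic option concrete, here is a minimal sketch that sums the merged exon intervals per gene from GTF lines. It assumes the standard nine-column, tab-separated GTF layout with a `gene_id "..."` entry in the attribute column; the function name `exonic_lengths` is made up for illustration. Overlapping exons from different transcripts are merged so that no base is counted twice.

```python
import re
from collections import defaultdict

# Assumption: the attribute column carries gene_id "..." as in Ensembl GTFs.
GENE_ID = re.compile(r'gene_id "([^"]+)"')

def exonic_lengths(gtf_lines):
    """Return {gene_id: number of bases covered by at least one exon}."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, intervals in exons.items():
        total = 0
        cur_start, cur_end = None, None
        for start, end in sorted(intervals):
            if cur_end is None or start > cur_end:
                # Close the previous merged interval before starting a new one.
                if cur_end is not None:
                    total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        # GTF coordinates are 1-based and inclusive, hence the +1.
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths
```

Passing an open file handle works, since the function only iterates over lines.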
If you obtain the FASTA record, you can obtain the length of that by just counting the characters. There will always be 'dispute' over what is the true length of a gene, though.
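As a rough sketch of the FASTA approach: the helper below counts the residues in the first record of a FASTA string, and a second helper downloads the record. The URL pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta` is an assumption about UniProt's REST interface, and both function names are made up for illustration.

```python
import urllib.request

def fasta_length(fasta_text):
    """Count sequence characters in the first record of a FASTA string."""
    length = 0
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second header means the first record has ended
            seen_header = True
            continue
        length += len(line)
    return length

def fetch_uniprot_fasta(accession):
    """Download one FASTA record; the URL pattern is an assumption."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

For a list of accessions you could call `fasta_length(fetch_uniprot_fasta(acc))` in a loop. Note that the number returned is the protein length in amino acids.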
If the NGS protocol is targeting spliced mRNA, then the 'gene length' should be that of the spliced mRNA that is transcribed and then sequenced, etc.
Thanks a lot Kevin! I have scripts for querying the FASTA sequences.
Be aware of the distinction to which b.nota relates, i.e., there is a difference between:
The one you choose may be important for downstream analyses.
Excuse me for the naive question. In microarray experiments, the probe binds to the target gene and the expression levels are obtained. I'm looking for the lengths of the genes reported in those studies. Currently, I look up the corresponding protein ID in UniProt to obtain the length and convert it to a gene length by multiplying by 3. But I understand from the answers to my post that what I'm doing is wrong. Could you please explain which of the 3 lengths stated above should be computed if I am looking at the genes reported in microarray results?
Be aware that the length you show in your attached figure is not the gene length but the protein length.
I usually download the Ensembl GTF file and read in each gene start and end sites using pd.read_csv, and compute the length using the start and end sites.
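A minimal sketch of that idea, written with the standard-library `csv` module rather than pandas so it has no dependencies (`pd.read_csv(path, sep="\t", comment="#", header=None)` would read the same columns). It assumes the standard nine-column GTF layout and a `gene_id "..."` attribute; `gene_lengths` is a made-up name.

```python
import csv
import re

# Assumption: the attribute column carries gene_id "..." as in Ensembl GTFs.
GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_lengths(gtf_lines):
    """Return {gene_id: genomic span in bp, introns included} from 'gene' rows."""
    lengths = {}
    # QUOTE_NONE keeps the double quotes inside the attribute column literal.
    reader = csv.reader(gtf_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#") or len(row) < 9:
            continue  # skip comments and malformed lines
        if row[2] != "gene":
            continue
        match = GENE_ID.search(row[8])
        if match:
            # GTF coordinates are 1-based and inclusive, hence the +1.
            lengths[match.group(1)] = int(row[4]) - int(row[3]) + 1
    return lengths
```

With a downloaded file this would be called as `gene_lengths(open("annotation.gtf"))`, where the file name is a placeholder. The result is the span from gene start to gene end, so it includes introns.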
Hope this helps.