Well, answer is a little bit involved and uses biopython. First, we need to convert gene symbols do Entrez Gene IDs. I do this by brute force mapping. Get the gene_info from NCBI. Convert the file to some form of delimiter separated values file (I use tab separated values with header).
import csv
# Creates the mapping
with open('./data/Homo_sapiens.gene_info.tsv') as f:
reader = csv.DictReader(f, delimiter='\t')
mapping = dict()
for row in reader:
syn = row['Synonyms']
syn = syn.split('|')
tag = [row['GeneID'], row['Symbol']]
for s in syn:
mapping[s] = tag
mapping[row['Symbol']] = tag
with open('your_genes.txt') as f, open('mapped.tsv', 'w') as w:
header = ['GeneID', 'Symbol']
writer = csv.writer(w, header, delimiter='\t')
writer.writerow(header)
for line in f:
gene_symbol = line.strip()
if gene_symbol in mapping.keys():
writer.writerow(mapping[gene_symbol])
else:
print 'Not found: ' + gene_symbol
It's important to stress that GeneCards and NCBI not always share the same gene symbol/synonyms. Now, we can proceed to the fun part:
import csv
from Bio import Entrez, SeqIO
def fetch_mRNA_lengths(GeneID):
"""Get all mRNAs for a given GeneID from Entrez Gene and prints its accessions, GIs and lengths"""
try:
handle = Entrez.efetch(db='gene', id=GeneID, retmode='xml')
record = Entrez.read(handle)[0]
# The first Entrezgene_locus is for the GRCh38 Primary Assembly
locus = record['Entrezgene_locus'][0]
products = locus['Gene-commentary_products']
for product in products:
# 3 is for mRNA
if product['Gene-commentary_type'] == '3':
accn = product['Gene-commentary_accession']
gi = product['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
handle = Entrez.efetch(db='nucleotide', id=gi, rettype='gb')
record = SeqIO.read(handle, 'gb')
print GeneID, accn, gi, len(record)
except:
print 'Not found: ' + GeneID
Entrez.email = "insert.your@email.here"
with open('mapped.tsv') as f:
reader = csv.DictReader(f, delimiter='\t')
for row in reader:
fetch_mRNA_lengths(str(row['GeneID']))
Have fun!!!
What format do you have your data in?
Bed file format of the transcription start site
So you don't have end and exon coordinates? Are these novel transcripts or known transcripts?
No only the TSS and these are known genes only to start with yes
Would it then suffice to use the lengths of already annotated transcripts? Then you could download the transcript annotations from ucsc or ensembl as gff by using the transcript ids. If you do not have the transcript ids but only positions you could search all transcripts for overlap with or closest transcripts to your TSS.
Do you have gene IDs or Symbols? Do you need to know the length of the gene or the (average) length of known transcripts for that gene?
He sorry for late reply just saw it now. I have the gene offical symbols and would like to know the average lenght of known transcripts of the gene yes.