Question

Gene symbol to mRNA sequence (refseq.rna)

0

Entering edit mode

9.8 years ago

rrcutler ▴ 120

I have designed code using modules from mygene.info for gene annotation query (converting gene symbol to refseq.rna) and Biopython for refseq.rna reference to an mRNA fasta file. I have designed this specifically to find mRNA-seq specifically for the model organism Xenopus Laevis.

While this works, it does not successfully find all the gene symbols that are input. For example using the input genes:

notch1, sox2, myf5, myod1, prmt5, wnt1, pax3, zic1, snai2, twist1, foxd3, chrd, bmp4, otx2, nog, lef1, dll1, tubb2b, neurog1, act3

I only can get a return of 11 mRNA sequences in the output fasta file. While all these genes are on Xenbase, I am confused as to why the program is unable to find them.

#Ronald Cutler - Saha Lab 150617
"""A program that takes gene symbol inputs in the Gene List.txt and
 then outputs respective mRNA sequences in the mRNA.fasta"""

import mygene
from Bio import Entrez

#initiate lists and mygene object
mg = mygene.MyGeneInfo()
ids = []
mRNA = []
not_found = []

#read gene symbols from inputfile to ids list
#change this directory if needed
inputfile = open('/Users/confocal/Desktop/Ronald Cutler/Targeted RNA Expression Programs/Gene symbol to mRNA-seq mapping/Gene List.txt', 'r')
for line in inputfile:
    ids.append(line.strip('\n'))
count1 = len(ids)

print('Gene sequence list:', ids)
print()
print('Numebr of input genes=', count1)
print() 
#executing mygene object
out = mg.querymany(ids, species='8355', scope = 'symbol, xenbase', fields='refseq.rna', size= 1)

#extracting mRNA NCBI references from dictionary
for dic in out:
    try:
        mRNA.append(dic[u'refseq.rna'])
    except KeyError:
        not_found.append(dic[u'query'])

#printing out symbols that could not be found
if len(not_found) > 0:
    for symbols in not_found:
        symbols.strip("u")`

    print('These genes could not be found in the database', not_found)
print(mRNA)
mRNA = " ".join(mRNA)
count2 = len(mRNA.split())

#checking if all input genes were given respective mRNA NCBI reference
if count1 != count2:
    print("Warning, either an mRNA sequence was not found or there is a" \
    "duplicate entry of a gene ID. Please Check")

#Searching and outputing respective mRNA sequences
Entrez.email = "rrcutler@email.wm.edu"
handle =Entrez.efetch(db="nucleotide", id=mRNA, rettype="fasta", retmode="text")
result = handle.read()
handle.close()

#writing output file
#change this directory if needed
outputfile = open("/Users/confocal/Desktop/Ronald Cutler/Targeted RNA Expression Programs/Gene symbol to mRNA-seq mapping//mRNA.fasta", "w")
outputfile.write(result)

print("Number of output mRNA sequences=", count2)

inputfile.close()
outputfile.close()

mRNA Xenbase Biopython NCBI RNA-Seq • 3.2k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 9.8 years ago by rrcutler ▴ 120

0

Entering edit mode

Specifically, the genes not found were: myod1', pax3', snai2', twist1', foxd3', bmp4', otx2', dll1', neurog1

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 9.8 years ago by rrcutler ▴ 120

0

Entering edit mode

is it possible you should have used 'pax3-a'?

ADD REPLY • link 9.8 years ago by russhh 5.8k

0

Entering edit mode

After trying that I received a result for that respective gene! Perhaps it's just a problem with input nomenclature then..

ADD REPLY • link 9.8 years ago by rrcutler ▴ 120

0

Entering edit mode

'Input nomenclature' makes it sound like a computer error rather than a human error.

You have to use the appropriate gene symbols.

ADD REPLY • link 9.8 years ago by russhh 5.8k