I have a list of ~2000 genes that lie in copy number variations (CNVs). We suspect these CNVs are related to the observed phenotype. I am trying to determine whether (and how many of) the overlapped genes are involved in a particular developmental process.
Currently I'm trying to hack together a Python script that text-mines iHOP (getLatestSymbolInformation) via wget for every gene in my list. I would then search each XML response for processY and output a 1 for each gene where processY was found in the iHOP response and a 0 where it was not. I could run the same script on a list of genes in CNVs of control subjects and see whether more of the case genes are associated with processY than control genes.
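To make the 1/0 step concrete, here is a minimal sketch of the flagging described above. It assumes each response has already been saved under a directory as a file named after the gene symbol; the function name flag_genes and its parameters are hypothetical, and the match is a plain case-insensitive substring search rather than real XML parsing.

```python
import io
import os


def flag_genes(gene_symbols, results_dir, term):
    """Return {gene: 1 or 0} depending on whether `term` occurs in the saved response."""
    flags = {}
    needle = term.lower()
    for gene in gene_symbols:
        path = os.path.join(results_dir, gene)
        try:
            # Undecodable bytes are replaced so one odd response cannot abort the run.
            with io.open(path, encoding="utf-8", errors="replace") as handle:
                text = handle.read().lower()
        except IOError:
            text = ""  # no saved response for this gene: count it as "not found"
        flags[gene] = 1 if needle in text else 0
    return flags
```

Calling flag_genes(genes, "iHOP_results", "processY") on the case list and again on the control list would give the two sets of flags to compare. A substring match will also hit on unrelated mentions of the term, so the counts are only a rough proxy.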
    # Get all XMLs from iHOP
    import os
    import subprocess
    import sys

    fname = raw_input('Enter the gene list filename: ')
    try:
        fhand = open(fname)
    except IOError:
        print 'File cannot be opened:', fname
        sys.exit(1)

    if not os.path.isdir('iHOP_results'):
        os.mkdir('iHOP_results')

    with fhand:
        for line in fhand:
            gene_symbol = line.strip()
            if not gene_symbol:
                continue  # skip blank lines
            iHOP_url = "http://ws.bioinfo.cnio.es/iHOP/cgi-bin/getLatestSymbolInformation?synonym=%s&ncbiTaxId=9606" % gene_symbol
            # Pass an argument list instead of a shell string: with shell=True the
            # unquoted '&' in the URL sends each wget to the background and drops
            # ncbiTaxId, so a long list launches every download at once.
            subprocess.call(["wget", "-O", "iHOP_results/%s" % gene_symbol, iHOP_url])
This works if my gene file has only a couple of entries, but with a large file it fails with:
unable to resolve host address `ws.bioinfo.cnio.es' failed: Name or service not known.
Does anyone have ideas on 1) how to fix my Python script, or 2) a better method of testing whether the case gene list is more closely associated with a specific developmental process than a control gene list?
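For reference, the case-versus-control comparison could be framed as a 2x2 table (flagged or not, case or control) and tested with a one-sided Fisher's exact test. The sketch below is only an illustration of that idea using the standard library; the function names are hypothetical, and it treats every gene as an independent observation, which genes sharing a CNV may not be.

```python
from math import exp, lgamma


def _log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_one_sided(case_hits, case_total, control_hits, control_total):
    """Probability of at least `case_hits` flagged case genes under the null.

    Sums the hypergeometric probabilities of every table at least as extreme
    as the observed one, holding the row and column totals fixed.
    """
    total = case_total + control_total
    hits = case_hits + control_hits
    log_denominator = _log_choose(total, case_total)
    p_value = 0.0
    for k in range(case_hits, min(hits, case_total) + 1):
        log_p = (_log_choose(hits, k)
                 + _log_choose(total - hits, case_total - k)
                 - log_denominator)
        p_value += exp(log_p)
    return min(p_value, 1.0)
```

With 3 of 3 case genes flagged and 0 of 3 control genes flagged, fisher_one_sided(3, 3, 0, 3) comes out at about 0.05, since only 1 of the 20 possible ways to pick 3 genes out of 6 selects all three flagged ones.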
See: Retrieve All Genes Associated With A Go Term