I used the following code in Python 3.5 to scrape data from multiple XML web pages. It worked once and then stopped working when I tried again. Do you know why, and how I can fix it? The following is part of a longer script:
from urllib.request import urlopen
import urllib
urls = ["https://www.ebi.ac.uk/ena/data/view/ERS1887141&display=xml",
        "https://www.ebi.ac.uk/ena/data/view/ERS1887140&display=xml",
        "https://www.ebi.ac.uk/ena/data/view/ERS1887139&display=xml"]
for url in urls:
    contents = None
    while contents is None:
        try:
            s = urlopen(url)
            contents = s.read()
        except urllib.error.URLError:
            print("urllib timeout")
            pass
It worked the first time I tried to open a single URL. It failed when I tried to open the same URL again, or another URL from the same domain. When I remove the exception handling, I get the error below. Increasing the timeout did not solve the problem. The error seems to be related to the TLS/SSL connection.
Unlike NCBI, the European Nucleotide Archive (ENA) does not provide the metadata (info table) for all samples in the study PRJEB99111 as a single file. I have only seen the metadata in HTML or XML format, hence my URL scraping. Please let me know if you can find this metadata/info file.
Traceback (most recent call last):
  File "/home/amirza/.conda/envs/py35/lib/python3.5/urllib/request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/http/client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/http/client.py", line 1152, in _send_request
    self.endheaders(body)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/http/client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/http/client.py", line 934, in _send_output
    self.send(msg)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/http/client.py", line 877, in send
    self.connect()
  File "/home/amirza/.conda/envs/py35/lib/python3.5/http/client.py", line 1261, in connect
    server_hostname=server_hostname)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/ssl.py", line 385, in wrap_socket
    _context=self)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/ssl.py", line 760, in __init__
    self.do_handshake()
  File "/home/amirza/.conda/envs/py35/lib/python3.5/ssl.py", line 996, in do_handshake
    self._sslobj.do_handshake()
  File "/home/amirza/.conda/envs/py35/lib/python3.5/ssl.py", line 641, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF) (_ssl.c:719)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/amirza/.conda/envs/py35/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/urllib/request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/home/amirza/.conda/envs/py35/lib/python3.5/urllib/request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error TLS/SSL connection has been closed (EOF) (_ssl.c:719)>
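One thing that stands out in the loop above is that it retries immediately and without limit, so a server that has started closing connections gets hit again as fast as the client can reconnect. Whether that is what triggers the closed TLS handshake here is a guess; the traceback does not confirm it. A minimal sketch of a bounded retry with a growing pause between attempts follows. `fetch_with_retry` and its parameters are hypothetical names, and the 30-second timeout and the delay values are arbitrary choices.

```python
import time
import urllib.error
from urllib.request import urlopen


def fetch_with_retry(url, opener=urlopen, max_attempts=5, base_delay=1.0,
                     sleep=time.sleep):
    """Return the response body for url, giving up after max_attempts tries.

    All names and default values here are assumptions made for this sketch.
    opener and sleep are injectable so the function can be tried offline.
    """
    for attempt in range(max_attempts):
        try:
            response = opener(url, timeout=30)
            try:
                return response.read()
            finally:
                response.close()
        except urllib.error.URLError as err:
            # Re-raise on the last attempt instead of looping forever.
            if attempt == max_attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... base_delay seconds before the next try.
            delay = base_delay * 2 ** attempt
            print("attempt {} failed ({}); retrying in {} s".format(
                attempt + 1, err.reason, delay))
            sleep(delay)
```

Calling `fetch_with_retry(url)` for each entry in `urls` would take the place of the `while contents is None` loop. If every attempt fails, the last `URLError` propagates, so a persistent TLS problem shows up as an error rather than an endless stream of "urllib timeout" messages.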
I am not sure exactly what metadata you are looking for, but if you click the "TEXT" link on this page, you can get a detailed dump of all samples (including the data download locations).
"TEXT" does not contain the metadata (age, sex, disease status, etc.). The metadata can be found by clicking the "Attributes" tab for each sample individually (147 samples in total). This page contains the metadata for one of the samples: https://www.ebi.ac.uk/ena/data/view/SAMEA104228118
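Since the attributes are only shown per sample, one way to build the table is to pull the tag/value pairs out of each sample's XML and collect them. The sketch below assumes the XML uses a `SAMPLE_SET` / `SAMPLE` / `SAMPLE_ATTRIBUTES` / `SAMPLE_ATTRIBUTE` layout with `TAG` and `VALUE` children; that layout should be checked against an actual response before relying on it. `sample_attributes` is a hypothetical helper, and `EXAMPLE_XML` is invented data used only to show the shape of the result.

```python
import xml.etree.ElementTree as ET

# Invented example document; the accession and values are not real data.
EXAMPLE_XML = b"""<SAMPLE_SET>
  <SAMPLE accession="EXAMPLE_1">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>42</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_attributes(xml_bytes):
    """Map each sample accession to a dict of its TAG -> VALUE pairs.

    Assumes the element names described above; hypothetical helper.
    """
    root = ET.fromstring(xml_bytes)
    table = {}
    for sample in root.iter("SAMPLE"):
        attributes = {}
        for attribute in sample.iter("SAMPLE_ATTRIBUTE"):
            tag = attribute.findtext("TAG")
            if tag is not None:
                attributes[tag] = attribute.findtext("VALUE")
        table[sample.get("accession")] = attributes
    return table
```

Running `sample_attributes` on the XML fetched for each of the 147 samples and merging the resulting dictionaries would give one row per sample, which can then be written out with the `csv` module.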