Hi Tariq,
I have a question about importing genome data in pyGeno. The human reference sequence data was downloaded locally on our HPC cluster, and the manifest.ini file was modified as shown below. When I import the genome, it reports a bug saying "sqlite3.OperationalError: disk I/O error". However, there is enough free disk space on the HPC. Could you tell me how to fix this issue?
The platform I use is Python 2.7.13 / pyGeno 1.3.1 on CentOS Linux release 7.3.1611 (Core).
Thank you very much.
Hao
manifest.ini
[package_infos]
description = Human reference genome
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda@umontreal.ca
version = 1
[genome]
species = human
name = GRCh37.75
source = http://useast.ensembl.org/info/data/ftp/index.html
[chromosome_files]
10 = Homo_sapiens.GRCh37.75.dna.chromosome.10.fa.gz
11 = Homo_sapiens.GRCh37.75.dna.chromosome.11.fa.gz
12 = Homo_sapiens.GRCh37.75.dna.chromosome.12.fa.gz
13 = Homo_sapiens.GRCh37.75.dna.chromosome.13.fa.gz
14 = Homo_sapiens.GRCh37.75.dna.chromosome.14.fa.gz
15 = Homo_sapiens.GRCh37.75.dna.chromosome.15.fa.gz
16 = Homo_sapiens.GRCh37.75.dna.chromosome.16.fa.gz
17 = Homo_sapiens.GRCh37.75.dna.chromosome.17.fa.gz
18 = Homo_sapiens.GRCh37.75.dna.chromosome.18.fa.gz
19 = Homo_sapiens.GRCh37.75.dna.chromosome.19.fa.gz
1 = Homo_sapiens.GRCh37.75.dna.chromosome.1.fa.gz
20 = Homo_sapiens.GRCh37.75.dna.chromosome.20.fa.gz
21 = Homo_sapiens.GRCh37.75.dna.chromosome.21.fa.gz
22 = Homo_sapiens.GRCh37.75.dna.chromosome.22.fa.gz
2 = Homo_sapiens.GRCh37.75.dna.chromosome.2.fa.gz
3 = Homo_sapiens.GRCh37.75.dna.chromosome.3.fa.gz
4 = Homo_sapiens.GRCh37.75.dna.chromosome.4.fa.gz
5 = Homo_sapiens.GRCh37.75.dna.chromosome.5.fa.gz
6 = Homo_sapiens.GRCh37.75.dna.chromosome.6.fa.gz
7 = Homo_sapiens.GRCh37.75.dna.chromosome.7.fa.gz
8 = Homo_sapiens.GRCh37.75.dna.chromosome.8.fa.gz
9 = Homo_sapiens.GRCh37.75.dna.chromosome.9.fa.gz
MT = Homo_sapiens.GRCh37.75.dna.chromosome.MT.fa.gz
X = Homo_sapiens.GRCh37.75.dna.chromosome.X.fa.gz
Y = Homo_sapiens.GRCh37.75.dna.chromosome.Y.fa.gz
[gene_set]
gtf = Homo_sapiens.GRCh37.75.gtf.gz
bug
>>> import pyGeno.bootstrap as B
>>> B.importGenome("Human.GRCh37.75/")
Importing genome package: /home/yeh/program/Python-2.7.13/lib/python2.7/site-packages/pyGeno/bootstrap_data/genomes/Human.GRCh37.75/... (This may take a while)
Importing:
description: Human reference genome
maintainer: Tariq Daouda
maintainer_contact: tariq.daouda@umontreal.ca
version: 1
Genome:
species: human
name: GRCh37.75
source: http://useast.ensembl.org/info/data/ftp/index.html
...
Importing gene set infos from /home/yeh/program/Python-2.7.13/lib/python2.7/site-packages/pyGeno/bootstrap_data/genomes/Human.GRCh37.75/Homo_sapiens.GRCh37.75.gtf.gz...
Backuping indexes...
Droping all your indexes, (don't worry i'll restore them later)...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yeh/program/Python-2.7.13/lib/python2.7/site-packages/pyGeno/bootstrap.py", line 105, in importGenome
PG.importGenome(path, batchSize)
File "/home/yeh/program/Python-2.7.13/lib/python2.7/site-packages/pyGeno/importation/Genomes.py", line 179, in importGenome
chros = _importGenomeObjects(gtfFile, chromosomeSet, genome, batchSize, verbose)
File "/home/yeh/program/Python-2.7.13/lib/python2.7/site-packages/pyGeno/importation/Genomes.py", line 257, in _importGenomeObjects
Transcript_Raba.flushIndexes()
File "build/bdist.linux-x86_64/egg/rabaDB/Raba.py", line 547, in flushIndexes
File "build/bdist.linux-x86_64/egg/rabaDB/rabaSetup.py", line 148, in dropIndexByName
File "build/bdist.linux-x86_64/egg/rabaDB/rabaSetup.py", line 224, in execute
sqlite3.OperationalError: disk I/O error
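One way to narrow this down: the traceback fails inside SQLite while dropping an index, not while reading the genome files. A possible cause on clusters (an assumption, nothing in this thread confirms it) is that the database file sits on a network filesystem such as NFS or Lustre where file locking is unreliable. The sketch below uses only the standard sqlite3 module and repeats the same kind of operations (create table, create index, drop index) in a directory of your choice. The function name sqlite_works_in and the probe file name are made up for illustration; which directory pyGeno actually uses for its database should be checked in its configuration.

```python
import os
import sqlite3
import tempfile


def sqlite_works_in(directory):
    """Try basic SQLite operations in `directory`.

    Returns (True, None) on success, or (False, error_message) if SQLite
    raises an error such as "disk I/O error".
    """
    # Hypothetical probe file name; it is removed again at the end.
    path = os.path.join(directory, "sqlite_io_probe.db")
    try:
        conn = sqlite3.connect(path)
        try:
            conn.execute("CREATE TABLE probe (id INTEGER PRIMARY KEY, v TEXT)")
            conn.executemany("INSERT INTO probe (v) VALUES (?)", [("a",), ("b",)])
            # Same kind of statements that fail in the traceback above.
            conn.execute("CREATE INDEX probe_v ON probe (v)")
            conn.execute("DROP INDEX probe_v")
            conn.commit()
        finally:
            conn.close()
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # Compare a local scratch directory with the directory holding the database.
    print(sqlite_works_in(tempfile.gettempdir()))
```

If the probe succeeds on local scratch space but fails in the directory where the database lives, the filesystem is the likely culprit and moving the database to local disk would be worth trying.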
This sounded like a cool tool, but I was unable to run it at all. The installation fails on my machine right away:
https://github.com/tariqdaouda/pyGeno/issues/2
Also, I strongly recommend decoupling the data download from the Python code. Python is not all that well suited to downloading massive datasets. At the very least, please provide alternative download sources via HTTP, rsync, or BitTorrent.
Thank you for bringing that up. The pip version was lagging behind; it is fixed now, but I recommend the git version.
I had a look at the issue. The problem was that the folders containing the datawraps were not included in the pip version. The rest of the installation went fine, and you can import datawraps using the importation module.
I would nonetheless recommend that you either update pyGeno to the latest pip version to get the missing datawraps, or switch to the git version to get the latest bleeding-edge updates.
Python is used for downloads to avoid dependencies on third-party software and keep the installation as simple as possible. That is also the reason why pyGeno comes with its own set of parsers.
The datawraps shipped with the bootstrap module only contain links to data made available by third parties such as Ensembl and dbSNP. You can also create your own datawraps by downloading the files independently and including them in the tar.gz archive, as explained here and here
That being said, pyGeno has been tested many times with both Ensembl and dbSNP, and we have never had any problems with the initial downloads.
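As a rough illustration of building such a datawrap by hand, the sketch below packs a manifest.ini and files that were already downloaded by other means (rsync, wget, ...) into a tar.gz archive using only the standard library. It assumes a flat archive layout in which the manifest refers to the data files by local file name, as in the manifest.ini quoted earlier in this thread; the helper name build_datawrap is hypothetical, and the exact layout pyGeno expects should be checked against its documentation.

```python
import os
import tarfile
import tempfile


def build_datawrap(archive_path, manifest_text, data_files):
    """Pack manifest.ini plus already-downloaded data files into a .tar.gz.

    Assumption: all files sit at the top level of the archive and the
    manifest refers to them by base name.
    """
    # Write the manifest to a temporary file so it can be added to the archive.
    manifest_path = os.path.join(tempfile.mkdtemp(), "manifest.ini")
    with open(manifest_path, "w") as handle:
        handle.write(manifest_text)

    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(manifest_path, arcname="manifest.ini")
        for path in data_files:
            tar.add(path, arcname=os.path.basename(path))
    return archive_path


# Shortened manifest modeled on the one quoted earlier in this thread.
EXAMPLE_MANIFEST = """[package_infos]
description = Human reference genome
version = 1

[genome]
species = human
name = GRCh37.75

[chromosome_files]
MT = Homo_sapiens.GRCh37.75.dna.chromosome.MT.fa.gz

[gene_set]
gtf = Homo_sapiens.GRCh37.75.gtf.gz
"""
```

The resulting archive would then be handed to the importation module mentioned above. Keeping the download step outside Python this way leaves the choice of transfer tool to the user.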
Thanks
Thanks for the fix. I like the concepts behind this package and want to test it out in practice. More feedback to follow.
Thank you, your feedback is greatly appreciated.