Dear all,
I have been trying to download all complete bacterial genomes (specifically their protein .faa sequences) from RefSeq in order to create a DIAMOND database; however, I can only successfully download a very small portion of them!
What I do is:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
# keep the FTP path (column 20) of every assembly whose assembly_level (column 12) is "Complete Genome"
awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > assembly_summary_complete_genomes.txt
# the last path component ($10 when split on "/") is <accession>_<assembly name>; append "_protein.faa.gz" to build each file URL
awk 'BEGIN{FS=OFS="/";filesuffix="protein.faa.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' assembly_summary_complete_genomes.txt > ftpfilepaths
cat ftpfilepaths | parallel -j 20 --verbose --progress "curl -O {}"
gunzip *.gz
And here comes the issue! Out of the approximately 20,000 protein.faa.gz files, only ca. 200-400 can be extracted properly; for the rest I get "invalid compressed data--format violated".
If I build a new ftpfile from only the "corrupted" .gz files and re-download, it will again download everything, but only another 300-400 archives uncompress successfully while the rest are "invalid compressed data".
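For reference, this is roughly how I build the re-download list (assuming a "corrupted" archive is exactly one that fails gzip's integrity test):

# collect the names of archives that fail gzip -t, then keep only the matching URLs
for f in *_protein.faa.gz; do gzip -t "$f" 2>/dev/null || echo "$f"; done > corrupted_files.txt
grep -F -f corrupted_files.txt ftpfilepaths > ftpfilepaths_retry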
If I download the files that failed to uncompress one by one with curl, they download and uncompress fine. So is the problem with parallel, or with trying to download all of them at once?
I remember getting a similar error in the past, but it affected only a very small portion of the data, so I am wondering what is going on.
I am also quite troubled by the fact that I cannot find any internet threads with a similar issue. Am I the only one getting this? Am I doing something fundamentally wrong?
Thanks
P
Thanks, will give datasets a try. However, out of curiosity: I have used parallel in this context plenty of times in the past, I mean A LOT, and never had this problem. So, has something changed recently?

Have you tried reducing the number of parallel jobs to see if that results in successful downloads? Looks like you are using 20 at the moment.
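Something like this might be gentler on the server (an untested sketch; the job count and retry settings are guesses):

cat ftpfilepaths | parallel -j 4 --verbose "curl -sS --retry 5 --retry-delay 2 -O {}"

curl's --retry also smooths over transient failures, which is what truncated-but-present .gz files usually point to.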
There could be many reasons why this is no longer working (since it did in the past). My speculative list, in random order: I don't know how often the list of assemblies changes (weekly?), but perhaps you could just download the entries that changed instead of getting the entire set each time.
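A sketch of that idea (assuming you keep the previous run's list around as ftpfilepaths.old; comm needs sorted input and process substitution needs bash):

# keep only the URLs that were not in the previous run's list
comm -13 <(sort ftpfilepaths.old) <(sort ftpfilepaths) > ftpfilepaths_new

That leaves just the new entries, so each run fetches only what changed.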
Fewer jobs in parallel did the trick; only about 1% failed. But I have started using datasets as well, thanks!
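In case it helps anyone later, the datasets call I ended up with looks roughly like this (flag names vary between versions, so check datasets download genome taxon --help first):

# dehydrated download keeps the initial zip small; rehydrate then fetches the data files
datasets download genome taxon bacteria --assembly-level complete --include protein --dehydrated
unzip ncbi_dataset.zip -d bacteria_proteins
datasets rehydrate --directory bacteria_proteins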