Is it possible to download the full EpiCoV database from GISAID?
3
3
4.7 years ago
dgarcia54 ▴ 30

Hello everybody!

As the title of the post says, I need to download all the SARS-CoV-2 isolates from the GISAID database. I've been searching for a way to download the whole dataset, but I haven't been able to find out how to do it, or whether it's even possible! Currently I am downloading them one by one, but there are more than 700 entries... I hope someone can help.

Thanks in advance!

GISAID genome Database EpiCoV • 15k views
ADD COMMENT
3
4.7 years ago

There used to be an awkwardly placed button in the lower right corner that would get you all the sequences. There is also a separate Excel file with some metadata.

To get all the metadata you will have to download each PDF file.

ADD COMMENT
0

Thanks Istvan! I found the button that downloads an Excel file, but I did not find the other one... Thanks a lot for your answer! I will keep looking for a way to do it!

ADD REPLY
0

I also could not find the 'link' for the sequences. I sent a help message through the website and am waiting for the reply. A month ago, I chose the most "STUPID" way and downloaded the sequences one by one (~200 records). I will not do that again (1k+ records to date!).

ADD REPLY
0

I looked up my records. I complained profusely about this issue more than a month ago. On February 13th, 2020, their "support" personnel sent me the image below as the method for downloading all the data at once. Indeed, the button was there, but in an awkward spot, all the way at the bottom left of the page, where I had not noticed it before.

After I pointed out how absurd the whole system is, they removed my account, and I've not been able to log in since.

Does this button not exist anymore?

[Image: screenshot sent by GISAID support showing the download button]

ADD REPLY
1

No Download button there, I checked it again.

I cannot believe they REMOVED your account. I just sent them a message about the download issue; I hope they will not be very angry about it!

ADD REPLY
0

Thanks GISAID, I heard back from the GISAID website and received all the FASTA sequences in one file. They made the batch download available to me (button at the bottom right of the page). I was told NOT to share the data with anyone else; I am responsible for the data.

What I did was click the "Contact" button and ask for help. I hope this helps. @dgarcia54, you need to do it yourself.

ADD REPLY
0

After I attempted a batch download, they removed my account as well, with no notice. I'm not sure how to move forward from this point.

ADD REPLY
0

I would contact the GISAID website maintainer through the "Contact" page.

What do you mean by "batch download": through the "Download" button, or did you use a script to parse the website? How long after the batch download did you find that your account had been removed?

ADD REPLY
0

I have contacted them through the Contact page, but they haven't gotten back to me. I also found in their terms of use that they have a clause stating that they can remove accounts without reason, notice, or explanation, so they may simply not respond to the message I sent via the Contact page.

The 'Download button' didn't exist at the point I was trying to download, so I mean using a script I found on GitHub (which had apparently previously been used successfully for such multiple downloads from the page) to parse the website. I wasn't able to get the script to work on my machine, though; I spent approximately 8 hours trying to get it to work before I found my account was no longer accessible.

ADD REPLY
0

I see, thanks for your reply.

I don't think it is a good idea to use a "spider" to parse the website; this is the main reason your account was banned.

You can keep trying to contact the maintainer, but do not break their rules.

ADD REPLY
0

Fair. Since someone had done it successfully before me with no negative feedback from GISAID, I thought it was alright. I had also checked the terms of use, and this is not listed as forbidden or against the rules.

ADD REPLY
0

data/gisaid_cov2020_sequences.fasta

I am unable to get this from the site. There is no download option for it; only the acknowledgement table is there.

ADD REPLY
1

In my experience, you send the website maintainer a message through "Contact" at the top right of the page, and they will activate the "download" button for you.

ADD REPLY
0

Thanks, I have also messaged them through Contact, but nothing has changed so far.

Why is this feature activated for some users and not for all? I cannot figure this out.

ADD REPLY
0

I'm not sure what is going on with that website. I received access this morning. Using my desktop computer, I could not find the download button anywhere on the page. I logged into GISAID from my laptop and the button was magically there - same browser version, same version of Windows.

ADD REPLY
3
4.7 years ago
Michael ▴ 270

I just got access to GISAID. Their interface is pathetic. Please, researchers, upload your data to INSDC (http://www.insdc.org/), which means uploading to NCBI GenBank, ENA or DDBJ. Or maybe also to the China National GeneBank. Please keep the data open!

For example, NCBI's interfaces are orders of magnitude better. Why maintain such a secondary (inferior) infrastructure?

Or is there any option for batch download of all genome assemblies? From the comments in this forum I don't think so (or not any more).

ADD COMMENT
1

The download button exists and works just fine. I don't know why some people are having problems with it. The gisaid server seems to be under a high load quite often. I don't know..

Whenever I download the complete file, this is the very first thing I do:

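# Split the multi-FASTA into one .fna file per record. The file name is the header without ">",
# with "/" and spaces replaced by "_" and the last character (presumably the DOS carriage return) dropped.
# Gap characters ("-") are removed from the sequence lines.
awk '{if(/^>/){h=$0;gsub(">","",$0);gsub("/","_",$0);gsub(" ","_",$0);n=substr($0,1,length($0)-1);print h>(n".fna")}else{gsub("-","",$0);print $0>(n".fna")}}' gisaid_cov2020_sequences.fasta
mv gisai* ../

for f in *fna; do
    dos2unix "$f"
    # Linearize: header, a tab, then the whole sequence on one line
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$f" > "$f".tmp
    # Trim leading and trailing runs of N/n, then write header and sequence on separate lines
    awk 'BEGIN{FS="\t";OFS="\n"}{gsub(/^N*/,"",$2);gsub(/^n*/,"",$2);gsub(/N*$/,"",$2);gsub(/n*$/,"",$2);print $1,$2}' "$f".tmp > "$f"
    rm "$f".tmp
done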
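
A quick sanity check after the split, in case it helps: the number of .fna files should equal the number of headers in the original file, provided every header produced a distinct file name. The sketch below is not part of the script above. The function name compare_split_counts is made up for illustration, and the usage line assumes the original FASTA was moved to the parent directory, as in the mv step.

```shell
# compare_split_counts FASTA DIR: hypothetical helper, not an existing tool.
# Compares the number of ">" headers in FASTA with the number of *.fna files in DIR.
compare_split_counts() {
    records=$(grep -c '^>' "$1")
    # tr removes the padding that some wc implementations add
    files=$(find "$2" -maxdepth 1 -type f -name '*.fna' | wc -l | tr -d ' ')
    if [ "$records" -eq "$files" ]; then
        echo "OK: $records records, $files files"
    else
        echo "MISMATCH: $records records, $files files"
    fi
}

# Usage, run from the directory that holds the split files:
# compare_split_counts ../gisaid_cov2020_sequences.fasta .
```

A mismatch would most likely mean that two headers collapsed to the same file name after the "/" and space replacement.
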
ADD REPLY
0

OK, now it is there... It was not there before.

Maybe you have to specifically ask for it. I wrote them a mail yesterday that batch download would be very helpful. They might have added a flag for batch download now. Whatever...

EDIT: Thanks for the script by the way!!

ADD REPLY
0

I've been using the site since Jan and from what I recall, the button appeared when there were like 200+ genomes..

ADD REPLY
0

yeah, think about that for a second, you get data from an official repository and the first thing you need to do is fix it up. They used to even label Hong Kong with a space where no spaces are allowed, thereby breaking the fasta id and many tools that rely on unique ids to work properly.
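
One way to see whether a downloaded file is affected is to scan the headers. This is only an illustrative sketch, not part of anyone's workflow in this thread: the function name check_fasta_ids is made up, and the usage line assumes a local copy with the file name used in the script above. It reports headers that contain a space, and IDs (the text up to the first whitespace) that occur more than once.

```shell
# check_fasta_ids FILE: hypothetical helper, not a GISAID or Biostar tool.
# Prints "SPACE <header>" for headers that contain a space and
# "DUPLICATE <id>" the second time an ID (text up to the first whitespace) is seen.
check_fasta_ids() {
    awk '
        /^>/ {
            sub(/\r$/, "")                  # drop a DOS carriage return, if present
            header = substr($0, 2)          # header text without the leading ">"
            if (header ~ / /) print "SPACE " header
            split(header, parts, /[ \t]+/)
            id = parts[1]
            if (++seen[id] == 2) print "DUPLICATE " id
        }
    ' "$1"
}

# Usage, assuming the file is in the current directory:
# check_fasta_ids gisaid_cov2020_sequences.fasta
```

With a space in the location name, every record from that location is truncated to the same ID by tools that stop at the first whitespace, which is exactly the duplicate case the second message flags.
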

ADD REPLY
1

Well, it's not ideal. However, I do understand that labs want to be acknowledged for their contributions. Especially here it's more than likely that some people involved in the process from sample taking to library preparation have died from the disease..

ADD REPLY
1

yes fully agree,

but let's also realize that acknowledging/crediting/recognizing/citing and valuing work has absolutely nothing to do with limiting access and relicensing data.

ADD REPLY
0

Istvan, 100% agree. I think the way NCBI, ENA and others handle this provides the same level of acknowledgment.

To me it looks a bit like this is just some way of monopolizing the data in order to still be "important".

ADD REPLY
0

When you use sequence data from the NCBI, ENA, etc, you do not have to acknowledge the sequence contributor and nobody does. I think in particular GISAID should improve access to the sequence data, but anyway, what we have now is far better than what we had during any past outbreak..

ADD REPLY
1

Recognition is implicit though. You always refer to NCBI/ENA accession number(s) that anyone can look up easily to see the metadata associated with them.

ADD REPLY
1

In what way does a statement like "the sequences have been obtained from GISAID and you can only get them from there" acknowledge the original authors in any way?

In general, everyone does acknowledge the sources or at least it is a standard scientific practice to cite the origins of the data as long as you are finding something of interest.

For example, if you say A is most similar to B you will need to cite B for sure. That's the scientific recognition right there.

There is really no need to cite GISAID as the data source, yet that's what is happening, that is what they are after. They want to get cited and reap all recognition.

ADD REPLY
0

I'm not here to defend nor judge GISAID, but why do you think everyone is uploading their data there instead of say the NCBI? BTW If I had my way, the submitters would be required to share their raw sequence data. Like one third of the genomes are unusable because people have no clue of what they're doing

ADD REPLY
0

I think this should be investigated and understood. I don't know how it ended up like that or why. Possible reasons include:

  • the majority of scientists do not fully understand what GISAID does with their data
  • the majority of scientists do not understand that they are not even allowed to relicense their data in the first place
  • GISAID has the first mover's advantage.
  • Reviewers probably ask people to deposit where everyone else is already.

In general, I don't mean to imply that GISAID does not do beneficial things for scientists. They should be funded and supported for the value that they add to the process.

The petition and complaint is about allowing data access freely and in an unimpeded manner.

I can't believe that in 2020 during a major pandemic the source to the data is locked away.

ADD REPLY
1

I guess what would solve the problem is if GISAID would allow INSDC to sync data across their databases. Which is not possible the way GISAID operates at the moment.

Fun thing is: GISAID imports data from open INSDC databases and incorporates it into their EpiFlu database ;)

ADD REPLY
0

Istvan, this was my thinking as well! Why is it necessary to execute "dos2unix" first? WTF. :(

One should really encourage every single researcher to submit to NCBI, ENA, et al.

ADD REPLY
0

I tried to ask why, but, for example, the flu people are just used to submitting data to GISAID. It's mostly because other flu data are there too. It's weird because just a few years ago (at least 2014) they used NCBI for that.

ADD REPLY
0

Does this script add the batch download button? Or are other steps needed?

ADD REPLY
0

The script has nothing to do with the web UI

ADD REPLY
0

Thank you! I have already tried this, it didn't work.

ADD REPLY
1
4.7 years ago
5heikki 11k

Someone from Canada submitted like 50 genomes with nonsensical collection dates such as 2020-41-04

This is why we can't have nice things :(

edit. I know this is completely offtopic

ADD COMMENT
0

I thought there was curation before genomes were released into the database?

ADD REPLY
0

now think about how incompetent the GISAID people must be that they cannot automatically detect obviously wrong dates on submission.
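
For what it's worth, dates like that are easy to catch mechanically. A minimal sketch, assuming the collection dates have been exported to a plain text file with one YYYY-MM-DD value per line; the file name dates.txt and the function name find_bad_dates are made up for illustration. It only checks the month and day ranges, not month lengths or leap years, and it also flags partial dates such as 2020 or 2020-03.

```shell
# find_bad_dates FILE: hypothetical helper, not an existing tool.
# Prints "<line number>: <value>" for every line that is not a plausible
# YYYY-MM-DD date (month 01-12, day 01-31).
find_bad_dates() {
    awk '
        {
            sub(/\r$/, "")                  # drop a DOS carriage return, if present
            ok = ($0 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/)
            if (ok) {
                split($0, d, "-")
                month = d[2] + 0
                day = d[3] + 0
                ok = (month >= 1 && month <= 12 && day >= 1 && day <= 31)
            }
            if (!ok) print NR ": " $0
        }
    ' "$1"
}

# Usage, assuming one date per line in dates.txt:
# find_bad_dates dates.txt
```

A value like 2020-41-04 is reported because the month field is out of range.
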

ADD REPLY
