From the research papers I have read, GISAID and NCBI are the two common open databases / web services that people use to download sequence data. Lately, I have started to use GISAID and ran into an issue. See how to download data from gisaid ?.
Setting that aside, I was trying to figure out the meaning of the data. In the readme.txt file, the FASTA header format is as follows:
Gene name|Isolate name|YYYY-MM-DD|Isolate ID|Passage details/history|Type^^location/state|Host|Originating lab|Submitting lab|Submitter|Location(country)
Here is how I understand each field:
Gene name : a specific gene name.
Isolate name: name of this specific isolate.
YYYY-MM-DD: isolate date.
Isolate ID: each isolate has an ID.
Passage details / history: I do not know what this means.
Type^^location / state: the sampling city.
Host: whether it is a human host or other animal host
Originating lab: the lab from which this isolate originated
Submitting lab: the lab which submitted this isolate
Submitter: the person who submitted this isolate
Location(country): country of this isolate
So, did I understand this correctly? What is passage details / history? How about NCBI datasets? To do a basic phylogenetic analysis, I might only need the sequences, the sampling date and the sampling location? What is your opinion?
Then, I started to work on the file allprot1109.fasta. I'm not sure why it is named like this. The first record in the file looks like this:
>NSP1|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124|Original|hCoV-19^^Hubei|Human|Wuhan Jinyintan Hospital|Wuhan Institute of Virology|Wuhan Institute of Virology|China
MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGG
So, is NSP1 the gene name? It does not look like one, and it does not seem to fit the header format described in the readme.txt file. Can anyone please help me clarify this?
For simple research purposes, maybe I can ignore this and focus only on the sampling/isolate location, the sampling/isolate time, and the sequence. So, from the first record above, I would like to extract something like
Wuhan/2019-12-30/MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGG
right? A simple script in Python/C/C++/Matlab could extract this. Could anyone share some simple code for this initial data preprocessing step?
One more question: I guess this is an amino acid sequence, as the letters indicate. Is this correct? If so, I might also need to convert it into its nucleotide counterpart.
How about sequence data from NCBI? Would that be better or worse?
There is no better or worse. NCBI's COVID19 portal hosts 39K genomes. You can also get the proteins by clicking on the Protein tab (425K+ entries as of now). GISAID is not an open database, so there may not be many people here with experience with it. Try using open resources: you'll have a better chance of finding people who use them and can answer your questions, especially as you don't seem to care about where the data comes from.
Yes
NSP1 is the name of the protein, since this is a protein sequence.
Passage details: if the virus underwent cell culture, this field indicates how many rounds.
You would only need a unique identifier per sequence. The other information is metadata for separate analyses.
NCBI and Nextstrain.org make pre-made phylogenetic analyses available. Do you have to do your own?
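If you do end up building your own dataset, here is a minimal Python sketch of the preprocessing step, assuming every header follows the pipe-separated layout quoted from the readme and that no field itself contains a `|`. The key names in `FIELDS` and the helper names `parse_header` and `read_fasta` are made up for illustration; they are not part of any GISAID tooling. The example sequence is truncated for brevity.

```python
# Keys mirror the header layout quoted from the readme; the snake_case
# names are illustrative assumptions, not official GISAID field names.
FIELDS = [
    "gene", "isolate_name", "date", "isolate_id", "passage",
    "type_location", "host", "originating_lab", "submitting_lab",
    "submitter", "country",
]


def parse_header(header):
    """Split one '>'-prefixed FASTA header into a dict of named fields."""
    parts = header.lstrip(">").strip().split("|")
    record = dict(zip(FIELDS, parts))
    # "Type^^location/state" packs two values separated by "^^".
    virus_type, _, state = record.get("type_location", "").partition("^^")
    record["type"] = virus_type
    record["state"] = state
    return record


def read_fasta(lines):
    """Yield (header_dict, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield parse_header(header), "".join(chunks)
            header, chunks = line, []
        else:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield parse_header(header), "".join(chunks)


example = [
    ">NSP1|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124|Original|"
    "hCoV-19^^Hubei|Human|Wuhan Jinyintan Hospital|"
    "Wuhan Institute of Virology|Wuhan Institute of Virology|China",
    "MESLVPGFNEK",  # truncated sequence, for illustration only
]

for rec, seq in read_fasta(example):
    print(rec["isolate_id"], rec["date"], rec["country"], rec["state"], len(seq))
```

For a real file you would pass an open file handle instead of the `example` list, e.g. `read_fasta(open("allprot1109.fasta"))`. Keying everything on the isolate ID keeps one unique identifier per sequence, with the remaining fields available as metadata.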
What do you mean by "pre-made phylogenetic analyses"?
NCBI has already done the phylogenetic analyses for you. You can download the tree/alignment data. You can also upload your own sequence in that tool I linked.
What is the general procedure for doing sequence analysis with SARS-CoV-2 data?
It depends on the goal of the analysis. Have a look at the workflows for COVID-19 analysis on usegalaxy.*. You can also find SARS-CoV-2/COVID-19 related workflows on the WorkflowHub.
Nextstrain provides a tutorial on genomic epidemiology of SARS-CoV-2 data.
I was reading the "preparing your data" section of the tutorial. It suggests downloading data from GISAID. However, when I tried to download data from GISAID, the files did not match what I saw several days ago, and certain files are missing. Now I can only see four files, named FASTA header format, allprot1118, spikeprot1118, and nextregions. Could anyone let me know why that is? Is it because GISAID data is not publicly accessible to everyone?
Yes. You need to apply, and they need to accept your application. The problem with GISAID is that you're not allowed to share the data, not even in a publication. I believe this is bad scientific practice, so I encourage you to use public data instead.