I just downloaded the .dat.gz files and gunzipped them. I am now wondering how I can obtain the .fasta sequences for all the entries within and then, in a separate file, all of the useful info such as associated GO terms, gene names, IPR terms etc.
How do people normally do this?
Thanks.
Check the README file and then download the data from the correct folders here.
Yes, I already read it; however, the .dat files are flat files. I don't really have much of a clue how to split them into a .fasta file for the sequences and another tab-delimited file for the associated info.
That is the point. I am not sure why you got the .dat files when the files you want are in a different directory:
1) Directory /current_release/knowledgebase
These are the .fasta files for the complete DB. I am just after plants. Are you familiar with UniProt? If so, is there a difference between downloading the files from the FTP server and doing a query search on the website, i.e. running a query and downloading everything under the Viridiplantae taxonomy?
I wonder if there are differences between the two.
The difference between the two is just as you describe it. You would need to do additional work to parse the things you need from the complete DB, whereas a query on the site does that for you.
A search via the web only allows you to select 400 entries at a time, so unless you have a ton of patience your only option is to get the full database and parse the data yourself.
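For anyone who does end up parsing the flat files themselves, here is a minimal sketch of how it could be done in plain Python. It assumes the standard UniProtKB flat-file layout (two-letter line codes such as AC, GN, DR and SQ, with each entry terminated by `//`). The function names, the output column names and the sample entry are all made up for illustration, and it only keeps the first gene name per entry.

```python
import io
import re


def new_entry():
    # Container for the fields we care about in one flat-file entry.
    return {"acc": [], "gene": "", "go": [], "ipr": [], "seq": []}


def parse_uniprot_dat(handle):
    """Yield one dict per entry from a UniProtKB flat-file (.dat) stream."""
    entry, in_seq = new_entry(), False
    for line in handle:
        code, data = line[:2], line[5:].strip()
        if code == "//":  # end of entry
            entry["seq"] = "".join(entry["seq"])
            yield entry
            entry, in_seq = new_entry(), False
        elif in_seq:  # sequence lines follow the SQ header
            entry["seq"].append(data.replace(" ", ""))
        elif code == "AC":  # accessions, semicolon-separated
            entry["acc"] += [a.strip() for a in data.split(";") if a.strip()]
        elif code == "GN" and not entry["gene"]:
            match = re.search(r"Name=([^;{]+)", data)
            if match:
                entry["gene"] = match.group(1).strip()
        elif code == "DR":  # database cross-references
            fields = [f.strip() for f in data.split(";")]
            if len(fields) > 1 and fields[0] == "GO":
                entry["go"].append(fields[1])
            elif len(fields) > 1 and fields[0] == "InterPro":
                entry["ipr"].append(fields[1])
        elif code == "SQ":
            in_seq = True


def write_fasta_and_table(entries, fasta_out, table_out):
    """Write sequences as FASTA and annotations as a tab-delimited table."""
    table_out.write("accession\tgene\tgo_terms\tinterpro\n")
    for e in entries:
        acc = e["acc"][0] if e["acc"] else "unknown"
        fasta_out.write(f">{acc}\n{e['seq']}\n")
        row = [acc, e["gene"], ",".join(e["go"]), ",".join(e["ipr"])]
        table_out.write("\t".join(row) + "\n")


# Made-up entry, only here to show the expected line layout.
SAMPLE = """\
ID   TEST_ARATH              Reviewed;          20 AA.
AC   P00001; Q00002;
GN   Name=ABC1; OrderedLocusNames=At1g00001;
DR   GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell.
DR   InterPro; IPR000001; Kringle.
SQ   SEQUENCE   20 AA;  2000 MW;  0000000000000000 CRC64;
     MAFSAEDVLK EYDRRRRMEA
//
"""

if __name__ == "__main__":
    import sys

    records = parse_uniprot_dat(io.StringIO(SAMPLE))
    write_fasta_and_table(records, sys.stdout, sys.stderr)
```

On a real download you would pass an open file (or `gzip.open(path, "rt")` for the .dat.gz directly) instead of the `io.StringIO` sample, and two output files instead of the standard streams. Because the parser is a generator it reads one entry at a time, which should matter for files with millions of entries.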
I was able to do a query and download the 3 million-odd Viridiplantae sequences at once. Handy.
HOWEVER... when I compared the query search sequences to the actual Viridiplantae taxonomic division flat files, the flat files included extra taxa, such as plant-associated pathogen taxa, Rhodophyta etc.
So the two are different.
To be clear, more sequences are available through the flat files under the FTP taxonomic divisions.
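In case it helps anyone repeat the comparison, this is roughly how the two sets could be checked against each other. It is only a sketch: it assumes both sets have been exported as FASTA with UniProt-style headers (`>sp|ACCESSION|NAME ...` or `>tr|ACCESSION|NAME ...`), and the function names and sample headers are invented for the example.

```python
import io


def fasta_accessions(handle):
    """Collect accessions from FASTA headers of the form >db|ACCESSION|NAME."""
    accs = set()
    for line in handle:
        if not line.startswith(">"):
            continue
        header = line[1:].strip()
        if not header:
            continue
        first = header.split()[0]
        parts = first.split("|")
        # Fall back to the first header token if it is not pipe-delimited.
        accs.add(parts[1] if len(parts) >= 3 else first)
    return accs


def compare_sets(a, b):
    """Return accessions unique to each set and those shared by both."""
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}


# Made-up headers standing in for the web query and the FTP download.
WEB_QUERY = ">sp|P00001|TEST_ARATH Test protein\nMAFS\n>tr|A0A001|X_ORYSA Other\nMKLV\n"
FTP_FILES = ">sp|P00001|TEST_ARATH Test protein\nMAFS\n>tr|B0B002|Y_PHYIN Pathogen\nMTTT\n"

if __name__ == "__main__":
    web = fasta_accessions(io.StringIO(WEB_QUERY))
    ftp = fasta_accessions(io.StringIO(FTP_FILES))
    for name, accs in compare_sets(web, ftp).items():
        print(name, sorted(accs))
```

The accessions that turn up only on the FTP side are the ones worth looking up to see which extra taxa they belong to.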