I am a beginner with BLAST+, and I am using Windows. My aim for now is to download the nr protein sequences in FASTA format and format them with makeblastdb, then extract the first 1000 characters from the nr file as a separate file (say qa.fasta) and query it against the whole database.
I have downloaded the nr database in FASTA format from this link
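For reference, the workflow described in the question (format with makeblastdb, then query with blastp) could be sketched roughly as below. This is only an illustration: the file names nr, qa.fasta and qa_vs_nr.tsv and the helper function names are assumptions, and it presumes the BLAST+ executables are on the PATH.

```python
import subprocess


def makeblastdb_command(fasta_path, db_name):
    """Build the argument list that formats a protein FASTA file as a BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def blastp_command(query_path, db_name, out_path):
    """Build the argument list for a protein-protein search with tabular output."""
    return [
        "blastp",
        "-query", query_path,
        "-db", db_name,
        "-out", out_path,
        "-outfmt", "6",  # tab-separated hit table
    ]


def run(command):
    """Run one BLAST+ command and raise if it exits with an error."""
    subprocess.run(command, check=True)
```

Calling run(makeblastdb_command("nr", "nr")) followed by run(blastp_command("qa.fasta", "nr", "qa_vs_nr.tsv")) would then perform the two steps; on the full nr file the first step can take a long time.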
In the long term switching to using a UNIX style system may make sense. However there is a learning curve to take into account... I suggest trying a biology targeted Linux distribution, see http://en.wikipedia.org/wiki/BioLinux, in a virtual machine, for example using VirtualBox (https://www.virtualbox.org/) as a starting point.
Hi. First, I'm not sure "original" is the right term, but if you mean "do these FASTA files correspond exactly to the official nr db sequences?", the answer is yes. Second, it is normal for the db files to be split into several volumes. Nevertheless, I am not sure the db build ran to completion: personally, I've never tried it on nr, but NCBI provides a ready-to-go nr blastdb whose volumes run up to nr.05. Do you have the alias file (nr.pal)?
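One rough way to check whether the build finished is to read the volume names listed in the alias file and confirm that each volume has its files next to it. The sketch below assumes the alias file contains a DBLIST line naming the volumes and that each protein volume consists of .phr, .pin and .psq files; the function names are made up for illustration.

```python
import os

# File extensions that make up one protein BLAST database volume.
PROTEIN_EXTENSIONS = (".phr", ".pin", ".psq")


def volumes_in_alias(pal_path):
    """Return the volume names from the DBLIST line of an alias file."""
    with open(pal_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            tokens = line.split()
            if tokens and tokens[0] == "DBLIST":
                # Some BLAST+ versions quote the names, so strip the quotes.
                return [tok.strip('"') for tok in tokens[1:]]
    return []


def missing_volume_files(pal_path):
    """List expected volume files that do not exist beside the alias file."""
    base_dir = os.path.dirname(os.path.abspath(pal_path))
    missing = []
    for volume in volumes_in_alias(pal_path):
        for ext in PROTEIN_EXTENSIONS:
            candidate = os.path.join(base_dir, volume + ext)
            if not os.path.exists(candidate):
                missing.append(candidate)
    return missing
```

An empty list from missing_volume_files("nr.pal") suggests that every volume named in the alias file was written; it does not prove that the volumes themselves are complete.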
Finally, as Geparada told you, FASTA files are text files. So open it with any text editor (rather than a word processor, BTW: you don't want any grammar correction, or a Times New Roman font for ids and Arial Italic for sequences, and more importantly, you want to save your first 1000 aa as text, not doc, rtf...). The difficulty is actually not the type of file but its size. I've never tried on Windows, but a former coworker used Notepad++ and seemed happy with it.
The 'nr' BLAST database from NCBI contains additional information not present in the FASTA sequence data, since it is generated from the ASN.1. To ensure maximum compatibility, NCBI likely also uses a smaller part size, which avoids problems with some filesystems. So it isn't surprising that a manual build would give fewer parts.
If you want to stick with Windows, use gvim or something like it. It's more powerful than Notepad, it has no problem handling very large text files, and I think it's easier on the eyes than Notepad.
+1. Also, native text editors on Windows and OS X treat some whitespace characters a bit differently. For example, a line break is '\n' on Unix but '\r\n' on Windows (classic Mac OS used '\r').
I don't understand why you didn't download the preformatted databases from NCBI directly in the first place. You can BLAST against them directly and get practically any information out of them using the provided utilities. Even on Windows.
It is best to use an editor that can handle line-ending conversion: line endings differ between Windows and Unix, and some tools will fail on the wrong ones. Not all Windows-to-Unix converters handle this accurately. I personally prefer Notepad++, which can convert between line endings as well.
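For files too large for an editor, the same conversion can be done with a short script. The following is a minimal sketch that rewrites Windows line endings as Unix ones while reading the file in chunks, so the whole file never has to fit in memory; the function name and the chunk size are arbitrary choices for illustration.

```python
def convert_crlf_to_lf(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, replacing every CRLF with LF."""
    carry = b""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            chunk = carry + chunk
            # A trailing CR may be the first half of a CRLF that is split
            # across two chunks, so hold it back until the next read.
            if chunk.endswith(b"\r"):
                carry = b"\r"
                chunk = chunk[:-1]
            else:
                carry = b""
            dst.write(chunk.replace(b"\r\n", b"\n"))
        # A lone CR at the very end of the file is written unchanged.
        dst.write(carry)
```

Lone '\r' characters that are not followed by '\n' are left untouched, so this only handles the Windows-to-Unix case.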
When opening large fasta files, I have been more than satisfied with JWrite. All other editors used to crash from time to time, especially when handling really large datasets.
Thanks for the replies. Apologies for being late to get back.
I am working on a research project with my professor. That's why I downloaded the FASTA files, as I was asked to do so :)
The file is too big to be opened on Windows (by any editor), hence I need to extract the first 1000 chars, just to get one sequence so that I can run a BLAST with a test query.
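Since the goal is really one test sequence, it may be simpler to pull out the first complete FASTA record than a fixed number of characters, which could cut a sequence in the middle. A minimal sketch that reads the file line by line and stops at the second header, so the size of nr does not matter; the function names and the optional max_residues cap are assumptions for illustration.

```python
def first_fasta_record(path, max_residues=None):
    """Return (header, sequence) for the first record of a FASTA file."""
    header = None
    chunks = []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if header is not None:
                    break  # the second record starts here, so stop reading
                header = line[1:]
            elif header is not None:
                chunks.append(line.strip())
    if header is None:
        raise ValueError("no FASTA header found")
    sequence = "".join(chunks)
    if max_residues is not None:
        sequence = sequence[:max_residues]
    return header, sequence


def write_fasta(path, header, sequence, width=60):
    """Write one record as FASTA with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        out.write(">" + header + "\n")
        for start in range(0, len(sequence), width):
            out.write(sequence[start:start + width] + "\n")
```

For example, header, seq = first_fasta_record("nr", max_residues=1000) followed by write_fasta("qa.fasta", header, seq) would produce a small query file, assuming the downloaded file is named nr.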
Manu Prestat - Yes, I have the nr.pal file created.
Why do you need the first 1000 char? Why did you put bioperl in the tags?
I've removed the bioperl tag.