Simple question, although when I search Google all I get is how to extract the actual sequence. Does anyone have a quick solution for reading in a FASTA file and then extracting all the IDs in the same order they appear in the file?
If you don't mind keeping the ">", an easy grep would be a start:
    grep ">" input.fasta > headers.txt
But some more information would be great. What did you find for extracting the sequence? Do you want to write a script to do it? Then it might still help to take a closer look at how to extract the sequence and change that to extract the header. Do you want the full header? And, as we are currently discussing that topic in another post: What have you already tried?
Note that the ">" was not visible in your original question. Lines beginning with that character are formatted as blockquotes at BioStar. You need to indent the line with 4 spaces (done for you) to display it properly.
You can use grep and filter lines that start with '>'. If you have a more complex scenario (e.g. the header also contains description info), you'll have to write a regular expression that fits your headers.
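As a rough sketch of that idea: anchoring the pattern with `^` keeps only lines that start with ">", and a small sed step can strip the ">" and any description after the first whitespace. The sample records and the helper name `fasta_ids` are made up for illustration, and the sketch assumes the ID is everything between ">" and the first whitespace.

```shell
# Made-up sample input for illustration only.
sample=$(mktemp)
printf '%s\n' '>seq1 first record' 'ACGT' '>seq2 second record' 'GGCC' > "$sample"

# fasta_ids FILE: print the ID of every record in FILE, in file order.
# (fasta_ids is a hypothetical helper name, not an existing tool.)
fasta_ids() {
    # 1. keep only lines that start with ">"
    # 2. drop the leading ">"
    # 3. drop everything from the first whitespace character onwards
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

fasta_ids "$sample"
```

Because grep reads the file top to bottom, the IDs come out in the same order as the records.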
UPDATE
The code for your data should be something like:
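A minimal sketch, assuming the headers are pipe-delimited along the lines of `>sp|P0A334|NAME description`. That layout is an assumption; only the `P0A334` accession is mentioned in this thread. The other sample values and the helper name `fasta_field` are made up.

```shell
# Made-up sample; only the accession P0A334 comes from the thread.
sample=$(mktemp)
printf '%s\n' \
    '>sp|P0A334|NAME_ONE example description' \
    'MPPMLSGLLA' \
    '>sp|Q00000|NAME_TWO another description' \
    'MKTAYIAKQR' > "$sample"

# fasta_field FILE [N]: print the N-th "|"-separated field (default 2)
# of every header line, in file order. Hypothetical helper name.
fasta_field() {
    awk -F'|' -v n="${2:-2}" '/^>/ { print $n }' "$1"
}

fasta_field "$sample"
```

If the accession sits in a different position in your headers, pass that field number as the second argument, e.g. `fasta_field input.fasta 3`.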
You could use a Perl parsing script that identifies each line that begins with >, which will be the FASTA header. You can then split the line into an array, pick the ID by its position in the array (assuming the ID appears in the same position in every header), and print it. For example:
    #!/usr/bin/perl
    use strict;
    use warnings;

    # open the input file
    open (FASTA, '<fasta.fa') or die $!;

    # move through the lines of the input file, one by one
    while (<FASTA>) {
        # look for header lines by finding > at the beginning
        if ( /^>/ ) {
            # get the ID from the header
            my @header = split /\s/, $_;
            my @array = split /\|/, @header[0];
            my $id = @array[2];
            # print ID
            print $id, "\n";
        }
    }

    # close the file
    close FASTA;
@Emily_Ensembl - I seem to be getting a list of "n"s when running this, so I can see it is walking through the file, but it doesn't seem to strip out the ID.
There are a couple of things in your script that will cause problems. When you want a single element from an array, remember it is just a regular scalar, so you would write $array[0] and not @array[0]. The syntax you used is an array slice, which is not what you want here (because you have warnings enabled, Perl will tell you: "Scalar value @array[2] better written as $array[2]"). Assigning a slice to a scalar gives you the last element of the slice, so it happens to work with a single index, but it is not good form. Also, if a header has fewer fields than you expect, the element is undefined and you will get a "Use of uninitialized value ..." warning when you try to use it. Finally, you can print array or hash elements directly, so there is no need for the extra assignment.
The last thing is to avoid bareword filehandles and to use the 3-argument version of open with a lexical filehandle, e.g. open(my $fh, '<', 'fasta.fa') or die $!; These are considered "best practices" because they are safer ways to deal with files. Hope that helps.
Hi - updated with a snippet of the FASTA file. Out of that snippet I only want the 'P0A334' part, and then the same for the other sequences.