How Do I Extract Sequence Ids From Fasta File?
4
1
Entering edit mode
11.5 years ago
arronslacey ▴ 320

simple question - although when I search google all i get is how to extract the actual sequence. anyone have a quick solution as how to read in a fasta file, and then extract all the ids in the same order they appear in the fasta file?

here is a snippet:

>gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel OS=Streptomyces lividans GN=kcsA PE=1 SV=1

and I only want the 'P0A344' part.

link to fasta file:

https://docs.google.com/file/d/0B0iDswLYaZ0zX1RJdGRrRUxiSEk/edit?usp=sharing

thanks!

fasta protein id • 18k views
ADD COMMENT
1
Entering edit mode

If you don't mind keeping the ">" and easy grep would be a start. grep ">" input.fasta > headers.txt.

But some more information would be great. What did you find for extracting the sequence? Do you want to write a script doing it? Then it might still help to have a closer look on how to extract the sequence and change it to extract the header. Do you want the full header? And, as we currently are in discussion about that topic in another post: What did you already try?

ADD REPLY
0
Entering edit mode

hi - updated with a snippet of the fasta file. out of that snippet i only want the 'P0A334' part, and then repeat for the other sequences.

ADD REPLY
0
Entering edit mode

Note that the ">" was not visible in your original question. Lines beginning with that character are formatted as blockquotes at BioStar. You need to indent the line with 4 spaces (done for you) to display it properly.

ADD REPLY
7
Entering edit mode
11.5 years ago
 awk -F '[| ]' '/^>/ { print $3}' < your.fasta
ADD COMMENT
0
Entering edit mode

excellent - thanks!

ADD REPLY
4
Entering edit mode
11.5 years ago
Biojl ★ 1.7k

You can use grep and filter lines that start with '>'. If you have a more complex scenario (i.e. it also contains description info) you'll have to generate the regular expression that best fits you.

UPDATE The code for your data should be something like:

grep "^>" file.fa | cut -c 12-17 > destination_file.txt

Where 12-17 are the character positions you want. You can play with it.

ADD COMMENT
2
Entering edit mode

I would do the same but use cut with specific delimiters (-d) and columns (-f) in case the number of characters varies:

grep "^>" file.fa | cut -d'|' -f3 | cut -d ' ' -f1 > destination_file.txt

ADD REPLY
0
Entering edit mode

+1 for a more elegant solution.

ADD REPLY
3
Entering edit mode
11.5 years ago
Emily 24k

You could use some kind of Perl parsing script that identifies each line that begins >, which will be the FASTA header. You can then turn the line into the array and identify the ID from its position in the array (assuming the ID appears in the same position in every header) and print the ID. For example:

#!/usr/bin/perl
use strict;
use warnings;

# open the input file
open (FASTA, '<fasta.fa') or die $!;

# move through the lines of the input file, one by one
while (<FASTA>) {

    # look for header lines by finding > at the beginning
    if ( /^>/ ) {

        # get the ID from the header
        my @header = split /\s/, $_;
        my @array = split /\|/, @header[0];
        my $id = @array[2];

        # print ID
        print $id, "\n";

    }
}

# close the file
close FASTA;
ADD COMMENT
0
Entering edit mode

@Emily_Ensembl - i seem to be getting a list of "n" 's when running this. so i can see it is walking through the file, but doesn't seem to strip out the id.

here is the link to the fast file: https://docs.google.com/file/d/0B0iDswLYaZ0zX1RJdGRrRUxiSEk/edit?usp=sharing

ADD REPLY
0
Entering edit mode

Fixed above. The line "my @array = split /\|/, @header[0];" was missing a backslash so it was splitting on characters rather than |s.

ADD REPLY
0
Entering edit mode

There are a couple of things in your script that will cause problems. When you want a single element from an array, remember it is just a regular scalar variable so you would write $array[0] and not @array[0]. The syntax you used is for a slice and that is not what you want (because you have warnings enabled, Perl will tell you that). Also, this is bad because assigning array elements to a scalar with a slice will usually give you a "Use of uninitialized value ..." warning if you try to use it (this assignment actually worked with a single element but this is not good form). You can avoid this issue because you can print array or hash elements directly, so there is no need in the extra step.

The last thing is to never use bare filehandles and use the 3-argument version of open. These are considered "best practices" because they are safer ways to deal with files. Hope that helps.

ADD REPLY
0
Entering edit mode
11.5 years ago
Anima Mundi ★ 2.9k

In Python:

for line in open('inputfile'):
    if '>' in line:
        print line,
ADD COMMENT

Login before adding your answer.

Traffic: 1800 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6