Help With Bp_Fetch
3
0
Entering edit mode
10.7 years ago
chodar ▴ 10

Hi,

I have a big multifasta file and I need get the sequences of many IDs, provide me in a different file. The problem is that many of this IDs are just a partial string of the original ID sequence used when the index was created. To be more clear. I need find the sequence correspond to "ID1" in a index created from a fasta file where the name for this sequence is ">xxxxx | ID1 | zzzzzz"

Can be regular expressions used with bp_fetch?

Thanks,

christian.

bioperl • 3.3k views
ADD COMMENT
1
Entering edit mode
10.7 years ago

You could use awk:

$ awk -v id="ID1" \
    ' \
    BEGIN { \
        needleFlag = 0; \
    } \
    { \
        if (/>/) { \
            if (needleFlag) { \
                exit 0; \
            } \
            n = split($0, a, "|"); \
            for (i = 1; i <= n; i++) { \
                p = "^ "id" $"; \
                m = match(a[i], p); \
                if (m) { \
                    needleFlag = 1; \
                    break; \
                } \
            } \
        } \
        if (needleFlag) { \
            print $0; \
        } \
    }' foo.fa > foo-ID1.fa

With the use of the -v option, this could be relatively easy to script.

This reads through the entire file until it finds and reports the sequence with the unique ID name, so it is probably useful for only one sequence.

If you need to search for more than one ID, then consider an approach with Python or Perl where it will be easier to do a regex on an array or list of identifier names.

ADD COMMENT
0
Entering edit mode

But, and correct me if Im wrong, you reconstruct the fasta file, modifing the IDs in conflict. That's imply that I have to rebuild the fasta index to do the bp_fetch. Im wrong?

ADD REPLY
0
Entering edit mode

I'm not familiar with bp_fetch, but if you want to extract a FASTA record from a large file based on the id name, that could be one approach. The output is written to a separate file called foo-ID1.fa.

ADD REPLY
0
Entering edit mode

+1 Nice awk command :-)

ADD REPLY
1
Entering edit mode
10.7 years ago
SES 8.6k

I agree with Neilfws, but I'll tell you what I would do. The script bp_index.pl will read from stdin, so my suggestion would be to modify the query file with a simple one-liner to match your ID list and pipe that to bp_index.pl. Then you can use bp_fetch.pl as normal with your list of IDs. This will require you reconstructing the index but that will take much less time than writing a custom script for this one off task.

ADD COMMENT
0
Entering edit mode
10.7 years ago
Neilfws 49k

The short answer to this question is "no". You must supply the same ID to bp_fetch as was used to create the index.

The simplest solution would be to extract the original IDs from the sequences and map those to the partial IDs in the query file. Or else recreate the query file so as to contain the complete IDs.

ADD COMMENT

Login before adding your answer.

Traffic: 1596 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6