Question

Help With Bp_Fetch

0

Entering edit mode

11.4 years ago

chodar ▴ 10

Hi,

I have a big multifasta file and I need get the sequences of many IDs, provide me in a different file. The problem is that many of this IDs are just a partial string of the original ID sequence used when the index was created. To be more clear. I need find the sequence correspond to "ID1" in a index created from a fasta file where the name for this sequence is ">xxxxx | ID1 | zzzzzz"

Can be regular expressions used with bp_fetch?

Thanks,

christian.

bioperl • 3.6k views

ADD COMMENT • link updated 11.4 years ago by SES 8.6k • written 11.4 years ago by chodar ▴ 10

score 1 · Answer 1 · 2014-04-09

1

Entering edit mode

11.4 years ago

Alex Reynolds 36k

You could use awk:

$ awk -v id="ID1" \
    ' \
    BEGIN { \
        needleFlag = 0; \
    } \
    { \
        if (/>/) { \
            if (needleFlag) { \
                exit 0; \
            } \
            n = split($0, a, "|"); \
            for (i = 1; i <= n; i++) { \
                p = "^ "id" $"; \
                m = match(a[i], p); \
                if (m) { \
                    needleFlag = 1; \
                    break; \
                } \
            } \
        } \
        if (needleFlag) { \
            print $0; \
        } \
    }' foo.fa > foo-ID1.fa

With the use of the -v option, this could be relatively easy to script.

This reads through the entire file until it finds and reports the sequence with the unique ID name, so it is probably useful for only one sequence.

If you need to search for more than one ID, then consider an approach with Python or Perl where it will be easier to do a regex on an array or list of identifier names.

ADD COMMENT • link 11.4 years ago by Alex Reynolds 36k

0

Entering edit mode

But, and correct me if Im wrong, you reconstruct the fasta file, modifing the IDs in conflict. That's imply that I have to rebuild the fasta index to do the bp_fetch. Im wrong?

ADD REPLY • link 11.4 years ago by chodar ▴ 10

0

Entering edit mode

I'm not familiar with bp_fetch, but if you want to extract a FASTA record from a large file based on the id name, that could be one approach. The output is written to a separate file called foo-ID1.fa.

ADD REPLY • link 11.4 years ago by Alex Reynolds 36k

0

Entering edit mode

+1 Nice awk command :-)

ADD REPLY • link 11.4 years ago by PoGibas 5.1k

score 1 · Answer 2 · 2014-04-09

I agree with Neilfws, but I'll tell you what I would do. The script bp_index.pl will read from stdin, so my suggestion would be to modify the query file with a simple one-liner to match your ID list and pipe that to bp_index.pl. Then you can use bp_fetch.pl as normal with your list of IDs. This will require you reconstructing the index but that will take much less time than writing a custom script for this one off task.

score 0 · Answer 3 · 2014-04-09

The short answer to this question is "no". You must supply the same ID to bp_fetch as was used to create the index.

The simplest solution would be to extract the original IDs from the sequences and map those to the partial IDs in the query file. Or else recreate the query file so as to contain the complete IDs.