I have a big multifasta file and I need get the sequences of many IDs, provide me in a different file. The problem is that many of this IDs are just a partial string of the original ID sequence used when the index was created.
To be more clear. I need find the sequence correspond to "ID1" in a index created from a fasta file where the name for this sequence is ">xxxxx | ID1 | zzzzzz"
$ awk -v id="ID1" \
' \
BEGIN { \
needleFlag = 0; \
} \
{ \
if (/>/) { \
if (needleFlag) { \
exit 0; \
} \
n = split($0, a, "|"); \
for (i = 1; i <= n; i++) { \
p = "^ "id" $"; \
m = match(a[i], p); \
if (m) { \
needleFlag = 1; \
break; \
} \
} \
} \
if (needleFlag) { \
print $0; \
} \
}' foo.fa > foo-ID1.fa
With the use of the -v option, this could be relatively easy to script.
This reads through the entire file until it finds and reports the sequence with the unique ID name, so it is probably useful for only one sequence.
If you need to search for more than one ID, then consider an approach with Python or Perl where it will be easier to do a regex on an array or list of identifier names.
But, and correct me if Im wrong, you reconstruct the fasta file, modifing the IDs in conflict. That's imply that I have to rebuild the fasta index to do the bp_fetch. Im wrong?
I'm not familiar with bp_fetch, but if you want to extract a FASTA record from a large file based on the id name, that could be one approach. The output is written to a separate file called foo-ID1.fa.
I agree with Neilfws, but I'll tell you what I would do. The script bp_index.pl will read from stdin, so my suggestion would be to modify the query file with a simple one-liner to match your ID list and pipe that to bp_index.pl. Then you can use bp_fetch.pl as normal with your list of IDs. This will require you reconstructing the index but that will take much less time than writing a custom script for this one off task.
The short answer to this question is "no". You must supply the same ID to bp_fetch as was used to create the index.
The simplest solution would be to extract the original IDs from the sequences and map those to the partial IDs in the query file. Or else recreate the query file so as to contain the complete IDs.
But, and correct me if Im wrong, you reconstruct the fasta file, modifing the IDs in conflict. That's imply that I have to rebuild the fasta index to do the bp_fetch. Im wrong?
I'm not familiar with bp_fetch, but if you want to extract a FASTA record from a large file based on the id name, that could be one approach. The output is written to a separate file called
foo-ID1.fa
.+1 Nice awk command :-)