Search through thousands of fasta sequences via the description field of the header using bioperl
2
0
Entering edit mode
10.4 years ago
cjy8709 • 0

Hi everyone

I have a fasta file where the header typically looks like this:

>FBpp0071678 type=protein;loc=2R:join(18050425..18051199,18052282..18052494,18056749..18058222,18058283..18059490,18059587..18059757,18059821..18059938,18060002..18060032); ID=FBpp0071678; name=a-PB; parent=FBgn0000008,FBtr0071764; dbxref=FlyBase_Annotation_IDs:CG6741-PB,FlyBase:FBpp0071678,GB_protein:AAF46809.2,REFSEQ:NP_524641,GB_protein:AAF46809; MD5=9eb6e9e4c12ec62fdeb31cca5b0683b6; length=1329; release=r5.57; species=Dmel;
________________________________________________________________________________________________________________________________________________________________________________________________________^----------------^

The file has thousands of sequences and I need to search according to the parent field of the header (I've highlighted it in the header I'm showing). Basically I'm searching a sequence based on the FBgn number.

I'm using bioperl's Bio::DB::Fasta module but from what I understand when the db indexes my database it only recognizes the first non-space characters as the identifier and than everything after the white space is ignored.

Anyways is there any function in bioperl to search the database through the description field of the header?

Or would I have to make a perl script to change all the headers so that it fits my current specific criteria.

Thanks!

fasta-search bioperl • 2.6k views
ADD COMMENT
3
Entering edit mode
10.4 years ago
Neilfws 49k

You can use the -makeid option. From the documentation:

The -makeid option gives you a chance to modify sequence IDs during indexing. The option value should be a code reference that will take a scalar argument and return a scalar result...

If you have a sequence object, $seq, you can get the part of the header after the space (which is called the description) like this:

$seq->desc

Then you could combine that with the ID in your -makeid subroutine, for indexing.

ADD COMMENT
1
Entering edit mode
10.4 years ago
Prakki Rama ★ 2.7k

If your fasta is oneline format, then you can try searching using fgrep.

If you have all your FBgn pattern in a file and use the following.

LC_ALL=C fgrep -A 1 -f patternList fasta.fa

else, if you do not have a pattern list, but want to search the sequences with pattern parent=FBgn in their header, and want to print out, then try

LC_ALL=C fgrep -A 1 'parent=FBgn' fasta.fa
ADD COMMENT
0
Entering edit mode

Thanks for the comment but I was looking for a way to index my DB with different identifiers in bioperl.

ADD REPLY

Login before adding your answer.

Traffic: 2400 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6