Split a file of about 10K FASTA sequences and check each sequence for the CDS region
10.5 years ago

Hello All,

I am working on a database of non-AUG start codons, and I have a FASTA file that I downloaded from RefSeq with an automated Perl script. Now I want to check each sequence in that file individually for CDS regions that have alternative translation initiation, and print the accession number from the header line of every sequence that qualifies.

To check for alternative initiation we need to check the Kozak context: position -3 should be a purine (A or G) and position +4 a guanine.

E.g. AATAACTGGTTA: counting from the C, -3 is a purine and +4 is a G.

I'm new to Perl programming. Can anyone help me?

And I don't want to use BioPerl!

Thanks in advance!

parsing perl • 2.0k views
10.5 years ago
Woa ★ 2.9k

To extract the sequences from the RefSeq file, I think it is better to use a dedicated parser such as the one in BioPerl than to write something ad hoc.
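
Since the question rules out BioPerl, here is a minimal plain-Perl sketch of walking through the FASTA records one at a time. The helper names `read_fasta` and `accession_of` are made up for illustration, and the code assumes the accession is the first whitespace-delimited token of the header line; headers that pack the accession into a `|`-separated field would need an extra split.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle and
# return a list of [header, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    my ($header, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;               # strip Unix or Windows line endings
        next if $line eq '';                # skip blank lines
        if ($line =~ /^>(.*)/) {
            # A new header closes the previous record, if there was one.
            push @records, [$header, $seq] if defined $header;
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {
            $seq .= uc $line;               # join wrapped sequence lines
        }
    }
    push @records, [$header, $seq] if defined $header;
    return @records;
}

# Assumption: the accession is the first whitespace-delimited header token.
sub accession_of {
    my ($header) = @_;
    my ($acc) = split ' ', $header;
    return $acc;
}

# Demo on an in-memory "file" with two made-up records.
my $demo = ">NM_TEST1.1 made-up record\nAATAAC\nTGGTTA\n>NM_TEST2.1 another\nCCCC\n";
open my $fh, '<', \$demo or die "cannot open in-memory file: $!";
my @records = read_fasta($fh);
close $fh;

printf "%s\t%d\n", accession_of($_->[0]), length($_->[1]) for @records;
```

For a real file, replace the in-memory open with `open my $fh, '<', $path`. With about 10K records, holding everything in memory should be fine, but the loop could just as well process each record as soon as it is complete.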

You can capture all the regex matches from the sequence string with something like this:

my @all_matches= ( $sq_str =~ /[AT]CTG[AG]/gi );
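
To test the rule exactly as stated in the question (purine at -3, G at +4), a sketch along the following lines might help. It assumes the conventional numbering in which the first base of the start codon is +1 and the base just before it is -1, and it takes the candidate codon (CTG here) as a parameter. `kozak_ok` and `kozak_sites` are hypothetical names, not functions from any library.

```perl
use strict;
use warnings;

# True if the codon starting at 0-based offset $start has a purine at -3
# and a G at +4. Assumption: first codon base is +1, previous base is -1.
sub kozak_ok {
    my ($seq, $start) = @_;
    return 0 if $start < 3 || $start + 3 >= length($seq);
    my $minus3 = uc substr($seq, $start - 3, 1);
    my $plus4  = uc substr($seq, $start + 3, 1);
    return ($minus3 =~ /^[AG]$/ && $plus4 eq 'G') ? 1 : 0;
}

# Return the 0-based offsets of every $codon that sits in that context.
# The zero-width lookahead lets overlapping candidates be reported too.
sub kozak_sites {
    my ($seq, $codon) = @_;
    my @offsets;
    while ($seq =~ /(?=[AG]..\Q$codon\EG)/gi) {
        push @offsets, pos($seq) + 3;   # pos() is at -3, the codon starts 3 later
    }
    return @offsets;
}

print join(',', kozak_sites('AAGAACTGGTTA', 'CTG')), "\n";   # prints 5
```

Under this numbering, the example string AATAACTGGTTA from the question has a T at -3 relative to CTG, so it would not pass; if a different counting convention is intended, the offsets need adjusting. Note also that a plain FASTA record carries no CDS coordinates, so scanning the whole sequence reports every candidate site, not only the annotated CDS start.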