Question

Split the sequences of a file containing around 10K fasta sequences and check each sequence for the CDS region

10.5 years ago

vivek.arora15693 • 0

Hello All,

I am working on a database of non-AUG start codons, and I have a FASTA file that I downloaded from RefSeq with an automated Perl script. Now I want to check each sequence in that file individually for a CDS region that has alternative translation initiation, and then print the accession number from the header line of each matching sequence.

To check for alternative initiation, we need to check the Kozak context: position -3 should be a purine (A or G) and position +4 a guanine.

e.g. AATAACTGGTTA: counting from the C, -3 is a purine and +4 is a G!

I'm new to Perl programming. Can anyone help me?

And I don't want to use BioPerl!

Thanks in advance!

parsing perl • 2.0k views

Updated 3.1 years ago by Ram 44k • written 10.5 years ago by vivek.arora15693 • 0

Ram · Answer 1 · 2014-06-10

10.5 years ago

Woa ★ 2.9k

To extract the sequences from the RefSeq file, I think it is better to use a dedicated parser such as the one in BioPerl, rather than writing something ad hoc.
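That said, for a plain multi-FASTA file a small hand-written reader is often enough if BioPerl is not an option. The following is only a minimal sketch: the sub names `read_fasta` and `accession_of` are made up for illustration, the two example records are invented, and it assumes the accession is the first whitespace-delimited token of the header line (older pipe-delimited RefSeq headers would need extra splitting).

```perl
use strict;
use warnings;

# Read a multi-FASTA file handle and return a list of [header, sequence] pairs.
# (Hypothetical helper, not part of any library.)
sub read_fasta {
    my ($fh) = @_;
    my @records;
    my ($header, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;        # strip Unix or Windows line endings
        next if $line =~ /^\s*$/;    # skip blank lines
        if ($line =~ /^>(.*)/) {
            # A new header closes the previous record, if there was one.
            push @records, [$header, $seq] if defined $header;
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {
            $line =~ s/\s+//g;       # drop stray whitespace inside the sequence
            $seq .= uc $line;        # sequences may span several lines
        }
    }
    push @records, [$header, $seq] if defined $header;
    return @records;
}

# Assumption: the accession is the first whitespace-delimited header token.
sub accession_of {
    my ($header) = @_;
    my ($acc) = $header =~ /^(\S+)/;
    return $acc;
}

# Usage with an in-memory file; the records below are invented examples.
my $example = ">NM_000001.1 hypothetical record one\nAATAAC\nTGGTTA\n"
            . ">NM_000002.1 hypothetical record two\nGGGCCC\n";
open my $fh, '<', \$example or die "Cannot open in-memory file: $!";
my @records = read_fasta($fh);
close $fh;

printf "%s\t%d\n", accession_of($_->[0]), length $_->[1] for @records;
```

For a real file, replace the in-memory handle with `open my $fh, '<', $path`. Holding around 10K records in memory at once should usually be fine, but the loop could just as well process each record as soon as the next header is seen.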

You can capture all the regex matches from the sequence string with something like this:

my @all_matches = ( $sq_str =~ /[AT]CTG[AG]/gi );
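To apply the rule exactly as stated in the question (purine at -3, G at +4), the pattern needs two unconstrained bases between the purine and the codon. Below is a rough sketch, not a definitive implementation: the sub name `kozak_ctg_starts` is hypothetical, only CTG is considered as the start codon, and it assumes the usual numbering in which the first base of the codon is +1 and the base just upstream is -1 (there is no position 0).

```perl
use strict;
use warnings;

# Return the 0-based offset of every CTG whose context has a purine at -3
# and a G at +4. (Hypothetical helper; CTG-only is an assumption.)
sub kozak_ctg_starts {
    my ($seq) = @_;
    my @starts;
    # [AG]       purine at -3
    # [ACGT]{2}  any bases at -2 and -1
    # CTG        candidate start codon at +1..+3
    # G          guanine at +4
    # The zero-width lookahead lets overlapping candidates all be reported.
    while ($seq =~ /(?=[AG][ACGT]{2}CTGG)/gi) {
        push @starts, pos($seq) + 3;   # skip the three upstream bases
    }
    return @starts;
}

# Invented example: the CTG sits at offset 6, with A at -3 and G at +4.
my @hits = kozak_ctg_starts('GCCACCCTGGCG');
print "CTG start at offset $_\n" for @hits;   # prints: CTG start at offset 6
```

Note that under this numbering the question's example AATAACTGGTTA has a T at -3, so the sketch does not report it. Other non-AUG codons could be covered by replacing `CTG` with an alternation such as `(?:CTG|GTG|TTG)`; which codons to include is a choice for the database, not something fixed by this sketch.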

Updated 4.9 years ago by Ram 44k • written 10.5 years ago by Woa ★ 2.9k