Question

Extracting A Sequence By Position Using Perl

13.8 years ago

Jimmyk ▴ 20

Hi guys, how do I extract sequences from a FASTA file using the start and end positions taken from a gene prediction file? In the example below, the file with the ORF statistics is my prediction file: the start position of the first ORF is 65 and the end is 213. The FASTA file is the one I want to pull those positions from.

My prediction file looks like this:

>Seq1 [organism=S.burgodofry...
orf00001       65      213  +1     2.93
orf00002      799     2328  +1     7.09
orf00003     2331     3437  +3     6.09
orf00004     3457     4044  +1     6.15
>Seq2 [organism=S.burgodofry...
orf00001       55      317  +1     2.17
orf00002      206      610  +2     5.28
orf00003      747     2408  +3     4.85

and my FASTA sequence file looks like this:

>Seq1 [organism=S.burgodofry]...
ACTGTAGATGACATGACCAGTACGATACAGAT...
....
........
>Seq2 [organism=.....]
ATGTCGTGACTAGTACGATCAGATCAGAT
.........................
..............
...

perl fasta sequence retrieval • 5.0k views

ADD COMMENT • link updated 13.7 years ago by lexnederbragt ★ 1.3k • written 13.8 years ago by Jimmyk ▴ 20

You don't say which fields in your gene prediction file correspond to start and end. And that isn't a Fasta file, my friend. What have you tried so far?

ADD REPLY • link 13.8 years ago by biobot 0.0.77.a.1099 6.2k

My bad: the file with the ORF statistics is my prediction file. For example, the start position of the first ORF is 65 and the end is 213. The FASTA file I'm going to search for those positions is the other one.

ADD REPLY • link 13.8 years ago by Jimmyk ▴ 20

score 3 · Answer 1 · 2011-02-21

13.8 years ago

David L. ▴ 110

If you don't mind using BioPerl, you can index your FASTA file with Bio::Index::Fasta or Bio::DB::Fasta. You can then retrieve a sequence from the index as a sequence object and use its subseq method to extract the region between the start and end positions.

The BioPerl Tutorial has a section about Bio::Index::Fasta/Bio::DB::Fasta with sample code.
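A minimal sketch of the Bio::DB::Fasta route, assuming BioPerl is installed. The toy FASTA content, the temporary file name and the coordinates 5..13 are made up for illustration; real coordinates would come from the prediction file. It assumes 1-based, inclusive positions and that the sequence ID is the first word of the header line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Bio::DB::Fasta;    # part of BioPerl, not core Perl

# Hypothetical toy FASTA file, written to a temp directory so the sketch
# can run on its own; replace with the path to your real file.
my $dir   = tempdir(CLEANUP => 1);
my $fasta = "$dir/genome.fasta";
open my $out, '>', $fasta or die "Cannot write $fasta: $!";
print {$out} ">Seq1 [organism=example]\nACTGTAGATGACATGACCAGTACGATACAGAT\n";
print {$out} ">Seq2 [organism=example]\nATGTCGTGACTAGTACGATCAGATCAGAT\n";
close $out;

# Creating the object builds an index file next to the FASTA file.
my $db = Bio::DB::Fasta->new($fasta);

# Fetch bases 5..13 of Seq1 (made-up coordinates, 1-based and inclusive).
my $orf = $db->seq('Seq1', 5, 13);
print ">Seq1_5_13\n$orf\n";
```

The same region should also be reachable through the object interface, `$db->get_Seq_by_id('Seq1')->subseq(5, 13)`, which is closer to what the answer describes.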

The thing is, I'm new to programming, apart from some reading about Perl.

ADD REPLY • link 13.8 years ago by Jimmyk ▴ 20

Time to put that reading into practice then :-)

ADD REPLY • link 13.8 years ago by Neilfws 49k

score 2 · Answer 2 · 2011-02-21

13.8 years ago

Michael Kuhn 5.0k

I've had a similar problem before when I had to extract gene predictions from a GFF3 file. You can try to adapt the answers given in the FriendFeed thread, though the answer uses BioPython (oh, the times before BioStar...).

Ram · Answer 3 · 2011-02-22

"Beginner's" Perl way:

Read the table line by line, using the split function to get the column values as a list (do you need to adjust for the frame, or is the start position already given 'in-frame'?)
Put the start and stop positions in a hash (to keep things simple, you could use SeqX_orf0000Y as keys)
Parse the FASTA file with Perl: see these answers
Get the relevant portion of the sequence using the substr function
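The steps above could be sketched in core Perl roughly as follows. This is only an illustration under stated assumptions: the helper names (read_fasta, read_orfs, extract) and the toy data are made up, columns 2 and 3 of the table are taken as 1-based inclusive start and end, the positions are used as given with no frame adjustment, and only forward-strand ORFs are handled. The data is read from in-memory strings so the sketch runs on its own; with real files you would open the file paths instead.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Parse FASTA into a hash: first word of the header => full sequence.
sub read_fasta {
    my ($fh) = @_;
    my (%seq, $id);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;
        if ($line =~ /^>(\S+)/) { $id = $1; $seq{$id} = ''; }
        elsif (defined $id)     { $line =~ s/\s+//g; $seq{$id} .= $line; }
    }
    return \%seq;
}

# Parse the prediction table into a hash: "SeqX_orfY" => [seq id, start, end].
sub read_orfs {
    my ($fh) = @_;
    my (%orf, $id);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;
        if ($line =~ /^>(\S+)/) { $id = $1; next; }
        next unless defined $id;
        my ($name, $start, $end) = split ' ', $line;    # whitespace-separated columns
        $orf{"${id}_$name"} = [$id, $start, $end];
    }
    return \%orf;
}

# substr is 0-based and takes a length, so convert from 1-based inclusive.
sub extract {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Hypothetical toy data standing in for the two files.
my $fasta_text = <<'END';
>Seq1 [organism=example]
ACTGTAGATGACATG
ACCAGTACGATACAGAT
>Seq2 [organism=example]
ATGTCGTGACTAGTACGATCAGATCAGAT
END
my $orf_text = <<'END';
>Seq1 [organism=example]
orf00001        5       13  +2     2.93
>Seq2 [organism=example]
orf00001        1        9  +1     2.17
END

open my $fa_fh,  '<', \$fasta_text or die "Cannot read FASTA data: $!";
open my $orf_fh, '<', \$orf_text   or die "Cannot read ORF data: $!";
my $seqs = read_fasta($fa_fh);
my $orfs = read_orfs($orf_fh);

# Print each ORF as a FASTA record named after its hash key.
for my $key (sort keys %$orfs) {
    my ($id, $start, $end) = @{ $orfs->{$key} };
    die "No sequence found for $id\n" unless exists $seqs->{$id};
    print ">$key $start-$end\n", extract($seqs->{$id}, $start, $end), "\n";
}
```

Joining wrapped sequence lines before calling substr matters: the coordinates refer to the whole sequence, not to individual lines of the file.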

More complicated ways involve complex data structures, BioPerl etc.