Extracting A Sequence By Position Using Perl
13.8 years ago
Jimmyk ▴ 20

Hi guys, how do I extract sequences from a FASTA file using the start and end positions from a gene prediction file? In the example below, the file with the ORF statistics is my prediction file; for the first ORF, the start position is 65 and the end position is 213. The other file is the FASTA file in which I want to look up those positions.

My prediction file looks like this:

>Seq1 [organism=S.burgodofry...
orf00001       65      213  +1     2.93
orf00002      799     2328  +1     7.09
orf00003     2331     3437  +3     6.09
orf00004     3457     4044  +1     6.15
>Seq2 [organism=S.burgodofry...
orf00001       55      317  +1     2.17
orf00002      206      610  +2     5.28
orf00003      747     2408  +3     4.85

and my FASTA file looks like this:

>Seq1 [organism=S.burgodofry]...
ACTGTAGATGACATGACCAGTACGATACAGAT...
....
........
>Seq2 [organism=.....]
ATGTCGTGACTAGTACGATCAGATCAGAT
.........................
..............
...
perl fasta sequence retrieval • 5.0k views

You don't say which fields in your gene prediction file correspond to start and end. And that isn't a Fasta file, my friend. What have you tried so far?


My bad: the file with the ORF statistics is my prediction file. For example, the start position of the first ORF is 65 and the end is 213. The other file is the FASTA file in which I want to look up those positions.

13.8 years ago
David L. ▴ 110

If you don't mind using BioPerl, you can index your fasta file with Bio::Index::Fasta or Bio::DB::Fasta. You can retrieve the sequence as a Bio::Seq object from the index and use the subseq method to extract the sequence between start and end position.

The BioPerl Tutorial has a section about Bio::Index::Fasta/Bio::DB::Fasta with sample code.
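A minimal sketch of the Bio::DB::Fasta route, assuming BioPerl is installed and that the sequence IDs are the first word of each FASTA header (Seq1, Seq2), which is the module's default. The helper name fetch_region and the script name are made up for illustration. It uses the index's seq method with 1-based, inclusive coordinates; calling subseq on the object returned by get_Seq_by_id should give the same bases.

```perl
use strict;
use warnings;

# Return the bases from $start to $end (1-based, inclusive) of sequence $id.
# $db is a Bio::DB::Fasta object; fetch_region is a hypothetical helper name.
sub fetch_region {
    my ($db, $id, $start, $end) = @_;
    my $seq = $db->seq($id, $start, $end);
    # Guard against an ID that is not in the index.
    die "No sequence '$id' in the index\n" unless defined $seq;
    return $seq;
}

# Run as: perl fetch_region.pl sequences.fasta Seq1 65 213
if (@ARGV == 4) {
    # Loaded here, so BioPerl is only needed when the script is run with arguments.
    require Bio::DB::Fasta;
    my ($fasta_file, $id, $start, $end) = @ARGV;
    # The first call builds an index file next to the FASTA file.
    my $db = Bio::DB::Fasta->new($fasta_file);
    print ">${id}_${start}_$end\n", fetch_region($db, $id, $start, $end), "\n";
}
```

The index is opened once and passed to the helper, so looping over every ORF in the prediction file does not reopen the FASTA file each time.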


The thing is, I'm new to programming, apart from some reading about Perl.


Time to put that reading into practice then :-)

13.8 years ago

I've had a similar problem before when I had to extract gene predictions from a GFF3 file. You can try to adapt the answers given in the FriendFeed thread, though the answer uses BioPython (oh, the times before BioStar...).

13.8 years ago
lexnederbragt ★ 1.3k

"Beginner's" perl way:

  1. Read the table, using the split function to get the column values into a list (do you need to adjust for the frame, or is the start position given 'in-frame'?)
  2. Put the start and stop positions in a hash (to keep things simple, you could use SeqX_orf0000Y as keys)
  3. Parse the FASTA file with Perl: see these answers
  4. Get the relevant portion of the sequence using the substr function
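The four steps could be sketched in plain Perl roughly as follows. This is only an illustration under some assumptions: the coordinates in the table are taken to be 1-based and inclusive, only forward-strand ORFs are handled (a negative frame would need a reverse complement), and no frame adjustment is made. The sub names and the script name are invented for the example, and each hash value is a small array reference holding the sequence ID, start and end.

```perl
use strict;
use warnings;

# Steps 1 and 2: read the prediction table into a hash keyed "SeqX_orfY".
sub read_predictions {
    my ($file) = @_;
    my (%orfs, $seq_id);
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;          # skip blank lines
        if ($line =~ /^>(\S+)/) {           # header line: remember the sequence ID
            $seq_id = $1;
            next;
        }
        die "ORF line before any header in $file\n" unless defined $seq_id;
        # Columns: ORF name, start, end, frame, score (frame and score unused here).
        my ($orf, $start, $end, $frame, $score) = split ' ', $line;
        $orfs{"${seq_id}_$orf"} = [ $seq_id, $start, $end ];
    }
    close $fh;
    return \%orfs;
}

# Step 3: read a FASTA file into a hash of ID => sequence, line breaks removed.
sub read_fasta {
    my ($file) = @_;
    my (%seqs, $id);
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            $id = $1;
            $seqs{$id} = '';
        }
        elsif (defined $id) {
            $line =~ s/\s+//g;              # drop stray whitespace and \r
            $seqs{$id} .= $line;
        }
    }
    close $fh;
    return \%seqs;
}

# Step 4: substr counts from 0, the table is assumed to count from 1.
sub extract_orf {
    my ($sequence, $start, $end) = @_;
    return substr($sequence, $start - 1, $end - $start + 1);
}

# Run as: perl extract_orfs.pl predictions.txt sequences.fasta
if (@ARGV == 2) {
    my ($pred_file, $fasta_file) = @ARGV;
    my $orfs = read_predictions($pred_file);
    my $seqs = read_fasta($fasta_file);
    for my $key (sort keys %$orfs) {
        my ($seq_id, $start, $end) = @{ $orfs->{$key} };
        next unless exists $seqs->{$seq_id};   # no matching FASTA record
        print ">$key\n", extract_orf($seqs->{$seq_id}, $start, $end), "\n";
    }
}
```

Reading the whole FASTA file into memory keeps the code short, which is fine for a handful of sequences; for very large files an indexed approach is the better fit.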

More complicated ways involve complex data structures, BioPerl etc.
