How can I assign sequence length to an ID in a FASTA file using Perl?
8.0 years ago
SaltedPork ▴ 170

I took the sequences from a FASTA file and concatenated them to form one big sequence, which was the basis of my research. I now have a series of coordinates (inside this concatenated sequence) that I am interested in.

I want to be able to find the original IDs of the sequences that match the coordinates inside this concatenated sequence. I am currently writing a Perl script. Does anyone have any suggestions?

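#!/usr/bin/perl
use strict;
use warnings;

my $input = $ARGV[0] or die "usage: $0 <fasta>\n";
# three-argument open with a lexical filehandle
open(my $INPUT, '<', $input) or die "unable to open $input: $!";
while (my $line = <$INPUT>) {
    chomp $line;
    if ($line =~ /^[AGCT]/) {
        # sequence line: not filled in yet (length($line) gives its length)
    }
}
close $INPUT;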

Obviously my program isn't finished, but I think I will try the length function in Perl and store the results in an array.

Perl fasta • 1.9k views

Showing some example input and output would be helpful.


Input would be a standard FASTA file and a file with coordinates in two columns (start and stop). Output would be a list of the IDs that match the set of coordinates I input.

8.0 years ago

If you concatenate all the sequences into one, you lose the original ID information. What your assignment probably wants you to do is store the lengths of the original sequences and then work out which ID contains which position. If that is the case, there are many things you could try, depending on how skilful you are in Perl, but I would go for a hash of cumulative lengths, keyed by the running total and pointing at the sequence ID, built from sequences I had previously stored in an %idSeq hash. The IDs have to be processed in the same order in which the sequences were concatenated. Once you have all this information stored, you only need to loop through the requested positions and, for each one, walk the cumulative lengths in ascending order until you reach the first one that is greater than or equal to the position:

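# assumes @ids holds the IDs in concatenation order and @positions the 1-based coordinates
my (%lengthId, $totalLength);
foreach my $id (@ids) {
    $totalLength += length($idSeq{$id});
    $lengthId{$totalLength} = $id;    # cumulative end coordinate => ID
}
foreach my $pos (@positions) {
    foreach my $length (sort { $a <=> $b } keys %lengthId) {
        if ($length >= $pos) { print "$pos\t$lengthId{$length}\n"; last }
    }
}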

Thanks for replying! So I would assign the sequence to $seq and the ID to $id, and then use those inside a hash, with the hash key as $lengthId?


That would be the idea. You can either store every ID and sequence pair in a hash and then loop through it, or you can evaluate sequence lengths on the fly, checking whether any requested position falls inside each particular sequence.
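
The on-the-fly variant could look roughly like the sketch below. It is only an illustration under some assumptions: the helper name ids_for_positions is made up, positions are taken to be 1-based, and the concatenation is assumed to have followed the file order with nothing inserted between sequences. The records in the usage part are invented for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map positions in the concatenated sequence back to
# FASTA record IDs while reading the file once, without storing sequences.
sub ids_for_positions {
    my ($fh, @positions) = @_;
    my @sorted = sort { $a <=> $b } @positions;
    my %id_for;
    my $id;            # ID of the record currently being read
    my $offset = 0;    # number of bases read so far across all records
    my $next   = 0;    # index of the first position not yet assigned
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                          # tolerate Windows line endings
        if ($line =~ /^>(\S+)/) { $id = $1; next; }
        next unless defined $id && length $line;   # skip blanks and text before the first header
        $offset += length $line;
        # every pending position at or below the running total lies in this record
        while ($next < @sorted && $sorted[$next] <= $offset) {
            $id_for{ $sorted[$next++] } = $id;
        }
    }
    return \%id_for;
}

# Usage with a small in-memory FASTA (made-up records: seqA has 6 bases, seqB has 3)
my $fasta = ">seqA\nACGT\nAC\n>seqB\nGGG\n";
open(my $fh, '<', \$fasta) or die "cannot open in-memory FASTA: $!";
my $hits = ids_for_positions($fh, 1, 6, 7, 9);
close $fh;
print "$_\t$hits->{$_}\n" for sort { $a <=> $b } keys %$hits;
```

Here positions 1 and 6 come back as seqA and positions 7 and 9 as seqB. For start-stop pairs you could look up both ends the same way; a pair whose ends land in different IDs spans a boundary between the original sequences.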
