Hi - I have a large fasta file of ~500 sequences, each of which has a known to me target position with a SNP of interest. For each entry in the fasta file, I have a separate tab delimited text file that has on each line the fasta sequence name, start position, stop position, and SNP position. The fasta sequences and the positions text file are in the same order. The dummy fasta file is:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TAGGTTTTCTTTCCCATGTCCCCTGAATAACATGGGATTCCCTGTGACTGTGGGGACCCCTGAGAGCCTGGT
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
GATAAGGAGCTCACAGCAACCCACATGAGTTGTCC
...and the dummy positions file is:
AOS-94_S25_L002_R1_001_trimmed_contig_767 5 15 10
AOS-94_S25_L002_R1_001_trimmed_contig_2199 8 19 11
AOS-94_S25_L002_R1_001_trimmed_contig_2585 4 20 18
...where I want to extract positions 5-15 of the first sequence, and the snp is at position 10. Then, extract sequence 2 at bp 8-19, where the SNP is at position 11, etc.
This is the script I wrote:
#!/usr/bin/perl
use strict;
use warnings;
#Read in the complete fasta file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
#Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
#Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;
local $/ = "\n>"; #Read by fasta record
my $seq1 = ();
open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my @fields = <POS>;
while (my $seq = <FASTA>){
chomp $seq;
my @seq = split(/\n/,$seq);
if($seq[0] =~ /^>/){
$seq1 = $seq[0];
}elsif($seq[0] =~ /[^>]/){ #matches any character except the >
$seq1 = ">".$seq[0];
}
for my $pos (@fields){
chomp $pos;
my @field = split(/\t/,$pos);
open(OUTFILE,">>$out");
print OUTFILE "$seq1";
my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1];
print OUTFILE "$subseq\n";
}
}
close FASTA;
close POS;
close OUTFILE;
It works in that I get the reduced fasta file I want. However, I want to enter these into batch primer for primer design. On each line after the fasta header, I need to have pos=(whatever tab 4 is, the position of the SNP). If I insert "print OUTFILE "$field[3]\n" after print OUTFILE "$seq[1]"
it doesn't work right. I've screwed something up in my loop structure and chomping. Obviously I am struggling with some perl basics. Does anyone have any hints? Thanks for any insight/help.
Have you started working on this script on your own? You might find better advice for scripting on stack exchange.
I've started on it, but its admittedly a mess. I was having some trouble getting it posted up correctly but it looks ok now. I may try Stackexchange, thanks for the suggestion.
see also: Extract User Defined Region From An Fasta File