Help with perl script to subset a fasta file?
3
0
Entering edit mode
7.0 years ago
auninsaw ▴ 10

Hi - I have a large fasta file of ~500 sequences, each of which has a known to me target position with a SNP of interest. For each entry in the fasta file, I have a separate tab delimited text file that has on each line the fasta sequence name, start position, stop position, and SNP position. The fasta sequences and the positions text file are in the same order. The dummy fasta file is:

>AOS-94_S25_L002_R1_001_trimmed_contig_767
GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TAGGTTTTCTTTCCCATGTCCCCTGAATAACATGGGATTCCCTGTGACTGTGGGGACCCCTGAGAGCCTGGT
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
GATAAGGAGCTCACAGCAACCCACATGAGTTGTCC

...and the dummy positions file is:

AOS-94_S25_L002_R1_001_trimmed_contig_767   5   15  10
AOS-94_S25_L002_R1_001_trimmed_contig_2199  8   19  11
AOS-94_S25_L002_R1_001_trimmed_contig_2585  4   20  18

...where I want to extract positions 5-15 of the first sequence, and the snp is at position 10. Then, extract sequence 2 at bp 8-19, where the SNP is at position 11, etc.

This is the script I wrote:

#!/usr/bin/perl
use strict;
use warnings;
#Read in the complete fasta file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
#Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
#Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;
local $/ = "\n>"; #Read by fasta record
my $seq1 = (); 
open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my @fields = <POS>;
    while (my $seq = <FASTA>){
        chomp $seq;
        my @seq = split(/\n/,$seq);
            if($seq[0] =~ /^>/){
                $seq1 = $seq[0];
            }elsif($seq[0] =~ /[^>]/){ #matches any character except the >
                $seq1 = ">".$seq[0];
            }
    for my $pos (@fields){
        chomp $pos;
        my @field = split(/\t/,$pos);
    open(OUTFILE,">>$out");
    print OUTFILE "$seq1";
    my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1]; 
    print OUTFILE "$subseq\n";
    }   
}
close FASTA;
close POS;
close OUTFILE;

It works in that I get the reduced fasta file I want. However, I want to enter these into batch primer for primer design. On each line after the fasta header, I need to have pos=(whatever tab 4 is, the position of the SNP). If I insert "print OUTFILE "$field[3]\n" after print OUTFILE "$seq[1]" it doesn't work right. I've screwed something up in my loop structure and chomping. Obviously I am struggling with some perl basics. Does anyone have any hints? Thanks for any insight/help.

perl SNP fasta script • 3.1k views
ADD COMMENT
0
Entering edit mode

Have you started working on this script on your own? You might find better advice for scripting on stack exchange.

ADD REPLY
0
Entering edit mode

I've started on it, but its admittedly a mess. I was having some trouble getting it posted up correctly but it looks ok now. I may try Stackexchange, thanks for the suggestion.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
7.0 years ago
auninsaw ▴ 10

Okay, here is a well organized answer I received from stack overflow, that solves my question using Perl and may help others with a similar problem. It is similar to my original in spirit but with correct formatting, comments, and pragma. Thanks for those who offered help.

#!/usr/bin/perl
use strict;
use warnings;

print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );

print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );

print "Name of the output file: ";
chomp( my $out_file = <STDIN> );

open my $out_fh, '>', $out_file or die qq{Unable to open "out_file" for output: $!};

my @pos_records = do {
    open my $pos_, '<' , $pos_file or die qq{Unable to open "$pos_file" for input: $!};
    <$pos_>;
};
chomp @pos_records; #remove all newlines  

{
     open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};

     local $/ = "\n>"; #Reading FASTA format now

     for ( <$fasta_fh> ) {

         chomp; #Remove ">\n" from the end

         my ( $header, $seq) = split /\n/; #separate the two lines

         $header = ">$header" unless $header =~ /^>/; # Replace any chomped >


     for ( @pos_records ) {

                my ($name,$beg,$end,$pos) = split /\t/;
                 my $subseq = substr $seq, $beg-1, $end-$beg;
             my $final_SNP = $end - $pos; 

                 if($header =~ /$name/){

                   print $out_fh "$header\n";
                   print $out_fh "pos=$final_SNP\n";
                   print $out_fh "$subseq\n";
      }
    } 
  }
} #local expires here

close $out_fh or die $!;
ADD COMMENT
1
Entering edit mode

This way loops through all the the pos_records once for every fasta entry. That's not a terribly smart way to do this.

ADD REPLY
1
Entering edit mode
7.0 years ago

assuming there is two lines per fasta file and the delimiter is the tabulation: use join and pipe into awk

join -t $'\t' -1 1 -2 1  \
      <(cat input.fa | paste - - | cut -c 2- | sort -t $'\t' -k1,1) \
      <(sort -t $'\t' -k1,1 jeter.txt) |\
awk '{printf(">%s %s %s %s\n%s\n",$1,$3,$4,$5,substr($2,int($3),int($4)-int($3)));}'

>AOS-94_S25_L002_R1_001_trimmed_contig_2199 8 19 11
TCTTTCCCATG
>AOS-94_S25_L002_R1_001_trimmed_contig_2585 4 20 18
AAGGAGCTCACAGCAA
>AOS-94_S25_L002_R1_001_trimmed_contig_767 5 15 10
CACACTGATT
ADD COMMENT
0
Entering edit mode

Thanks, Pierre. I am not that familiar with awk, but I will try and work through what you posted and figure out how it is working.

ADD REPLY
1
Entering edit mode
7.0 years ago

Yeah, I don't think your loops are right. You are going through the entire pos file within the fasta loop, and I don't see where you are specificing that you want only the line whose name matches the fasta name..

Personally, I'd arrange it something like this:

open (POS, "pos.txt") or die "Arrg!";
while (<POS)>) {
  chomp;
  my @pos = split "\t";
  $hash{$pos[0]}{'start'} = $pos[1];
  $hash{$pos[0]}{'end'}   = $pos[2];
  $hash{$pos[0]}{'snp'}   = $pos[3];
}

Then go through your fasta, and you have easy access to the numbers you need for each particular sequence.

Also a clever way to go through fasta files...change the line separator to ">" instead of new line. ( copied this from here:http://seqanswers.com/forums/showthread.php?t=24777)

$/ = ">";
my $junk = <IN>;
while ( my $record = <IN> ) {
  chomp $record;
  my ($defLine, @seqLines) = split /\n/, $record;
  my $sequence = join('',@seqLines);
}
ADD COMMENT
0
Entering edit mode

Thanks for the suggestion, I'm going to keep working at this and will post up when I get it working.

ADD REPLY

Login before adding your answer.

Traffic: 1997 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6