Question

Help with perl script to subset a fasta file?

0

Entering edit mode

7.0 years ago

auninsaw ▴ 10

Hi - I have a large fasta file of ~500 sequences, each of which has a known to me target position with a SNP of interest. For each entry in the fasta file, I have a separate tab delimited text file that has on each line the fasta sequence name, start position, stop position, and SNP position. The fasta sequences and the positions text file are in the same order. The dummy fasta file is:

>AOS-94_S25_L002_R1_001_trimmed_contig_767
GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TAGGTTTTCTTTCCCATGTCCCCTGAATAACATGGGATTCCCTGTGACTGTGGGGACCCCTGAGAGCCTGGT
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
GATAAGGAGCTCACAGCAACCCACATGAGTTGTCC

...and the dummy positions file is:

AOS-94_S25_L002_R1_001_trimmed_contig_767   5   15  10
AOS-94_S25_L002_R1_001_trimmed_contig_2199  8   19  11
AOS-94_S25_L002_R1_001_trimmed_contig_2585  4   20  18

...where I want to extract positions 5-15 of the first sequence, and the snp is at position 10. Then, extract sequence 2 at bp 8-19, where the SNP is at position 11, etc.

This is the script I wrote:

#!/usr/bin/perl
use strict;
use warnings;
#Read in the complete fasta file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
#Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
#Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;
local $/ = "\n>"; #Read by fasta record
my $seq1 = (); 
open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my @fields = <POS>;
    while (my $seq = <FASTA>){
        chomp $seq;
        my @seq = split(/\n/,$seq);
            if($seq[0] =~ /^>/){
                $seq1 = $seq[0];
            }elsif($seq[0] =~ /[^>]/){ #matches any character except the >
                $seq1 = ">".$seq[0];
            }
    for my $pos (@fields){
        chomp $pos;
        my @field = split(/\t/,$pos);
    open(OUTFILE,">>$out");
    print OUTFILE "$seq1";
    my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1]; 
    print OUTFILE "$subseq\n";
    }   
}
close FASTA;
close POS;
close OUTFILE;

It works in that I get the reduced fasta file I want. However, I want to enter these into batch primer for primer design. On each line after the fasta header, I need to have pos=(whatever tab 4 is, the position of the SNP). If I insert "print OUTFILE "$field[3]\n" after print OUTFILE "$seq[1]" it doesn't work right. I've screwed something up in my loop structure and chomping. Obviously I am struggling with some perl basics. Does anyone have any hints? Thanks for any insight/help.

perl SNP fasta script • 3.1k views

ADD COMMENT • link 7.0 years ago by auninsaw ▴ 10

0

Entering edit mode

Have you started working on this script on your own? You might find better advice for scripting on stack exchange.

ADD REPLY • link 7.0 years ago by The Last Word ▴ 230

0

Entering edit mode

I've started on it, but its admittedly a mess. I was having some trouble getting it posted up correctly but it looks ok now. I may try Stackexchange, thanks for the suggestion.

ADD REPLY • link 7.0 years ago by auninsaw ▴ 10

0

Entering edit mode

see also: Extract User Defined Region From An Fasta File

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

7.0 years ago

Pierre Lindenbaum 164k

assuming there is two lines per fasta file and the delimiter is the tabulation: use join and pipe into awk

join -t $'\t' -1 1 -2 1  \
      <(cat input.fa | paste - - | cut -c 2- | sort -t $'\t' -k1,1) \
      <(sort -t $'\t' -k1,1 jeter.txt) |\
awk '{printf(">%s %s %s %s\n%s\n",$1,$3,$4,$5,substr($2,int($3),int($4)-int($3)));}'

>AOS-94_S25_L002_R1_001_trimmed_contig_2199 8 19 11
TCTTTCCCATG
>AOS-94_S25_L002_R1_001_trimmed_contig_2585 4 20 18
AAGGAGCTCACAGCAA
>AOS-94_S25_L002_R1_001_trimmed_contig_767 5 15 10
CACACTGATT

ADD COMMENT • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks, Pierre. I am not that familiar with awk, but I will try and work through what you posted and figure out how it is working.

ADD REPLY • link 7.0 years ago by auninsaw ▴ 10

1

Entering edit mode

7.0 years ago

swbarnes2 14k

Yeah, I don't think your loops are right. You are going through the entire pos file within the fasta loop, and I don't see where you are specificing that you want only the line whose name matches the fasta name..

Personally, I'd arrange it something like this:

open (POS, "pos.txt") or die "Arrg!";
while (<POS)>) {
  chomp;
  my @pos = split "\t";
  $hash{$pos[0]}{'start'} = $pos[1];
  $hash{$pos[0]}{'end'}   = $pos[2];
  $hash{$pos[0]}{'snp'}   = $pos[3];
}

Then go through your fasta, and you have easy access to the numbers you need for each particular sequence.

Also a clever way to go through fasta files...change the line separator to ">" instead of new line. ( copied this from here:http://seqanswers.com/forums/showthread.php?t=24777)

$/ = ">";
my $junk = <IN>;
while ( my $record = <IN> ) {
  chomp $record;
  my ($defLine, @seqLines) = split /\n/, $record;
  my $sequence = join('',@seqLines);
}

ADD COMMENT • link 7.0 years ago by swbarnes2 14k

0

Entering edit mode

Thanks for the suggestion, I'm going to keep working at this and will post up when I get it working.

ADD REPLY • link 7.0 years ago by auninsaw ▴ 10

score 0 · Accepted Answer · 2017-12-21

Okay, here is a well organized answer I received from stack overflow, that solves my question using Perl and may help others with a similar problem. It is similar to my original in spirit but with correct formatting, comments, and pragma. Thanks for those who offered help.

#!/usr/bin/perl
use strict;
use warnings;

print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );

print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );

print "Name of the output file: ";
chomp( my $out_file = <STDIN> );

open my $out_fh, '>', $out_file or die qq{Unable to open "out_file" for output: $!};

my @pos_records = do {
    open my $pos_, '<' , $pos_file or die qq{Unable to open "$pos_file" for input: $!};
    <$pos_>;
};
chomp @pos_records; #remove all newlines  

{
     open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};

     local $/ = "\n>"; #Reading FASTA format now

     for ( <$fasta_fh> ) {

         chomp; #Remove ">\n" from the end

         my ( $header, $seq) = split /\n/; #separate the two lines

         $header = ">$header" unless $header =~ /^>/; # Replace any chomped >


     for ( @pos_records ) {

                my ($name,$beg,$end,$pos) = split /\t/;
                 my $subseq = substr $seq, $beg-1, $end-$beg;
             my $final_SNP = $end - $pos; 

                 if($header =~ /$name/){

                   print $out_fh "$header\n";
                   print $out_fh "pos=$final_SNP\n";
                   print $out_fh "$subseq\n";
      }
    } 
  }
} #local expires here

close $out_fh or die $!;