Question

Modifying perl script to not print redundant sequences and to fix length of sequences outputted from blast

0

Entering edit mode

8.3 years ago

mforthman ▴ 50

Cross-reference: http://stackoverflow.com/questions/38661509/modifying-perl-script-to-not-print-duplicates-and-extract-sequences-of-a-certain

Some solutions have been made in the cross-reference and I have updated this post to reflect those changes. Please see "Edit" sections of this post for update on issues I'm experiencing.

Hello,

I thought I should post some background first. I have a set of gene files that contain anywhere from one to five DNA sequences from different species. I used a bash shell script to perform blastn with each gene file as a query and a file of all transcriptome sequences (all_transcriptome_seq.fasta) from the five species as the subject. I now want to process these output files (and there are many) so that I can get all subject sequences that hit into one file per gene, with duplicate sequences removed (except to keep one), and ensure I'm getting the length of the sequences that actually hit the query.

Here is what the blastn output looks like for one gene file (columns: qseqid qlen sseqid slen qframe qstart qend sframe sstart send evalue bitscore pident nident length)

Acur_01000750.1_OFAS014956-RA-EXON04    248 Apil_comp17195_c0_seq1  1184    1   1   248 1   824 1072    2e-73    259    85.60   214 250
Acur_01000750.1_OFAS014956-RA-EXON04    248 Atri_comp5613_c0_seq1   1067    1   2   248 1   344 96  8e-97    337    91.16   227 249
Acur_01000750.1_OFAS014956-RA-EXON04    248 Acur_01000750.1 992 1   1   248 1   655 902 1e-133   459    100.00  248 248
Acur_01000750.1_OFAS014956-RA-EXON04    248 Btri_comp17734_c0_seq1  1001    1   1   248 1   656 905 5e-69    244    84.40   211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Atri_comp5613_c0_seq1   1067    1   2   250 1   344 96  1e-60    217    82.33   205 249
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Acur_01000750.1 992 1   1   250 1   655 902 5e-69    244    84.40   211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Btri_comp17734_c0_seq1  1001    1   1   250 1   656 905 1e-134   462    100.00  250 250

I've been working on a perl script that would, in short, take the sseqid column to pull out the corresponding sequences from the all_transcriptome_seq.fasta file, place these into a new file, and trim the transcripts to the sstart and send positions. Here is the script, so far (NOTE: THIS SCRIPT BELOW HAS BEEN EDITED BASED ON CHANGES FROM CROSS-REFERENCE SOLUTIONS):

#!/usr/bin/env perl

use warnings;
use strict;
use Data::Dumper;

############################################################################
# blastn_post-processing.pl v. 1.0 by Michael F., XXXXXX
############################################################################

my($progname) = $0;

############################################################################
# Initialize variables
############################################################################

my($jter);

my($com);
my($t1);

if ( @ARGV != 2 ) {
    print "Usage:\n  \$ $progname <infile> <transcriptomes>\n";
    print "  infile         = tab-delimited blastn text file\n";
    print "  transcriptomes = fasta file of all transcriptomes\n";
    print "exiting...\n";
    exit;
}

my($infile)=$ARGV[0];
my($transcriptomes)=$ARGV[1];

############################################################################
# Read the input file
############################################################################
print "Reading the input file... ";
open (my $INF, $infile) or die "Unable to open file";
my @data = <$INF>;
print @data;
close($INF) or die "Could not close file $infile.\n";
my($nlines) = $#data + 1;
my($inlines) = $nlines - 1;
print "$nlines blastn hits read\n\n";

############################################################################
# Conduct analyses of all pairs
############################################################################
my @temparray;
my @templine;
my($seqfname);
my %filterhash;
my $subject_length;

open ($INF, $infile) or die "Could not open file $infile for input.\n";
@temparray = <$INF>;
close($INF) or die "Could not close file $infile.\n";
$t1 = $#temparray + 1;
print "$infile\t$t1\n";
$seqfname = "$infile" . ".fasta";
if ( -e $seqfname ) {
        print "    --> $seqfname exists. overwriting\n";
        unlink($seqfname);
    }   

# iterate through the individual hits
for ($jter=0; $jter<$t1; $jter++) {
    (@templine) = split(/\s+/, $temparray[$jter]);
        if($templine[8] > $templine[9]){
                $subject_length = $templine[8] -$templine[9] +1;
        }else{
                $subject_length = $templine[9] -$templine[8] + 1;
        }
        if(exists $filterhash{$templine[2]} ){
                if($filterhash{$templine[2]} < $subject_length){
                        $filterhash{$templine[2]}= $subject_length;
                }
        }
        else{
                $filterhash{$templine[2]}= $subject_length;
        }
    }
my %printhash;
for ($jter=0; $jter<$t1; $jter++) {
    (@templine) = split(/\s+/, $temparray[$jter]);
        if($templine[8] > $templine[9]){
                $subject_length = $templine[8] -$templine[9] +1;
        }else{
                $subject_length = $templine[9] -$templine[8] +1;
        }
        if(not exists $printhash{$templine[2]})
        {
                 $printhash{$templine[2]}=1;
                if(exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length ){
                        $com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
                        # print "$com\n";
                        system("$com");
                        system("cat temp.3 >> $seqfname");
                        }
        }
    } # end for ($jter=0; $jter<$t1...

# Arguments for "extract_from_genome2"
#   // argv[1] = name of genome file
#   // argv[2] = gi number for contig
#   // argv[3] = start of subsequence
#   // argv[4] = end of subsequence
#   // argv[5] = name of output sequence

Using this script, here is the output I'm getting:

>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA

As you can see, it's pretty close to what I'm wanting. Here are the two issues I have and cannot seem to figure out how to resolve with my script. The first is that a sequence may occur more than once in the sseqid column, and with the script in its current form, it will print out duplicates of these sequences. I only need one. How can I modify my script to not duplicate sequences (i.e., how do I only retain one but remove the other duplicates)? Expected output:

>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA

The second is the script is not quite extracting the right base pairs. It's super close, off by one or two, but its not exact. For example, take the first subject hit Apil_comp17195_c0_seq1. The sstart and send values are 824 and 1072, respectively. When I go to the all_transcriptome_seq.fasta, I get

AAGATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAAC

at that base pair range, not

GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA

as outputted by my script. You will also notice that the sequence outputted by my script is slightly shorter than it should be. Does anyone know how I can fix these issues in my script?

Thanks, and sorry for the lengthy post!

Edit 1: a solution was offered that work for some of the infiles. However, some were causing the script to output fewer sequences than expected. Note for Edit 1: this issue has been largely resolved based on a solution that has been incorporated into the perl script I have provided and accounts for reverse orientation of sequences

Edit 2: There is still an occasional output lacking fewer sequences than expected, although not as many after incorporating modifications to my script from Edit 1 suggestion (i.e., accounting for reverse direction). I cannot figure out why the script would be outputting fewer sequences in these other cases. Below the infile in question. The output is lacking Btri_comp15171_c0_seq1:

Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464    Apil_comp19456_c0_seq1  3549    1   1   2464    1   761 3224    0.0 4551    100.00  2464    2464
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464    Btri_comp15171_c0_seq1  3766    1   1   2456    1   3046    591 0.0 1877    80.53   1985    2465
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457    Apil_comp19456_c0_seq1  3549    1   1   2457    1   3214    758 0.0 1879    80.54   1986    2466
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457    Atri_comp28646_c0_seq1  1403    1   1256    2454    1   1401    203 0.0  990    81.60   980 1201
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457    Btri_comp15171_c0_seq1  3766    1   1   2457    1   593 3049    0.0 4538    100.00  2457    2457

Also, sequences outputted are still off by one or two base pairs. I tried one suggestion, which was to add + 1 to the end of $subject_length = $sstart -$send and $subject_length = $send -$sstart strings, but this did not change anything with the output.

perl dna blast • 2.0k views

ADD COMMENT • link 8.3 years ago by mforthman ▴ 50