Question

Extract Sequence from Blast tab limited out put

0

Entering edit mode

9.0 years ago

Eva_Maria ▴ 190

I got a blastp (-m 8) output like this

GCF_000707685.1_726     GCF_000958575.1_152     98.62   217     3   0   1   217     1   217     4e-158  444
GCF_000707684.1_726     GCF_000878815.1_1985    98.62   217     3   0   1   217     1   217     4e-158  444
GCF_000707683.1_726     GCF_000878805.1_154     98.62   217     3   0   1   217     1   217     4e-158  444

I want to extract all sequences like this

>GCF_000707685.1_726
MKQFLKITVLLLVISFKADAAKRPEVPICQPWPECWDLRVDPESVIEESLDPSRPKSEDT
IDVMDIVSSSMSFSCIRWRVAGMCVWYKWPYKISTSVKVNHYIPDYVVSAYERSGENT
>GCF_000707684.1_726
MKTFKLTILAVCCISSTVSANSTRDSLEPILENGPWYYEVGGARYVPLTRLDSSRISAGA
ISWNGNLMCSNLDPSVSMDAFMNGAKEGFINLQRNAVSTFKGVIASLPGLALQHADPGL

Is there any perl programme or awk is available to tackle this? Please help me

awk blast perl • 2.4k views

ADD COMMENT • link updated 3.6 years ago by Ram 44k • written 9.0 years ago by Eva_Maria ▴ 190

Ram · Accepted Answer · 2015-11-23

3

Entering edit mode

9.0 years ago

5heikki 11k

Parse the sequence IDs from column 2 (e.g. `awk -F '\t' '{print $2}' file > IDs) and extract them with blastdbcmd.

blastdbcmd -h
USAGE
  blastdbcmd [-h] [-help] [-db dbname] [-dbtype molecule_type]
    [-entry sequence_identifier] [-entry_batch input_file] [-pig PIG] [-info]
    [-range numbers] [-strand strand] [-mask_sequence_with mask_algo_id]
    [-out output_file] [-outfmt format] [-target_only] [-get_dups]
    [-line_length number] [-ctrl_a] [-show_blastdb_search_path]
    [-list directory] [-remove_redundant_dbs] [-recursive]
    [-list_outfmt format] [-logfile File_Name] [-version]

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by 5heikki 11k

0

Entering edit mode

sir I want all aligned sequences like fields 6 to fields 7 ( eg:1 to 217).

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Eva_Maria ▴ 190

1

Entering edit mode

blastdbcmd -help
..
-entry_batch <File_In>
   Input file for batch processing (Format: one entry per line, seq id
   followed by optional space-delimited specifier(s)
___^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   [range|strand|mask_algo_id]
___^^^^^^^^^^^^^
..

So also parse that.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by 5heikki 11k

0

Entering edit mode

I have all ready one code but its not working..

use warnings;

print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>);     # BLAST result file name
print "\n";

print "Enter Your Gene list file name:\t";
chomp($database = <STDIN>);  #  file name
print "\n";

open IN,"$blast" or die "Can not open file $blast: $!";

@ids = ();
@seq_start = ();
@seq_end = ();

while(<IN>){

@feilds = split("\t",$_);
push(@ids,$feilds[0]);
push(@seq_start,$feilds[6]);
push(@seq_end,$feilds[7]);
}
close IN;

open OUT,">Result.fasta" or die "Can not open file $database: $!";


for($i=0;$i<=$#ids;$i++){

($sequence)    = &block($ids[$i]);

($idline,$sequence) = split("\n",$sequence);

    if($seq_start[$i] <= 70){
    $pos_Start = 0;
    }
    else{
    $pos_Start = $seq_start[$i]-70;
    }

    $pos_end = $seq_end[$i]+70;
    if($pos_end >= length($sequence)){
    $pos_end = length($sequence);
    }    

$seqlen = $pos_end - $pos_Start;

$Nucleotides = substr($sequence,$pos_Start,$seqlen);

$Nucleotides =~ s/(.{1,60})/$1\n/gs;

print OUT "$idline\n";
print OUT "$Nucleotides\n";    
}
print "\nExtraction Completed...";

sub block{

$id1 =shift;
print "$id1\n";
$start = ();

open IN3,"$database" or die "Can not open file $database: $!";

$blockseq = "";
while(<IN3>){

    if (($_ =~ /^>/)&&($start)){

    last;
    }

    if (($_ !~ /^>/)&&($start)){

    chomp;
    $blockseq .= $_;
    }

    if (/^>$id1/){

    $start = $.;print "$.\n";#-1;
    $blockseq .= $_;
    }
}
close IN3;

return($blockseq);
}<STDIN>;

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Eva_Maria ▴ 190