Hi all, I have a list of txt file
123489
12387
16379
and i want to extract their sequence from a file
>Os01g3345.1 pacid=123489 polypeptide=Os01g3345.1 locus=Os01g3345
ATTTTCGGGGAAATTTCCGGGGG
ATTGGCCTTAAA
>AT01g3345.1 pacid=123567 polypeptide=AT01g3345.1 locus=Os01g3345
ATTTTCGGGGAAATTTCCGGGGG
ATTGGCCTTAAA
and so on.
My question is how to use pacid as a query to extract sequence match txt file?
I have tried something like this, only last match appear. But I want all matched result. Could any one help, thanks.
#!/usr/bin/perl
use strict;
use warnings;
$ARGV[2] or die "use getSeqs.pl <File with IDs> <Input Fasta> <Output Fasta>\n";
my $list_file = shift @ARGV;
my $fasta_in = shift @ARGV;
my $fasta_out = shift @ARGV;
my %sel;
open (my $lh, "<", $list_file) or die;
while (<$lh>) {
chomp;
$sel{$_}++;
}
close $lh;
$/ = "\n>";
open (my $ih, "<", $fasta_in) or die;
open (my $oh, ">", $fasta_out) or die;
while (<$ih>) {
s/>//g;
my ($id_line, @seq) = split (/\n/, $_);
if ($id_line =~ /pacid=(\w+)/) {
my $id = $1;
if (defined $sel{$id}) {
print $oh ">$id\n";
print $oh join "\n", @seq;
print $oh "\n";
}
}
}
close $ih;
close $oh;
Please use the formatting bar (especially the
code
option) to present your post better. I've done it for you this time.Thank you!
thank you!
If pacid's are unique, then you can use grep direct.
Hi, thanks for your reply. I can get result, but only last gene in the list are shown in the result file. Do you know why?