Question

Use a list of IDs to extract sequences from different genomes

3.3 years ago

LZH289 • 0

Hi all, I have a text file containing a list of IDs:

   123489
   12387
   16379

and I want to extract the corresponding sequences from a FASTA file:

>Os01g3345.1 pacid=123489 polypeptide=Os01g3345.1 locus=Os01g3345 
ATTTTCGGGGAAATTTCCGGGGG
ATTGGCCTTAAA
>AT01g3345.1 pacid=123567 polypeptide=AT01g3345.1 locus=Os01g3345 
ATTTTCGGGGAAATTTCCGGGGG
ATTGGCCTTAAA

and so on.

My question is: how can I use the pacid as a query to extract the sequences that match the IDs in the text file?

I have tried something like this, but only the last match appears in the output, and I want all matching results. Could anyone help? Thanks.

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use getSeqs.pl <File with IDs> <Input Fasta> <Output Fasta>\n";
my $list_file = shift @ARGV;
my $fasta_in = shift @ARGV;
my $fasta_out = shift @ARGV;

my %sel;
open (my $lh, "<", $list_file) or die;
while (<$lh>) {
    chomp;
    $sel{$_}++;
}
close $lh;

$/ = "\n>";
open (my $ih, "<", $fasta_in) or die;
open (my $oh, ">", $fasta_out) or die;
while (<$ih>) {
    s/>//g;
    my ($id_line, @seq) = split (/\n/, $_);
    if ($id_line =~ /pacid=(\w+)/) {
        my $id = $1;
        if (defined $sel{$id}) {
            print $oh ">$id\n";
            print $oh join "\n", @seq;
            print $oh "\n";
        }
    }
}
close $ih;
close $oh;

sequence perl • 1.4k views

ADD COMMENT • link 3.3 years ago by LZH289 • 0

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
Thank you!

ADD REPLY • link 3.3 years ago by GenoMax 147k

Thank you!

ADD REPLY • link 3.3 years ago by LZH289 • 0

If the pacids are unique, you can use grep directly on the linearized FASTA:

$ awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf "%s", $0}}} END {print ""}' test.fa | grep -A 1 -wf test.txt

>Os01g3345.1 pacid=123489 polypeptide=Os01g3345.1 locus=Os01g3345 
ATTTTCGGGGAAATTTCCGGGGGATTGGCCTTAAA
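
One caveat with a plain `grep -wf` is that an ID can match anywhere on a line, not only in the `pacid=` field, and a list file saved with Windows line endings will silently fail to match. Below is a minimal sketch of a stricter variant, assuming GNU or BSD grep (for `-A`). The function name `grep_by_pacid` is made up for illustration. It prefixes each ID with `pacid=`, drops carriage returns and blank lines, and filters out the `--` separators that `grep -A` prints between groups.

```shell
# Hypothetical helper: grep_by_pacid IDS FASTA
# Prints, in linearized form, every record whose header contains pacid=<ID>.
grep_by_pacid() {
  patterns=$(mktemp)
  # Build fixed-string patterns "pacid=<ID>"; tr removes any \r, NF skips blank lines.
  tr -d '\r' < "$1" | awk 'NF { print "pacid=" $1 }' > "$patterns"
  # Linearize the FASTA (one header line, one sequence line per record),
  # keep each matching header plus the line after it, drop group separators.
  awk '/^>/ { if (NR > 1) print ""; print; next } { printf "%s", $0 } END { print "" }' "$2" |
    grep -A 1 -wFf "$patterns" | grep -v '^--$'
  rm -f "$patterns"
}
```

Usage would look like `grep_by_pacid test.txt test.fa > selected.fa`. Because of `-w`, an ID such as 12387 should not match a header carrying `pacid=123876`.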

ADD REPLY • link 3.3 years ago by cpad0112 21k

Hi, thanks for your reply. I do get a result, but only the last gene in the list shows up in the result file. Do you know why?

ADD REPLY • link 3.3 years ago by LZH289 • 0

score 0 · Answer 1 · 2021-07-27

3.3 years ago

Pierre Lindenbaum 164k

Linearize the FASTA and extract the ID at the same time, sort both lists, join them on the ID, then restore the FASTA layout:

join -t $'\t' -1 1 -2 1 \
     <(awk -F '[ =]+' '/^>/ {for(i=1;i<NF;i++){if($i=="pacid"){PAC=$(i+1);break;}} printf("%s%s\t%s\t",(N>0?"\n":""),PAC,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' jeter.fa  | sort -t $'\t' -k1,1 ) \
     <(sort -t $'\t' -k1,1 list.id.txt ) |\
cut -f 2- | tr "\t" "\n"
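
For comparison, here is a rough single-pass awk sketch that avoids the sort and join steps by loading the ID list into an array first. This is a different technique from the pipeline above, not a rewrite of it. The function name `extract_by_pacid` is hypothetical, and the sketch assumes the ID list is non-empty and that each header carries a `pacid=<ID>` field as in the question. Whitespace and carriage returns are stripped from the IDs before lookup.

```shell
# Hypothetical helper: extract_by_pacid IDS FASTA
# Prints wanted records unchanged (original header and line wrapping).
extract_by_pacid() {
  awk '
    # First file (the ID list): strip blanks, tabs and \r, then remember the ID.
    NR == FNR { gsub(/[ \t\r]/, ""); if ($0 != "") want[$0] = 1; next }
    # Header line: pull out the value after "pacid=" and look it up.
    /^>/ {
      keep = 0
      if (match($0, /pacid=[^ \t\r]+/)) {
        id = substr($0, RSTART + 6, RLENGTH - 6)   # 6 = length of "pacid="
        keep = (id in want)
      }
    }
    # Print every line (header or sequence) of a wanted record.
    keep
  ' "$1" "$2"
}
```

Something like `extract_by_pacid list.id.txt jeter.fa > selected.fa` would then write the matching records in their original order.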

ADD COMMENT • link 3.3 years ago by Pierre Lindenbaum 164k

Hi, thanks for your reply. I do get a result, but only the last gene in the list shows up in the result file. Do you know why?

ADD REPLY • link 3.3 years ago by LZH289 • 0

Check that your file is a plain ASCII file (run `file list.id.txt`), check that there is no extra whitespace in your files, etc.
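
A common reason for "only the last ID matches" is a list saved with Windows (CRLF) line endings: every line except possibly the last one then ends in an invisible carriage return, so the stored ID is really `123489\r`. That is only a guess about this particular file. A small sketch for checking and cleaning, with made-up helper names `count_cr_lines` and `clean_id_list`:

```shell
# Hypothetical helper: count lines that end in a carriage return.
count_cr_lines() {
  awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: print a cleaned copy of the ID list
# (no carriage returns, no spaces or tabs, no empty lines).
clean_id_list() {
  tr -d '\r' < "$1" | sed 's/[[:space:]]//g' | grep -v '^$'
}
```

For example, `count_cr_lines list.id.txt` printing a non-zero number would point to CRLF endings, and `clean_id_list list.id.txt > list.clean.txt` gives a file to use in place of the original list.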