Extract sequence with header from a fasta file with specific ID given in another file
5
1
Entering edit mode
9.9 years ago

Hi,

I want to extract sequence with header from a fasta file with specific ID given in another file.

As for example:

File: main.fa

>ADC37925 pep:novel supercontig:GCA_000025145.2:CP001844:1841991:1848551:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATSNPNGAQAL
>EFB95474 pep:novel supercontig:GCA_000175495.1:cont1.9:99969:106529:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQSTNQALVNHNNGSIANQGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVGIFSTLIGTVLLLSNPNGAQ
>EFC04694 pep:novel supercontig:GCA_000175955.1:cont1.2:270427:276987:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQSTNQALVNHNNGSIANQATPTSVQSSTPSAQNNNHTDGNTTATETVSNANNKDVVSNNTTLNVPNKTNENGS
>EFH37336 pep:novel supercontig:GCA_000178015.1:cont1.4:98713:105273:1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQSTNQALVNHNNGSIANQVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTVLLLSNPNGAQALTTDNNVQSD

File: id.txt

ADC37925
EFC04694

Expected Outcome:

>ADC37925 pep:novel supercontig:GCA_000025145.2:CP001844:1841991:1848551:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATSNPNGAQAL
>EFC04694 pep:novel supercontig:GCA_000175955.1:cont1.2:270427:276987:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQSTNQALVNHNNGSIANQATPTSVQSSTPSAQNNNHTDGNTTATETVSNANNKDVVSNNTTLNVPNKTNENGS

Tried Perl Program :

I have found and made the Perl code work for a single input (ID) [Source]

########################## script.pl
#Usage: `perl script.pl file(s) 'searchTerm [searchTerm]' [>outFile]`

use strict;
use warnings;

my $term = join '.', map "\Q$_\E", split ' ', pop;
my $found;

while (<>) {
    if (/^>/) {
        $found = /$term/i ? 1 : 0;
    }
    print if $found;
}

###########################

Now how to use this perl code by taking the searchTerm from id.txt file to search in main.fa and give the outcome as above?

Any help will be very much appreciated.

I am new in Perl.

sequence Perl FASTA • 21k views
ADD COMMENT
0
Entering edit mode

You have some perl code there that should work. What you need is a tutorial on how to get started with perl, and that is going to be different based on whether you're using windows or linux. Try googling for using perl on [your OS].

ADD REPLY
3
Entering edit mode
9.8 years ago

My Solution:

UCGS utilities

$ ./faSomeRecords main.fasta id.txt output.fa

option -exclude will output sequences not present in "main.fasta"

ADD COMMENT
0
Entering edit mode

Glad you found a solution that works for you!

ADD REPLY
0
Entering edit mode

Thanks Matt !!

ADD REPLY
8
Entering edit mode
9.9 years ago

You can do that with BBTools:

filterbyname.sh in=main.fa names=id.txt out=filtered.fa include
ADD COMMENT
4
Entering edit mode
9.9 years ago

Not a Perl solution, but:

pip install pyfaidx
xargs faidx -d ' ' main.fa <id.txt
ADD COMMENT
0
Entering edit mode

See also the bioawk solution here: Filtering fasta file based on identifier

ADD REPLY
0
Entering edit mode

And while I'm at it I should mention that anyone wanting to do this with Python could use pyfaidx as a module: https://github.com/mdshw5/pyfaidx. Your Perl logic would be simplified a bit since you could query the Fasta file like a dictionary (hash in Perl) using the sequence ID as a key. The Fasta class in pyfaidx has a split_char argument for just this purpose.

ADD REPLY
3
Entering edit mode
9.8 years ago

Assuming a FASTA file:

$ cat test.fa
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
>blah_C2
ACTTTATATATT
>blah_C3
ACTTATATATATATA
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

and an IDs file:

$ cat IDs.txt
blah_C4
blah_C5

Awk solution:

$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

Bioawk solution:

$ bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' test.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

Best Wishes,
Umer

ADD COMMENT
0
Entering edit mode

Can you please briefly explain the meaning of this script:

$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'

Please explain how I can change this script for given files:

$ cat test.fa
Abc4S00911:28910-30669
ACTTATATATATATA
Abc4S01204:22252-24349
ACTTTATATATT
Abc4S01758:32845-35133
ACTTATATATATATA

$ cat IDs.txt
Abc4S00911:28910-30669
Abc4S01204:22252-24349

Thanks in advance.

ADD REPLY
0
Entering edit mode

sir this command running successfully but its not giving any result. can you please give explanation why its not printing any thing??

ADD REPLY
0
Entering edit mode
9.9 years ago

Read the very first chapter in BioPython tutorial/cookbook. You will get an idea.

ADD COMMENT

Login before adding your answer.

Traffic: 2530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6