Question

Extract sequence with header from a fasta file with specific ID given in another file

1

Entering edit mode

10.3 years ago

Tanvir Ahamed ▴ 350

Hi,

I want to extract sequence with header from a fasta file with specific ID given in another file.

As for example:

File: main.fa

>ADC37925 pep:novel supercontig:GCA_000025145.2:CP001844:1841991:1848551:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATSNPNGAQAL
>EFB95474 pep:novel supercontig:GCA_000175495.1:cont1.9:99969:106529:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQSTNQALVNHNNGSIANQGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVGIFSTLIGTVLLLSNPNGAQ
>EFC04694 pep:novel supercontig:GCA_000175955.1:cont1.2:270427:276987:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQSTNQALVNHNNGSIANQATPTSVQSSTPSAQNNNHTDGNTTATETVSNANNKDVVSNNTTLNVPNKTNENGS
>EFH37336 pep:novel supercontig:GCA_000178015.1:cont1.4:98713:105273:1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQSTNQALVNHNNGSIANQVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTVLLLSNPNGAQALTTDNNVQSD

File: id.txt

ADC37925
EFC04694

Expected Outcome:

>ADC37925 pep:novel supercontig:GCA_000025145.2:CP001844:1841991:1848551:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATSNPNGAQAL
>EFC04694 pep:novel supercontig:GCA_000175955.1:cont1.2:270427:276987:-1
MNLLKKNKYSIRKYKVGIFSTLIGTVLLLSNPNGAQALTTDNNVQSDTNQATPVNSQDTNVANNRGLANSAQNTPNQSATTNQSTNQALVNHNNGSIANQATPTSVQSSTPSAQNNNHTDGNTTATETVSNANNKDVVSNNTTLNVPNKTNENGS

Tried Perl Program :

I have found and made the Perl code work for a single input (ID) [Source]

########################## script.pl
#Usage: `perl script.pl file(s) 'searchTerm [searchTerm]' [>outFile]`

use strict;
use warnings;

my $term = join '.', map "\Q$_\E", split ' ', pop;
my $found;

while (<>) {
    if (/^>/) {
        $found = /$term/i ? 1 : 0;
    }
    print if $found;
}

###########################

Now how to use this perl code by taking the searchTerm from id.txt file to search in main.fa and give the outcome as above?

Any help will be very much appreciated.

I am new in Perl.

sequence Perl FASTA • 21k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Tanvir Ahamed ▴ 350

0

Entering edit mode

You have some perl code there that should work. What you need is a tutorial on how to get started with perl, and that is going to be different based on whether you're using windows or linux. Try googling for using perl on [your OS].

ADD REPLY • link 10.3 years ago by Daniel ★ 4.0k

8

Entering edit mode

10.3 years ago

Brian Bushnell 20k

You can do that with BBTools:

filterbyname.sh in=main.fa names=id.txt out=filtered.fa include

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.3 years ago by Brian Bushnell 20k

4

Entering edit mode

10.3 years ago

Matt Shirley 10k

Not a Perl solution, but:

pip install pyfaidx
xargs faidx -d ' ' main.fa <id.txt

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.3 years ago by Matt Shirley 10k

0

Entering edit mode

See also the bioawk solution here: Filtering fasta file based on identifier

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Matt Shirley 10k

0

Entering edit mode

And while I'm at it I should mention that anyone wanting to do this with Python could use pyfaidx as a module: https://github.com/mdshw5/pyfaidx. Your Perl logic would be simplified a bit since you could query the Fasta file like a dictionary (hash in Perl) using the sequence ID as a key. The Fasta class in pyfaidx has a split_char argument for just this purpose.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Matt Shirley 10k

3

Entering edit mode

10.2 years ago

umer.zeeshan.ijaz ★ 1.8k

Assuming a FASTA file:

$ cat test.fa
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
>blah_C2
ACTTTATATATT
>blah_C3
ACTTATATATATATA
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

and an IDs file:

$ cat IDs.txt
blah_C4
blah_C5

Awk solution:

$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

Bioawk solution:

$ bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' test.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

Best Wishes,
Umer

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.2 years ago by umer.zeeshan.ijaz ★ 1.8k

0

Entering edit mode

Can you please briefly explain the meaning of this script:

$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'

Please explain how I can change this script for given files:

$ cat test.fa
Abc4S00911:28910-30669
ACTTATATATATATA
Abc4S01204:22252-24349
ACTTTATATATT
Abc4S01758:32845-35133
ACTTATATATATATA

$ cat IDs.txt
Abc4S00911:28910-30669
Abc4S01204:22252-24349

Thanks in advance.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by bansal.ankush13 • 0

0

Entering edit mode

sir this command running successfully but its not giving any result. can you please give explanation why its not printing any thing??

ADD REPLY • link 9.0 years ago by Bulbul Ahmed ▴ 20

0

Entering edit mode

10.3 years ago

GouthamAtla 12k

Read the very first chapter in BioPython tutorial/cookbook. You will get an idea.

ADD COMMENT • link 10.3 years ago by GouthamAtla 12k

Ram · Accepted Answer · 2015-02-16

3

Entering edit mode

10.2 years ago

Tanvir Ahamed ▴ 350

My Solution:

UCGS utilities

$ ./faSomeRecords main.fasta id.txt output.fa

option -exclude will output sequences not present in "main.fasta"

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.2 years ago by Tanvir Ahamed ▴ 350

0

Entering edit mode

Glad you found a solution that works for you!

ADD REPLY • link 10.2 years ago by Matt Shirley 10k

0

Entering edit mode

Thanks Matt !!

ADD REPLY • link 10.2 years ago by Tanvir Ahamed ▴ 350