Question

How do I extract Fasta Sequences based on a list of IDs?

3

Entering edit mode

6.4 years ago

a.rex ▴ 350

I have a fasta sequence file that looks like this:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ 
 .........

I have another file that has a subset of headers for this file:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
.......

How can I use the header file to extract out my fasta sequences only for these headers?

Thank you in advance!

grep sequence • 9.8k views

ADD COMMENT • link updated 2.5 years ago by Joe 21k • written 6.4 years ago by a.rex ▴ 350

score 6 · Answer 1 · 2018-07-30

6

Entering edit mode

6.4 years ago

GenoMax 148k

This is one of FAQ's on biostars. You should search for additional threads but here are some to get you started.

Extracting specific IDs + sequence from multifasta
extract sequences based on ids file
Extracting specific sequences from a big fasta file using ids of the sequences to be excluded

My personal favorite is a program from Jim Kent/UCSC:

faSomeRecords

Download the linux version linked (macOS available elsewhere on that site). Add execute permissions chmod a+x faSomeRecords.

 ./faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

NOTE: You should remove > from your header list file when using it as an input in place of listFile.

ADD COMMENT • link 6.4 years ago by GenoMax 148k

0

Entering edit mode

Dear, is there a R way to do the same thing?

ADD REPLY • link 3.9 years ago by lessismore ★ 1.4k

0

Entering edit mode

Great handy tool for massive datasets.

ADD REPLY • link 2.5 years ago by sunnykevin97 ▴ 990

score 1 · Answer 2 · 2018-07-30

1

Entering edit mode

6.4 years ago

Rashedul Islam ▴ 480

You can use this:

while read line; do grep -A 5 "$line" fastaFile; done <listFile

-A in grep means lines-after the match.

ADD COMMENT • link 6.4 years ago by Rashedul Islam ▴ 480

1

Entering edit mode

This will only work if the sequences are all 5 lines long which is not a generally safe assumption.

ADD REPLY • link 6.4 years ago by Joe 21k

0

Entering edit mode

In that case you can extract fasta sequences by ids using seqkit

while read line; do less fastaFile.fasta | seqkit grep -p $line; done < listFile >>out.fasta

ADD REPLY • link 3.9 years ago by Rashedul Islam ▴ 480

0

Entering edit mode

I tried, it generates the empty file. Suggestions please.

Please check my post. How to extract fasta subsequences from a multiline fasta file which has very long headers

ADD REPLY • link 2.5 years ago by sunnykevin97 ▴ 990

0

Entering edit mode

Dont just use commands at random from the web. Spend some time understanding why the solution might not be working for you. You have good suggestions in the other thread. This code is not well suited to the task you describe.

ADD REPLY • link 2.5 years ago by Joe 21k