How do I extract Fasta Sequences based on a list of IDs?
2
3
Entering edit mode
6.3 years ago
a.rex ▴ 350

I have a fasta sequence file that looks like this:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ 
 .........

I have another file that has a subset of headers for this file:

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
.......

How can I use the header file to extract out my fasta sequences only for these headers?

Thank you in advance!

grep sequence • 9.7k views
ADD COMMENT
6
Entering edit mode
6.3 years ago
GenoMax 147k

This is one of FAQ's on biostars. You should search for additional threads but here are some to get you started.

Extracting specific IDs + sequence from multifasta
extract sequences based on ids file
Extracting specific sequences from a big fasta file using ids of the sequences to be excluded

My personal favorite is a program from Jim Kent/UCSC:

faSomeRecords

Download the linux version linked (macOS available elsewhere on that site). Add execute permissions chmod a+x faSomeRecords.

 ./faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

NOTE: You should remove > from your header list file when using it as an input in place of listFile.

ADD COMMENT
0
Entering edit mode

Dear, is there a R way to do the same thing?

ADD REPLY
0
Entering edit mode

Great handy tool for massive datasets.

ADD REPLY
1
Entering edit mode
6.3 years ago

You can use this:

while read line; do grep -A 5 "$line" fastaFile; done <listFile

-A in grep means lines-after the match.

ADD COMMENT
1
Entering edit mode

This will only work if the sequences are all 5 lines long which is not a generally safe assumption.

ADD REPLY
0
Entering edit mode

In that case you can extract fasta sequences by ids using seqkit

while read line; do less fastaFile.fasta | seqkit grep -p $line; done < listFile >>out.fasta

ADD REPLY
0
Entering edit mode

I tried, it generates the empty file. Suggestions please.

Please check my post. How to extract fasta subsequences from a multiline fasta file which has very long headers

ADD REPLY
0
Entering edit mode

Dont just use commands at random from the web. Spend some time understanding why the solution might not be working for you. You have good suggestions in the other thread. This code is not well suited to the task you describe.

ADD REPLY

Login before adding your answer.

Traffic: 2628 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6