How to extract specific fasta files ?
4
2
Entering edit mode
3.6 years ago
sunnykevin97 ▴ 990

Hi

I had a text file with a list of sequence ID's 1....1000. (prefix - OG00*) **Text file:

OG0010960
OG0010966
OG0010968
OG0010972
OG0010979
OG0010980
OG0010981
OG0010982
OG0010983
OG0010984

In a directory, I had 66170 fasta files each fasta seq has a unique ID.

OG0066161.fa
OG0066162.fa
OG0066162.fa
OG0056185.fa
OG0056185.fa
OG0056185.fa
OG0000001.fa 
OG0000002.fa

I'm wish to extract only the fasta files of the given ID's in the text file ?

Need some help, I tried to do using grep it doesn't work out.

Thanks

Kevin

fasta DNA alignment RNA • 3.0k views
ADD COMMENT
3
Entering edit mode

Assuming that ids.txt (file with ids), fasta files and an empty directory by name 'extracted_files' exists within same directory

$ parallel --dry-run cp {}.fa extracted_files/{}.fa :::: ids.txt                                                                                                                             

cp OG0010960.fa extracted_files/OG0010960.fa
cp OG0010966.fa extracted_files/OG0010966.fa
cp OG0010968.fa extracted_files/OG0010968.fa
cp OG0010972.fa extracted_files/OG0010972.fa
cp OG0010979.fa extracted_files/OG0010979.fa
cp OG0010980.fa extracted_files/OG0010980.fa
cp OG0010981.fa extracted_files/OG0010981.fa
cp OG0010982.fa extracted_files/OG0010982.fa
cp OG0010983.fa extracted_files/OG0010983.fa
cp OG0010984.fa extracted_files/OG0010984.fa

Please remove echo after checking the dry-run output, to copy the files from existing directory to the new directory.

If you have headers in ids.txt (file with ids) and if you want to extract sequences with those headers, try seqkit:

$ seqkit grep -in -f ids.txt *.fa -o out.fa 
ADD REPLY
0
Entering edit mode

seqkit -in option I'm unable to find in the help file ?

ADD REPLY
0
Entering edit mode

I missed grep in the function. Edited the function.

ADD REPLY
2
Entering edit mode

I am occasionally amazed by people's patience on this site.

This was an insufficiently explained original post (I tried to do using grep it doesn't work out.) where better explanation could have saved everyone lots of time. Even with all the attempts to help, the OP is only responding with It doesn't work without any useful details. As in: you guys go ahead and keep feeding me ideas, and I will give you monosyllabic responses. Yet more suggestions keep coming. Kudos to you guys. I am very happy that there is no down-vote option, because these kinds of exchanges definitely tempt me.

ADD REPLY
1
Entering edit mode

Sometimes I do wish for Stack Exchange like options. This is purely a grep question, and with OP's attitude, it should have been closed way back if not for a bunch of us indulging them.

ADD REPLY
0
Entering edit mode

Show us the first few lines of your ID file and a few sample FASTA headers so we understand your problem better.

ADD REPLY
0
Entering edit mode

I edited the post.

ADD REPLY
1
Entering edit mode
3.6 years ago
sg927357 ▴ 20

(prefix - OG00*)

Try this:

cd dir
cat sequence_ID_text_file | while read i ; do cat OG00$i >>text_file.txt ; done
ADD COMMENT
0
Entering edit mode

It doesn't work.

ADD REPLY
1
Entering edit mode

cd dir; cat sequence_ID_text_file | while read i ; do cat $i.fa >>text_file.fa ; done

ADD REPLY
1
Entering edit mode
3.6 years ago
Ram 44k

Avoid loops in shell - they can have a performance overhead. Use xargs instead - it will be at least as performant as loops if not better, plus you can process multiple arguments if the situation fits AND switch over to GNU parallel with trivial changes in approach.

cat list.txt | xargs -I v_f cat v_f.fa >> result.fasta
ADD COMMENT
1
Entering edit mode

Agreed. Thanks for the suggestion.

ADD REPLY
0
Entering edit mode

No problem. Loops are more readable and easier to debug, but once we get to parallel, we can use --dry-run to debug better. I wish xargs had that option.

EDIT: Looks like xargs -p executes each xargs interactively, which is close to dry-run. https://unix.stackexchange.com/a/428799/135331

ADD REPLY
0
Entering edit mode
3.6 years ago
find DIR -type f -name "*.fa" -exec awk -f linearizefasta.awk '{}' ';'  | sort -t $'\t' -k1,1 | join -t $'\t' -1 1 -2 1 - <(sed 's/^/>/' myidlist.txt | sort -t $'\t' -k1,1 ) | tr "\t" "\n"

with linearizefasta.awk

ADD COMMENT
0
Entering edit mode

It doesn't work. How do I redirect the output.

ADD REPLY
0
Entering edit mode
3.6 years ago

From your initial question it looked like you needed to select fasta ids from inside the files, but from the latest edit it looks that the fasta ids are in fact file names. If that's the case, this simple command would write your wanted fasta ids into a single multifasta file:

for id in $(cat list.txt); do cat $id.fa 2>/dev/null; done > out.fa

If with "extract" you mean to copy/move files outside that directory, you can do it too similarly although this is not limited to fasta files:

mkdir -p out; for id in $(cat list.txt); do cp $id.fa out 2>/dev/null; done
ADD COMMENT

Login before adding your answer.

Traffic: 1982 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6