Question

How to extract specific fasta files ?

2

Entering edit mode

3.6 years ago

sunnykevin97 ▴ 990

Hi

I had a text file with a list of sequence ID's 1....1000. (prefix - OG00*) **Text file:

OG0010960
OG0010966
OG0010968
OG0010972
OG0010979
OG0010980
OG0010981
OG0010982
OG0010983
OG0010984

In a directory, I had 66170 fasta files each fasta seq has a unique ID.

OG0066161.fa
OG0066162.fa
OG0066162.fa
OG0056185.fa
OG0056185.fa
OG0056185.fa
OG0000001.fa 
OG0000002.fa

I'm wish to extract only the fasta files of the given ID's in the text file ?

Need some help, I tried to do using grep it doesn't work out.

Thanks

Kevin

fasta DNA alignment RNA • 3.0k views

ADD COMMENT • link updated 3.6 years ago by cpad0112 21k • written 3.6 years ago by sunnykevin97 ▴ 990

3

Entering edit mode

Assuming that ids.txt (file with ids), fasta files and an empty directory by name 'extracted_files' exists within same directory

$ parallel --dry-run cp {}.fa extracted_files/{}.fa :::: ids.txt                                                                                                                             

cp OG0010960.fa extracted_files/OG0010960.fa
cp OG0010966.fa extracted_files/OG0010966.fa
cp OG0010968.fa extracted_files/OG0010968.fa
cp OG0010972.fa extracted_files/OG0010972.fa
cp OG0010979.fa extracted_files/OG0010979.fa
cp OG0010980.fa extracted_files/OG0010980.fa
cp OG0010981.fa extracted_files/OG0010981.fa
cp OG0010982.fa extracted_files/OG0010982.fa
cp OG0010983.fa extracted_files/OG0010983.fa
cp OG0010984.fa extracted_files/OG0010984.fa

Please remove echo after checking the dry-run output, to copy the files from existing directory to the new directory.

If you have headers in ids.txt (file with ids) and if you want to extract sequences with those headers, try seqkit:

$ seqkit grep -in -f ids.txt *.fa -o out.fa

ADD REPLY • link 3.6 years ago by cpad0112 21k

0

Entering edit mode

seqkit -in option I'm unable to find in the help file ?

ADD REPLY • link 3.6 years ago by sunnykevin97 ▴ 990

0

Entering edit mode

I missed grep in the function. Edited the function.

ADD REPLY • link 3.6 years ago by cpad0112 21k

2

Entering edit mode

I am occasionally amazed by people's patience on this site.

This was an insufficiently explained original post (I tried to do using grep it doesn't work out.) where better explanation could have saved everyone lots of time. Even with all the attempts to help, the OP is only responding with It doesn't work without any useful details. As in: you guys go ahead and keep feeding me ideas, and I will give you monosyllabic responses. Yet more suggestions keep coming. Kudos to you guys. I am very happy that there is no down-vote option, because these kinds of exchanges definitely tempt me.

ADD REPLY • link 3.6 years ago by Mensur Dlakic ★ 28k

1

Entering edit mode

Sometimes I do wish for Stack Exchange like options. This is purely a grep question, and with OP's attitude, it should have been closed way back if not for a bunch of us indulging them.

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

Show us the first few lines of your ID file and a few sample FASTA headers so we understand your problem better.

ADD REPLY • link 3.6 years ago by Ram 44k

0

Entering edit mode

I edited the post.

ADD REPLY • link 3.6 years ago by sunnykevin97 ▴ 990

score 1 · Answer 1 · 2021-04-07

1

Entering edit mode

3.6 years ago

sg927357 ▴ 20

(prefix - OG00*)

Try this:

cd dir
cat sequence_ID_text_file | while read i ; do cat OG00$i >>text_file.txt ; done

ADD COMMENT • link 3.6 years ago by sg927357 ▴ 20

0

Entering edit mode

It doesn't work.

ADD REPLY • link 3.6 years ago by sunnykevin97 ▴ 990

1

Entering edit mode

cd dir; cat sequence_ID_text_file | while read i ; do cat $i.fa >>text_file.fa ; done

ADD REPLY • link 3.6 years ago by sg927357 ▴ 20

score 1 · Answer 2 · 2021-04-07

1

Entering edit mode

3.6 years ago

Ram 44k

Avoid loops in shell - they can have a performance overhead. Use xargs instead - it will be at least as performant as loops if not better, plus you can process multiple arguments if the situation fits AND switch over to GNU parallel with trivial changes in approach.

cat list.txt | xargs -I v_f cat v_f.fa >> result.fasta

ADD COMMENT • link 3.6 years ago by Ram 44k

1

Entering edit mode

Agreed. Thanks for the suggestion.

ADD REPLY • link 3.6 years ago by Jorge Amigo 14k

0

Entering edit mode

No problem. Loops are more readable and easier to debug, but once we get to parallel, we can use --dry-run to debug better. I wish xargs had that option.

EDIT: Looks like xargs -p executes each xargs interactively, which is close to dry-run. https://unix.stackexchange.com/a/428799/135331

ADD REPLY • link 3.6 years ago by Ram 44k

GenoMax · Answer 3 · 2021-04-06

0

Entering edit mode

3.6 years ago

Pierre Lindenbaum 164k

find DIR -type f -name "*.fa" -exec awk -f linearizefasta.awk '{}' ';'  | sort -t $'\t' -k1,1 | join -t $'\t' -1 1 -2 1 - <(sed 's/^/>/' myidlist.txt | sort -t $'\t' -k1,1 ) | tr "\t" "\n"

with linearizefasta.awk

ADD COMMENT • link updated 3.6 years ago by GenoMax 147k • written 3.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

It doesn't work. How do I redirect the output.

ADD REPLY • link 3.6 years ago by sunnykevin97 ▴ 990

score 0 · Answer 4 · 2021-04-07

From your initial question it looked like you needed to select fasta ids from inside the files, but from the latest edit it looks that the fasta ids are in fact file names. If that's the case, this simple command would write your wanted fasta ids into a single multifasta file:

for id in $(cat list.txt); do cat $id.fa 2>/dev/null; done > out.fa

If with "extract" you mean to copy/move files outside that directory, you can do it too similarly although this is not limited to fasta files:

mkdir -p out; for id in $(cat list.txt); do cp $id.fa out 2>/dev/null; done