Question

extract sequences based on ids file

3

Entering edit mode

7.6 years ago

Mehmet ▴ 820

Dear all,

I have a ids list that has random order of ids and a fasta file that has the ids. I want to extract sequence of each id from the fasta file.

I tried but each time output has different order rather than the order in the ids file.

How to extract sequences based on the order in the ids file?

sequence gene genome • 6.2k views

ADD COMMENT • link updated 7.6 years ago by Alex Reynolds 36k • written 7.6 years ago by Mehmet ▴ 820

Alex Reynolds · Answer 1 · 2017-09-23

3

Entering edit mode

7.6 years ago

Alex Reynolds 36k

Via bash shell and awk:

$ while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' sequences.fa; done < patterns.txt

ADD COMMENT • link 7.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you very much. It was what I really wanted.

ADD REPLY • link 7.6 years ago by Mehmet ▴ 820

0

Entering edit mode

I Tried it by typing the same in the terminal but the result file is empty.

ADD REPLY • link 7.3 years ago by majeedaasim ▴ 60

0

Entering edit mode

That's too bad. Post your files somewhere and maybe I can take a look.

ADD REPLY • link 7.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Alex, I have a fasta file containing many sequences like

>Seq1
atgccaaagtagatacagatagac
>seq2
atattagacagatacaatagacag
>seq3
aggagatacagatacagatac
>seq4
atgacagatacagatacagatacagat
>seq5
agtagataacacagatagacagat
>seq6
agtaacagtacagatacagatacagat

I also have an ID file containing a list of sequence IDs as

seq6
seq3
seq1

NOw I want to extract these sequences from the fasta file bu in the same order. I tried different tools but they do not maintain the order of the sequences extracted. All of these extract sequences in the order as found in the fasta file as seq1 seq3 seq6. but I need in the above order. Thanks thanks

ADD REPLY • link 7.3 years ago by majeedaasim ▴ 60

1

Entering edit mode

Worked fine for me:

$ cat > sequences.fa
>Seq1
atgccaaagtagatacagatagac
>seq2
atattagacagatacaatagacag
>seq3
aggagatacagatacagatac
>seq4
atgacagatacagatacagatacagat
>seq5
agtagataacacagatagacagat
>seq6
agtaacagtacagatacagatacagat

Then:

$ cat > patterns.txt
seq6
seq3
seq1

Then:

$ while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' sequences.fa; done < patterns.txt
>seq6
agtaacagtacagatacagatacagat
>seq3
aggagatacagatacagatac

Looks like the same order, where there are matches found.

Note that Seq1 is uppercase in your sequences.fa file and lowercase seq1 in patterns.txt.

If case-sensitivity is an issue, pre-process your data to fix patterns or sequence headers. Or perhaps modify the test in awk to apply toupper() or similar on both the pattern and sequence header, before testing for a pattern match.

ADD REPLY • link 7.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Do I need to cat file even if I have all the sequences and IDs list in their corresponding files

ADD REPLY • link 7.3 years ago by majeedaasim ▴ 60

0

Entering edit mode

I just typed in terminal but I am getting nothing, where is the result file produced.

while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' my_final_seq.fa; done < my_final_Ids.txt

ADD REPLY • link 7.3 years ago by majeedaasim ▴ 60

3

Entering edit mode

Please make sure that ID names in your id file are exactly the same with those in your fasta file as Alex suggested.

while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' seq.fa; done < idfile.txt > output.fa

the output.fa file

>seq6
agtaacagtacagatacagatacagat
>seq3
aggagatacagatacagatac
>seq1
atgccaaagtagatacagatagac

ADD REPLY • link 7.3 years ago by Mehmet ▴ 820

0

Entering edit mode

Hey! I have the same issue and wanted to ask whether it is possible to change the code in order to also get results in the following case:

pattern.txt:

seq6   
seq3    
seq1

sequences.fa:

>seq6 Domain;Phylum;class;order;family;genus;species;acutal name
CAGUUAGUUGGUGGGGUAA
>seq3 Domain;Phylum;class;order;family;genus;species;acutal name
CUGGUGUAGGGGUAAAAUCCGUAG
.
.

(the " " are just used to display the format, they are not part of the file) Thanks

ADD REPLY • link updated 4.4 years ago by Alex Reynolds 36k • written 4.4 years ago by Jan.V.Henkel • 0