Question

Extract specific sequence from FASTA file

0

Entering edit mode

5.8 years ago

ysas ▴ 10

I am trying to extract several sequences from a Fasta file using IDs partially matching with the header. I have written a script to perform it, but I only get one sequence.

grep -Fwf ID_list.txt -A1 input.fasta >> output.fa

Here is input.fasta

>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
>xxx|Issori2|100354|CE99607_9185
ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......
>xxx|Issori2|100388|CE99641_51257
ATGTCACAAGAAAAACATTGGAACTATACCAAAGATATTGTCAGGACATCGATTTCTGGTGTCTGTGC......

Here is my ID_list.txt

CE101211_3315 
CE99767_31939
CE99607_9185
CE99543_15407

Here is output.fa

>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....

Somehow I only get the sequence matched with the ID listed at the end of text file. Could you please point out how can I get all sequences matched with ID list?

Thank you for your help.

fasta • 7.5k views

ADD COMMENT • link updated 5.8 years ago by h.mon 35k • written 5.8 years ago by ysas ▴ 10

shenwei356 · Accepted Answer · 2019-11-07

code you posted works on my system as expected (mxlinux 19 - 64bit, gnu grep 3.3)

input:

    $ cat test.fa 
    >xxx|Issori2|100290|CE99543_15407
    ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
    >xxx|Issori2|100354|CE99607_9185
    ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......
    >xxx|Issori2|100388|CE99641_51257
    ATGTCACAAGAAAAACATTGGAACTATACCAAAGATATTGTCAGGACATCGATTTCTGGTGTCTGTGC......

$ cat ids.txt 
    CE101211_3315 
    CE99767_31939
    CE99607_9185
    CE99543_15407

output:

$ grep -Fwf ids.txt -A1 test.fa 
>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAATCAGTTGTCC....
>xxx|Issori2|100354|CE99607_9185
ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCTG......

with seqkit:

# EDIT: 
# Original post uses "seqkit grep -nrif" which it's PARTLY matching full FASTA
# header by regular expression and case-ignored, with patterns from file.
# For this case, it's not necessary and may produce "false positive", and also slower.
# For example, `seq_1` matches `seq_10` with `-nri`.

$ seqkit grep -i -f ids.txt test.fa 
>xxx|Issori2|100290|CE99543_15407
ATGGCTGTCAAGATTAGGAAACCACAGTACAAAGAAAGAGGCATTACTTGGGAAGATCAA
TCAGTTGTCC....
>xxx|Issori2|100354|CE99607_9185
ATGTCCCATATTGTTCGTATACCCAATGTCTTTGATCACAACTCTGACCTCCCAATACCT
G......

Please use dedicated tools for the job. for eg. seqtk, seqkit etc. @ ysas