How to grep fasta sequence using list of IDs
3.5 years ago
Kumar ▴ 170

Hi,

I have a FASTA file of sequences and a second file listing some of its IDs. I am trying to extract the sequences whose IDs appear in that list. I tried the following commands, but none of them produced any output. I would appreciate any suggestions.

Tried these commands:
$seqtk subseq test.fa test.txt 
$grep -A 1 -wFf list.txt sequences.fas > newfile2.fas
$for i in $(cut -d" " -f1- file2); do grep -o "$i" file1 | tee -a result.txt; done

Example:

File 1 (list of IDs):

>SEGI_09259
>SEGI_10011
>SEGI_06629

File 2 (FASTA sequences):
 >SEGI_07257  
    MKICGWLYHFKFSKNMQGKVVLIIGL       
 >SEGI_10011    
    MNNCCFMVMRLGGSRSTGRGLKSSEAGE
  >SEGI_06629    
    MGVGIVKSLAGFMLLLNFCMYMTVAGIAG
    MAVGIVK

Expected output:
>SEGI_10011    
 MNNCCFMVMRLGGSRSTGRGLKSSEAGE
>SEGI_06629    
MGVGIVKSLAGFMLLLNFCMYMTVAGIAG
MAVGIVK
grep FASTA Sequence • 1.7k views
3.5 years ago
GenoMax 147k

See: "How do I extract Fasta Sequences based on a list of IDs?"

My recommendation is to use faSomeRecords from Jim Kent linked in the answer above.

A seqkit-based answer (from "How can I pull out specific protein fastas from one file using information from the protein header?"), which will work for any FASTA file:

seqkit -w 0 grep -nr -f ids.txt test.fa
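If neither tool is at hand, the same ID-based filtering can be sketched in plain Python. This is only a minimal illustration, not how seqkit or faSomeRecords work internally. The function names `read_ids` and `filter_fasta` and the script name in the usage comment are made up for this example. It assumes the ID is the first whitespace-separated word of each header, and it tolerates the leading `>` in the ID list and the leading whitespace in the FASTA lines shown in the question.

```python
import sys


def read_ids(lines):
    """Collect wanted IDs, dropping a leading '>' and surrounding whitespace."""
    ids = set()
    for line in lines:
        name = line.strip().lstrip(">").strip()
        if name:
            # Assumption: the ID is the first whitespace-separated word.
            ids.add(name.split()[0])
    return ids


def filter_fasta(lines, wanted):
    """Yield the header and sequence lines of records whose ID is in `wanted`."""
    keep = False
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if text.startswith(">"):
            fields = text[1:].split()
            # A new record starts here: decide whether to keep it.
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield text


# Hypothetical usage: python extract_fasta.py ids.txt test.fa > subset.fa
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as id_handle:
        wanted_ids = read_ids(id_handle)
    with open(sys.argv[2]) as fasta_handle:
        for out_line in filter_fasta(fasta_handle, wanted_ids):
            print(out_line)
```

Because a record is kept or dropped at its header line, multi-line sequences such as SEGI_06629 in the question come out whole. Records are written in the order they appear in the FASTA file, not in the order of the ID list.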

Thank you very much for the information. It worked for me.
