Is there an elegant way to deal with the version indicator in reference FASTA file
2
0
Entering edit mode
8 weeks ago
Xiaokang ▴ 80

I have a FASTA file downloaded from NCBI which is the protein sequences of human from RefSeq. And the ID looks like:

>NP_000019.2 glycogen debranching enzyme isoform 1 [Homo sapiens]
>NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]

So there is a version indicator after "." which is very annoying.

In my case, I have a protein ID list that lists some interesting proteins in my study, and I want to use the ID list to extract the protein sequences from the FASTA file that I downloaded from NCBI. But the protein ID in my list doesn't contain that version indicator, so my ID list file looks like

NP_000019
NP_000021

(just an example, and there are 15,753 IDs in my ID list file with one ID in one line)

I tried some popular tools like fasta-fetch (from MEME) and seqtk, but they all require exact match of ID, so they can't extract anything from the FASTA file with IDs containing ".1", or ".2", etc.

Is there any elegant way to fix that?

FASTA RefSeq • 440 views
ADD COMMENT
2
Entering edit mode
8 weeks ago
Michael 55k

Try seqkit grep. It allows partial matches and regular expressions. If you are concerned with variable length, search for e.g. NP_000019\.

ADD COMMENT
2
Entering edit mode

https://bioinf.shenwei.me/seqkit/usage/#sequence-id

$ echo -ne ">NP_000019.2\nactg\n" \
    | seqkit grep --id-regexp '^(\w+)\.?' -f <(echo -ne "NP_000019\nNP_000021\n") 
[INFO] 2 patterns loaded from file
>NP_000019.2
actg
ADD REPLY
1
Entering edit mode

Thank you! I used the following command to get the job done:

seqkit grep -r -f id_list.txt human_pep.fa > extracted_pep.fa
ADD REPLY
1
Entering edit mode

Using -r with simple IDs might bring some unexpected results. E.g, NP_000019 would match NP_0000192.

ADD REPLY
2
Entering edit mode
8 weeks ago
GenoMax 148k

One way:

$ more fake.fa
>NP_000019.2 glycogen debranching enzyme isoform 1 [Homo sapiens]
FAKESEQUENCE_ONE
>NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]
FAKESEQUENCE_NUMBER_TWO

# Using @Pierre's fasta linearization code 

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < fake.fa | grep NP_000021 | tr "\t" "\n" > wanted.fa 

$ more wanted.fa
>NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]
FAKESEQUENCE_NUMBER_TWO

Use a file with list of ID's with grep if you have many.

ADD COMMENT

Login before adding your answer.

Traffic: 1810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6