I have a FASTA file downloaded from NCBI that contains the human protein sequences from RefSeq. The IDs look like this:
>NP_000019.2 glycogen debranching enzyme isoform 1 [Homo sapiens]
>NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]
So each ID carries a version suffix after the ".", which is very annoying.
In my case, I have a list of IDs for the proteins of interest in my study, and I want to use that list to extract the corresponding sequences from the FASTA file I downloaded from NCBI. But the protein IDs in my list don't contain the version suffix, so my ID list file looks like this:
NP_000019
NP_000021
(This is just an example; my actual ID list file has 15,753 IDs, one per line.)
I tried some popular tools like fasta-fetch (from MEME) and seqtk, but they all require an exact match on the ID, so they can't extract anything when the IDs in the FASTA file end in ".1", ".2", etc.
Is there any elegant way to fix that?
https://bioinf.shenwei.me/seqkit/usage/#sequence-id
Thank you! I used the following command to get the job done:
Using -r with simple IDs might bring some unexpected results. E.g., NP_000019 would match NP_0000192.
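One way to avoid that pitfall is to skip regular expressions entirely: strip the version suffix from each FASTA ID and then do an exact lookup in the ID list. Below is a minimal Python sketch of that idea, not a tested pipeline. The function names (strip_version, read_fasta, extract_records) are made up for illustration, the sequences in the demo are placeholder strings rather than real proteins, and it assumes the accession is the first whitespace-delimited token of the header.

```python
import io


def strip_version(header):
    """Return the accession from a FASTA header without a trailing '.N' version."""
    tokens = header.split(None, 1)
    if not tokens:
        return ""
    seq_id = tokens[0]
    accession, _, version = seq_id.rpartition(".")
    # Only strip when the part after the last dot is purely numeric.
    if accession and version.isdigit():
        return accession
    return seq_id


def read_fasta(handle):
    """Yield (header, sequence) pairs; header excludes the leading '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract_records(fasta_handle, wanted_ids):
    """Yield records whose version-stripped accession is exactly in wanted_ids."""
    wanted = set(wanted_ids)  # set lookup keeps this fast for ~15k IDs
    for header, seq in read_fasta(fasta_handle):
        if strip_version(header) in wanted:
            yield header, seq


# Demo with placeholder sequences (NOT real protein sequences).
demo_fasta = io.StringIO(
    ">NP_000019.2 glycogen debranching enzyme isoform 1 [Homo sapiens]\n"
    "MAAA\nCCCC\n"
    ">NP_0000192.1 hypothetical look-alike ID used only for this demo\n"
    "GGGG\n"
    ">NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]\n"
    "TTTT\n"
)
for header, seq in extract_records(demo_fasta, ["NP_000019", "NP_000021"]):
    print(">" + header)
    print(seq)
```

With real files, you would open the FASTA file and pass the lines of the ID list (stripped of whitespace) instead of the in-memory demo. Because the comparison is exact after removing the version, NP_000019 does not pick up NP_0000192.1.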