Question

Is there an elegant way to deal with the version indicator in a reference FASTA file?

0


5 months ago

Xiaokang ▴ 80

I have a FASTA file downloaded from NCBI that contains the human protein sequences from RefSeq. The IDs look like this:

>NP_000019.2 glycogen debranching enzyme isoform 1 [Homo sapiens]
>NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]

So each ID carries a version indicator after the ".", which is very annoying.

I have a list of IDs for the proteins of interest in my study, and I want to use it to extract the corresponding sequences from the FASTA file that I downloaded from NCBI. But the protein IDs in my list don't contain the version indicator, so my ID list file looks like this:

NP_000019
NP_000021

(This is just an example; my ID list file contains 15,753 IDs, one per line.)

I tried some popular tools like fasta-fetch (from MEME) and seqtk, but they all require an exact ID match, so they can't extract anything from a FASTA file whose IDs end in ".1", ".2", etc.

Is there any elegant way to fix that?

FASTA RefSeq • 665 views

ADD COMMENT • link updated 5 months ago by shenwei356 8.7k • written 5 months ago by Xiaokang ▴ 80

2


5 months ago

GenoMax 150k

One way:

$ more fake.fa
>NP_000019.2 glycogen debranching enzyme isoform 1 [Homo sapiens]
FAKESEQUENCE_ONE
>NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]
FAKESEQUENCE_NUMBER_TWO

# Using @Pierre's fasta linearization code 

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < fake.fa | grep NP_000021 | tr "\t" "\n" > wanted.fa 

$ more wanted.fa
>NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]
FAKESEQUENCE_NUMBER_TWO

If you have many IDs, put them in a file (one per line) and pass that file to grep with -f.
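One caveat with the grep approach is that the pattern can match anywhere on the linearized line, not just in the ID. A minimal sketch of an alternative, assuming the ID is the first word of each header and the version is a trailing ".<digits>": awk loads the ID list into a lookup table and compares each header's version-stripped ID against it exactly. The file names id_list.txt and wanted.fa and the toy data are only for illustration.

```shell
# Toy inputs (hypothetical file names): a two-record FASTA and an unversioned ID list.
cat > fake.fa <<'EOF'
>NP_000019.2 glycogen debranching enzyme isoform 1 [Homo sapiens]
FAKESEQUENCE_ONE
>NP_000021.1 alanine--glyoxylate aminotransferase [Homo sapiens]
FAKESEQUENCE_NUMBER_TWO
EOF
printf 'NP_000021\n' > id_list.txt

# First file: store every ID as a key of the "wanted" table.
# Second file: on each header, take the first word, drop the ">" and the
# trailing ".<digits>" version, and remember whether the record is wanted.
# The bare "keep" pattern prints header and sequence lines of wanted records.
awk 'NR == FNR { wanted[$1]; next }
     /^>/      { id = substr($1, 2); sub(/\.[0-9]+$/, "", id); keep = (id in wanted) }
     keep' id_list.txt fake.fa > wanted.fa

cat wanted.fa
```

Because the comparison is an exact table lookup rather than a substring or regex match, this also works unchanged on multi-line sequences and does not need the linearization step.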

ADD COMMENT • link 5 months ago by GenoMax 150k

score 2 · Accepted Answer · 2024-10-28

2


5 months ago

Michael 55k

Try seqkit grep. It allows partial matches and regular expressions. If you are concerned about IDs of varying length, search for the ID followed by an escaped dot, e.g. 'NP_000019\.' (the trailing "\." is part of the pattern).
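To illustrate why the dot needs escaping, here is a small sketch using plain grep on header lines rather than seqkit. The second header, NP_0000192.1, is a made-up accession used only as a decoy, and headers.txt is a hypothetical file name.

```shell
# Two toy headers; the second one is an invented, longer accession.
printf '>NP_000019.2 real\n>NP_0000192.1 decoy\n' > headers.txt

# Unescaped ".": matches any single character, so the decoy is counted too.
loose=$(grep -c '^>NP_000019.' headers.txt)

# Escaped "\.": only a literal dot right after the accession matches.
strict=$(grep -c '^>NP_000019\.' headers.txt)

echo "loose=$loose strict=$strict"   # prints: loose=2 strict=1
```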

ADD COMMENT • link 5 months ago by Michael 55k

2


https://bioinf.shenwei.me/seqkit/usage/#sequence-id

$ echo -ne ">NP_000019.2\nactg\n" \
    | seqkit grep --id-regexp '^(\w+)\.?' -f <(echo -ne "NP_000019\nNP_000021\n") 
[INFO] 2 patterns loaded from file
>NP_000019.2
actg

ADD REPLY • link 5 months ago by shenwei356 8.7k

1


Thank you! I used the following command to get the job done:

seqkit grep -r -f id_list.txt human_pep.fa > extracted_pep.fa

ADD REPLY • link 5 months ago by Xiaokang ▴ 80

1


Using -r with bare IDs might give unexpected results. E.g., NP_000019 would also match NP_0000192.

ADD REPLY • link 5 months ago by shenwei356 8.7k
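One way to guard against that prefix problem while still using regular expressions is to anchor every ID and require a literal dot after it. The sketch below builds such a pattern file with sed and checks it with plain grep on toy header lines. The file names id_patterns.txt and headers.txt are hypothetical, and NP_0000192.1 is an invented decoy. The same pattern file should in principle also work with seqkit grep -r -f, but that is an untested assumption here.

```shell
# Toy inputs (hypothetical names); the middle header is a made-up decoy.
printf 'NP_000019\nNP_000021\n' > id_list.txt
printf '>NP_000019.2 a\n>NP_0000192.1 decoy\n>NP_000021.1 b\n' > headers.txt

# Turn each bare ID into an anchored pattern: start of ID, then a literal dot.
# NP_000019 becomes ^NP_000019\.
sed 's/.*/^&\\./' id_list.txt > id_patterns.txt
cat id_patterns.txt

# Check the patterns against the headers with the leading ">" removed;
# the decoy line is not reported.
sed 's/^>//' headers.txt | grep -f id_patterns.txt
```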