Question

Resolve RefSeq IDs to Functional Annotations

2.6 years ago

erik.burchard ▴ 30

Hello all,

I have a .csv file containing gene IDs and their corresponding RefSeq IDs (in the XP_ format), produced by the genome annotation pipeline I used to annotate a fungal genome that we sequenced. I would like to find and add functional annotation descriptions (e.g. "isocitrate/isopropylmalate dehydrogenase") for each gene. My table is in the following format:

LS400g001490.m01 XP_007679962.1
LS400g001500.m01 XP_033673675.1
LS400g001500.m02 XP_007678865.1
LS400g001510.m01 XP_024777452.1
LS400g001510.m02 XP_031885916.1
LS400g001510.m03 XP_031885916.1
LS400g001520.m01 XP_033388258.1
LS400g001530.m01 XP_035366954.1
LS400g001540.m01 XP_001806663.1
LS400g001550.m01 XP_007699016.1

I can't really use a species-specific database of the sort that you might find in biomaRt, since these accessions come from NCBI's entire fungal database. Is anyone aware of a file or tool that I could use to accomplish this?

Thanks very much!

Annotation RefSeq NCBI • 564 views

score 2 · Answer 1 · 2022-04-12

Using Entrez Direct (EDirect):

$ more id
XP_007679962.1 
XP_033673675.1  
XP_007678865.1 
XP_024777452.1

$ for i in $(cat id); do printf '%s\t' "$i"; esearch -db protein -query "$i" | esummary | xtract -pattern DocumentSummary -element Title; done
XP_007679962.1  uncharacterized protein BAUCODRAFT_77314 [Baudoinia panamericana UAMH 10762]
XP_033673675.1  uncharacterized protein M409DRAFT_16747 [Zasmidium cellare ATCC 36951]
XP_007678865.1  uncharacterized protein BAUCODRAFT_150031 [Baudoinia panamericana UAMH 10762]
XP_024777452.1  hypothetical protein M431DRAFT_78925 [Trichoderma harzianum CBS 226.95]
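
To get those titles back next to the gene IDs from the original table, an awk lookup along these lines might work. This is only a sketch under assumptions: `annotate_genes` is a hypothetical helper name, and both inputs are assumed to be tab-separated files, one holding the loop's output (accession, title) and one holding the gene table (gene ID, accession). A comma-separated table would need `-F','` for that file instead.

```shell
# Hypothetical helper: append the title for each accession to the gene table.
#   $1 = titles file, tab-separated: accession<TAB>title  (assumed layout)
#   $2 = gene table,  tab-separated: gene_id<TAB>accession (assumed layout)
annotate_genes() {
  awk -F'\t' -v OFS='\t' '
    # First file: remember the title for each accession.
    NR == FNR { title[$1] = $2; next }
    # Second file: print gene ID, accession, and the title (or NA if none was found).
    { print $1, $2, (($2 in title) ? title[$2] : "NA") }
  ' "$1" "$2"
}
```

It could then be called as `annotate_genes titles.tsv genes.tsv > annotated.tsv`, where all three file names are placeholders. Accessions shared by several gene models (such as XP_031885916.1 in the question) only need to be queried once, since the lookup is keyed on the accession.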

From NCBI:

Accession numbers that begin with the prefix XM_ (mRNA), XR_ (non-coding RNA), and XP_ (protein) are model RefSeqs produced either by NCBI’s genome annotation pipeline or copied from computationally annotated submissions to the INSDC. These RefSeq records are derived from the genome sequence and have varying levels of transcript or protein homology support.

So functional annotations are going to be few and far between for these accessions, as the example above shows.
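
To check how sparse the annotations are across the full list, one option is to count the titles that carry no functional description. A minimal sketch, assuming the retrieved titles were saved one per line in a file; `count_uninformative` is a hypothetical helper name, and the two patterns are just the ones seen in the example output:

```shell
# Hypothetical helper: count lines whose title is a placeholder rather than
# a functional description. $1 = file with one retrieved title per line.
count_uninformative() {
  # -E: extended regex, -i: ignore case, -c: print the number of matching lines
  grep -Eic 'uncharacterized protein|hypothetical protein' "$1"
}
```

Comparing that count with `wc -l` on the same file gives a rough idea of how many accessions come back with a usable description.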