Question

How to extract locus tags for NCBI GeneIDs in soybean?

3.9 years ago

b.g.tamang ▴ 20

Hi all, I have been trying to find a way to convert NCBI GeneIDs to Glyma IDs for soybean gene annotation, but no such mapping file seems to exist. For instance, searching NCBI for GeneID:100790502 shows its locus tag as GLYMA_09G197400, which is what I want to extract, but for all 56K soybean genes. Is there a way to query all 56K NCBI GeneIDs and extract the locus tag for each? That way, I can use the Glyma IDs and annotate them with the Phytozome annotation file. Phytozome already has its genes in Glyma format, and I am not sure why NCBI does not include this information in its FASTA or GFF/GTF files.

Your insight is much appreciated.

Best,

RNASeq Soybean tag Locus ID Glyma • 2.5k views

score 1 · Answer 1 · 2021-07-10

You can use Entrez Direct (EDirect). The first column is the Entrez Gene ID and the second is the locus tag:

$ esearch -db gene -query 100790502  | esummary | xtract -pattern DocumentSummary -element Id,OtherAliases
100790502   GLYMA_09G197400

A broader query may get you most of them. Only the first 10 are shown here (remove | head -10 to get them all). Note that the OtherAliases field can list other aliases after the locus tag:

$ esearch -db gene -query GLYMA | esummary | xtract -pattern DocumentSummary -element Id,OtherAliases | head -10
547923  GLYMA_13G347600, L-1, Lx1
548076  GLYMA_13G288100
547831  GLYMA_08G341500, KTi, Ti-a, Ti-b, Tia, Tic, Tie
100788438   GLYMA_03G181700, GmPAL1.2, PAL1
547900  GLYMA_03G163500, A2B1a, glycinin
547641  GLYMA_18G023500, RLK-RHG1, rhg1-like, rhg1g, rhg1s
547931  GLYMA_06G301500, BMY1, Gm-BamyDam, Gm-BamyKza
100787872   GLYMA_02G309300, GmPAL3.1
100527427   GLYMA_10G199100, GmLb, N-2, Nodulin-2
547869  GLYMA_15G026300, L-3, LOX1.3, Lx3
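
Since OtherAliases can hold several comma-separated names, a small post-processing step may help keep only the Glyma locus tag. Below is a minimal sketch, assuming the tab-separated two-column output shown above is piped in. The function name `glyma_only` is made up for illustration, and the sample rows are copied from the output above.

```shell
# Keep the GeneID plus the first alias that looks like a Glyma locus tag.
# Input on stdin: tab-separated "GeneID<TAB>alias1, alias2, ..." lines.
# Rows without any GLYMA_ alias are dropped.
glyma_only() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = split($2, aliases, /, */)          # aliases are separated by ", "
    for (i = 1; i <= n; i++) {
      if (aliases[i] ~ /^GLYMA_/) { print $1, aliases[i]; break }
    }
  }'
}

# Example with three rows taken from the output above
printf '547923\tGLYMA_13G347600, L-1, Lx1\n548076\tGLYMA_13G288100\n100788438\tGLYMA_03G181700, GmPAL1.2, PAL1\n' | glyma_only
```

In practice the function would sit at the end of the pipeline above in place of head -10, with the result redirected to a file.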

score 1 · Answer 2 · 2021-07-10

3.9 years ago

vkkodali_ncbi ★ 3.8k

The gene_info.gz file on the Gene FTP site has this information. Since you are interested in soybean only, you can download the smaller All_Plants.gene_info.gz file instead. On a Unix command line, you can extract the soybean rows that carry a Glyma locus tag as follows:

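# 3847 is the NCBI Taxonomy ID for Glycine max; column 2 is GeneID and column 4 is LocusTag
zcat All_Plants.gene_info.gz | awk 'BEGIN{FS="\t";OFS="\t"} $1~/^#/ || ($1==3847 && $4~/GLYMA/) {print $2,$4}'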
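
Once a GeneID-to-locus-tag table exists, the remaining step is to look up your own list of GeneIDs in it. Here is a rough sketch, assuming a two-column tab-separated table (GeneID, LocusTag; columns 2 and 4 of gene_info) and a plain text file with one GeneID per line. The function name `lookup_locus_tags` and the file names are hypothetical, and the toy data reuses IDs quoted in this thread plus one made-up ID.

```shell
# lookup_locus_tags MAP IDS
#   MAP: tab-separated "GeneID<TAB>LocusTag" table
#   IDS: one GeneID per line
# Prints "GeneID<TAB>LocusTag" for every ID, in input order,
# with "NA" when the ID has no entry in MAP.
lookup_locus_tags() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { tag[$1] = $2; next }    # first file: build the lookup table
       { print $1, (($1 in tag) ? tag[$1] : "NA") }' "$1" "$2"
}

# Toy example (99999999 is a made-up ID used to show the NA case)
tmpdir=$(mktemp -d)
printf '100790502\tGLYMA_09G197400\n547923\tGLYMA_13G347600\n' > "$tmpdir/map.tsv"
printf '547923\n99999999\n100790502\n' > "$tmpdir/ids.txt"
lookup_locus_tags "$tmpdir/map.tsv" "$tmpdir/ids.txt"
rm -r "$tmpdir"
```

Reporting NA rather than silently dropping unmatched IDs makes it easier to see how many of the 56K genes have no Glyma locus tag at NCBI.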

This is great information and saved me a lot of headaches. Thanks a lot, I appreciate it. Best.

Reply • 3.9 years ago by b.g.tamang ▴ 20