Convert "Locus Tag" to a RefSeq ID or ENSEMBL ID
3 months ago
min • 0

Hello everyone, I am analyzing RNA-seq data from Staphylococcus aureus, and my GTF file from an NCBI RefSeq assembly looks like this:

NZ_RIYS01000060.1   RefSeq  gene    3487    4413    .   +   .   gene_id "D1G28_RS14800"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "D1G28_RS14800"; old_locus_tag "D1G28_14800";
NZ_RIYS01000060.1   Protein Homology    CDS 3487    4410    .   +   0   gene_id "D1G28_RS14800"; transcript_id "unassigned_transcript_4"; Ontology_term "GO:0006260"; Ontology_term "GO:0003677"; Ontology_term "GO:0003887"; Ontology_term "GO:0009360"; gbkey "CDS"; go_component "DNA polymerase III complex|0009360||IEA"; go_function "DNA binding|0003677||IEA"; go_function "DNA-directed DNA polymerase activity|0003887||IEA"; go_process "DNA replication|0006260||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_000344319.1"; locus_tag "D1G28_RS14800"; product "DNA polymerase III subunit delta' C-terminal domain-containing protein"; protein_id "WP_000344337.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1   Protein Homology    start_codon 3487    3489    .   +   0   gene_id "D1G28_RS14800"; transcript_id "unassigned_transcript_4"; Ontology_term "GO:0006260"; Ontology_term "GO:0003677"; Ontology_term "GO:0003887"; Ontology_term "GO:0009360"; gbkey "CDS"; go_component "DNA polymerase III complex|0009360||IEA"; go_function "DNA binding|0003677||IEA"; go_function "DNA-directed DNA polymerase activity|0003887||IEA"; go_process "DNA replication|0006260||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_000344319.1"; locus_tag "D1G28_RS14800"; product "DNA polymerase III subunit delta' C-terminal domain-containing protein"; protein_id "WP_000344337.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1   Protein Homology    stop_codon  4411    4413    .   +   0   gene_id "D1G28_RS14800"; transcript_id "unassigned_transcript_4"; Ontology_term "GO:0006260"; Ontology_term "GO:0003677"; Ontology_term "GO:0003887"; Ontology_term "GO:0009360"; gbkey "CDS"; go_component "DNA polymerase III complex|0009360||IEA"; go_function "DNA binding|0003677||IEA"; go_function "DNA-directed DNA polymerase activity|0003887||IEA"; go_process "DNA replication|0006260||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_000344319.1"; locus_tag "D1G28_RS14800"; product "DNA polymerase III subunit delta' C-terminal domain-containing protein"; protein_id "WP_000344337.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1   RefSeq  gene    4414    5217    .   +   .   gene_id "D1G28_RS14805"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "D1G28_RS14805"; old_locus_tag "D1G28_14805";
NZ_RIYS01000060.1   Protein Homology    CDS 4414    5214    .   +   0   gene_id "D1G28_RS14805"; transcript_id "unassigned_transcript_5"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_499034.1"; locus_tag "D1G28_RS14805"; product "stage 0 sporulation family protein"; protein_id "WP_001134194.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1   Protein Homology    start_codon 4414    4416    .   +   0   gene_id "D1G28_RS14805"; transcript_id "unassigned_transcript_5"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_499034.1"; locus_tag "D1G28_RS14805"; product "stage 0 sporulation family protein"; protein_id "WP_001134194.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1   Protein Homology    stop_codon  5215    5217    .   +   0   gene_id "D1G28_RS14805"; transcript_id "unassigned_transcript_5"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_499034.1"; locus_tag "D1G28_RS14805"; product "stage 0 sporulation family protein"; protein_id "WP_001134194.1"; transl_table "11"; exon_number "1";

After mapping the reads and counting them with featureCounts, I obtained a CSV file like this:

D1G28_RS14785,426,411,302,789,306,264
D1G28_RS14790,419,188,369,338,92,67
D1G28_RS14795,1832,1442,1643,2468,1140,1121
D1G28_RS14800,628,537,526,442,453,440
D1G28_RS14805,963,876,767,950,1257,1151

It seems that 'gene_id' is exactly the same as 'locus_tag' in my GTF file.

I have already generated a count matrix and run DESeq2. Each line of my results file is identified by a 'locus_tag'. I am trying to convert that tag to a gene ID, such as an Entrez ID or an Ensembl ID. (I tested biomaRt, and it cannot do this conversion.)

I need gene IDs to perform a Gene Ontology (GO) analysis. Is there any way to convert the locus tags?
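The CDS lines of the GTF shown above already carry a `protein_id` (a RefSeq WP_ accession), a `product` description, and, for some genes, `Ontology_term` entries. A minimal Python sketch of collecting these per `locus_tag` follows. It assumes the real file is tab-separated with nine columns, as in standard GTF. The function name `parse_cds_annotations` and the abbreviated sample records are illustrative, not part of any library.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

# Records abbreviated from the GTF in the question (assumed tab-separated;
# only the attributes used below are kept).
GTF_TEXT = "\n".join([
    'NZ_RIYS01000060.1\tRefSeq\tgene\t3487\t4413\t.\t+\t.\t'
    'gene_id "D1G28_RS14800"; locus_tag "D1G28_RS14800";',
    'NZ_RIYS01000060.1\tProtein Homology\tCDS\t3487\t4410\t.\t+\t0\t'
    'gene_id "D1G28_RS14800"; Ontology_term "GO:0006260"; '
    'Ontology_term "GO:0003677"; Ontology_term "GO:0003887"; '
    'Ontology_term "GO:0009360"; locus_tag "D1G28_RS14800"; '
    'product "DNA polymerase III subunit delta\' C-terminal '
    'domain-containing protein"; protein_id "WP_000344337.1";',
    'NZ_RIYS01000060.1\tProtein Homology\tCDS\t4414\t5214\t.\t+\t0\t'
    'gene_id "D1G28_RS14805"; locus_tag "D1G28_RS14805"; '
    'product "stage 0 sporulation family protein"; '
    'protein_id "WP_001134194.1";',
])


def parse_cds_annotations(gtf_lines):
    """Map each locus_tag to the protein_id, product and GO terms of its CDS."""
    annotations = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # only CDS records carry protein_id and Ontology_term
        attrs = {}
        for key, value in ATTR_RE.findall(fields[8]):
            attrs.setdefault(key, []).append(value)  # keys may repeat
        if "locus_tag" not in attrs:
            continue
        annotations[attrs["locus_tag"][0]] = {
            "protein_id": attrs.get("protein_id", [None])[0],
            "product": attrs.get("product", [None])[0],
            "go_terms": sorted(set(attrs.get("Ontology_term", []))),
        }
    return annotations


annotations = parse_cds_annotations(GTF_TEXT.splitlines())
for locus_tag, info in annotations.items():
    print(locus_tag, info["protein_id"], ",".join(info["go_terms"]) or "no GO terms")
```

For a real file, the same function can be called on an open file handle instead of `GTF_TEXT.splitlines()`. The resulting locus_tag-to-GO table could then serve as a custom annotation for enrichment tools that accept user-supplied gene-to-term mappings. Genes without `Ontology_term` entries, such as D1G28_RS14805 here, get an empty list.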

RNA-seq • 455 views
3 months ago
GenoMax 148k

You can use Entrez Direct to get GI numbers, which may be what you are referring to as an Entrez ID. GI numbers have been deprecated for end-user use, so they may not help you in the long run.

$ esearch -db nuccore -query "D1G28_RS14800" | efetch -format docsum | xtract -pattern DocumentSummary -element Id
1511859561
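To run the same lookup for many locus tags (for example, every row of a DESeq2 results table), the pipeline above can be wrapped in a small script. This is only a sketch: it assumes Entrez Direct is installed and on the PATH, and the helper names `build_pipeline`, `run_shell`, and `lookup_ids` are hypothetical. Each lookup returns the Id(s) of the matching nuccore record(s), exactly as the command above does.

```python
import shlex
import subprocess


def build_pipeline(locus_tag):
    """Return the Entrez Direct pipeline from the command above for one locus tag."""
    return (
        f"esearch -db nuccore -query {shlex.quote(locus_tag)}"
        " | efetch -format docsum"
        " | xtract -pattern DocumentSummary -element Id"
    )


def run_shell(command):
    """Run a shell pipeline and return its stdout.

    Assumes Entrez Direct is installed and network access is available.
    """
    result = subprocess.run(
        command, shell=True, check=True, capture_output=True, text=True
    )
    return result.stdout


def lookup_ids(locus_tags, runner=run_shell):
    """Map each locus tag to the list of Ids printed by the pipeline."""
    return {tag: runner(build_pipeline(tag)).split() for tag in locus_tags}


# Example call (performs network requests, so it is left commented out):
# ids = lookup_ids(["D1G28_RS14800", "D1G28_RS14805"])
```

The `runner` argument exists only so the lookup can be tried without network access by passing a stand-in function. For long lists, one query per tag is slow, and NCBI rate limits apply.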

These IDs have been discussed in a prior thread: Convert Gene ID To Gene Name.

Regarding "GTF file from NCBI RefSeq":

This is not a RefSeq genome. This is a whole-genome shotgun (WGS) assembly (https://www.ncbi.nlm.nih.gov/nuccore/NZ_RIYS01000060), so the annotation was likely produced by automated means, which is why there are no gene names.

There are bound to be RefSeq genomes for S. aureus that have gene names. Unfortunately, the NCBI site is currently returning a "500 server error" for the genomes page, so I can't post a direct link.

Thank you very much! Your response helped me understand the problem I was having.

If you don't have a specific reason to use this WGS genome, you should get the top genome listed here (the one with a green check mark): https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=1280

That genome will have gene names and the most complete annotation.
