Hello everyone, I am analyzing RNA-seq data from Staphylococcus aureus, and my GTF file from the NCBI RefSeq assembly looks like this:
NZ_RIYS01000060.1 RefSeq gene 3487 4413 . + . gene_id "D1G28_RS14800"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "D1G28_RS14800"; old_locus_tag "D1G28_14800";
NZ_RIYS01000060.1 Protein Homology CDS 3487 4410 . + 0 gene_id "D1G28_RS14800"; transcript_id "unassigned_transcript_4"; Ontology_term "GO:0006260"; Ontology_term "GO:0003677"; Ontology_term "GO:0003887"; Ontology_term "GO:0009360"; gbkey "CDS"; go_component "DNA polymerase III complex|0009360||IEA"; go_function "DNA binding|0003677||IEA"; go_function "DNA-directed DNA polymerase activity|0003887||IEA"; go_process "DNA replication|0006260||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_000344319.1"; locus_tag "D1G28_RS14800"; product "DNA polymerase III subunit delta' C-terminal domain-containing protein"; protein_id "WP_000344337.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1 Protein Homology start_codon 3487 3489 . + 0 gene_id "D1G28_RS14800"; transcript_id "unassigned_transcript_4"; Ontology_term "GO:0006260"; Ontology_term "GO:0003677"; Ontology_term "GO:0003887"; Ontology_term "GO:0009360"; gbkey "CDS"; go_component "DNA polymerase III complex|0009360||IEA"; go_function "DNA binding|0003677||IEA"; go_function "DNA-directed DNA polymerase activity|0003887||IEA"; go_process "DNA replication|0006260||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_000344319.1"; locus_tag "D1G28_RS14800"; product "DNA polymerase III subunit delta' C-terminal domain-containing protein"; protein_id "WP_000344337.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1 Protein Homology stop_codon 4411 4413 . + 0 gene_id "D1G28_RS14800"; transcript_id "unassigned_transcript_4"; Ontology_term "GO:0006260"; Ontology_term "GO:0003677"; Ontology_term "GO:0003887"; Ontology_term "GO:0009360"; gbkey "CDS"; go_component "DNA polymerase III complex|0009360||IEA"; go_function "DNA binding|0003677||IEA"; go_function "DNA-directed DNA polymerase activity|0003887||IEA"; go_process "DNA replication|0006260||IEA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_000344319.1"; locus_tag "D1G28_RS14800"; product "DNA polymerase III subunit delta' C-terminal domain-containing protein"; protein_id "WP_000344337.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1 RefSeq gene 4414 5217 . + . gene_id "D1G28_RS14805"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "D1G28_RS14805"; old_locus_tag "D1G28_14805";
NZ_RIYS01000060.1 Protein Homology CDS 4414 5214 . + 0 gene_id "D1G28_RS14805"; transcript_id "unassigned_transcript_5"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_499034.1"; locus_tag "D1G28_RS14805"; product "stage 0 sporulation family protein"; protein_id "WP_001134194.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1 Protein Homology start_codon 4414 4416 . + 0 gene_id "D1G28_RS14805"; transcript_id "unassigned_transcript_5"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_499034.1"; locus_tag "D1G28_RS14805"; product "stage 0 sporulation family protein"; protein_id "WP_001134194.1"; transl_table "11"; exon_number "1";
NZ_RIYS01000060.1 Protein Homology stop_codon 5215 5217 . + 0 gene_id "D1G28_RS14805"; transcript_id "unassigned_transcript_5"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_499034.1"; locus_tag "D1G28_RS14805"; product "stage 0 sporulation family protein"; protein_id "WP_001134194.1"; transl_table "11"; exon_number "1";
After mapping and using featureCounts, I obtained the results in a CSV file like this:
D1G28_RS14785,426,411,302,789,306,264
D1G28_RS14790,419,188,369,338,92,67
D1G28_RS14795,1832,1442,1643,2468,1140,1121
D1G28_RS14800,628,537,526,442,453,440
D1G28_RS14805,963,876,767,950,1257,1151
It seems that 'gene_id' is exactly the same as 'locus_tag' in my GTF file.
I have already generated a count matrix and run DESeq2. In my results file, each row is identified by a 'locus_tag'. I am trying to convert that tag to a gene ID, such as an Entrez ID or Ensembl ID (I tried biomaRt, but it could not do the conversion).
I need gene IDs to perform GO analysis. Is there any way to convert them?
Thank you very much! Your response helped me understand the problem I was having.
If you don't have a specific reason to use this WGS genome, then you should get the top genome here (the one with a green check mark): https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=1280
That will have gene names and the most complete annotation.
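If you do need to stay with the current assembly, note that the CDS rows in the GTF shown in the question already carry GO IDs in repeated Ontology_term attributes, keyed by the same locus_tag that featureCounts reports. A minimal Python sketch of pulling out a locus_tag to GO table is below. It assumes the real file is tab-separated with the attributes in column 9; the function name gtf_go_map and the two shortened demo rows are only for illustration, not part of any library.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_go_map(lines):
    """Return {locus_tag: sorted list of GO IDs} from the CDS rows of a GTF.

    Assumes tab-separated columns with the attributes in column 9.
    A locus with a CDS row but no Ontology_term gets an empty list.
    """
    go = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue  # GO terms sit on the CDS rows, not the gene rows
        locus = None
        terms = []
        for key, value in ATTR_RE.findall(cols[8]):
            if key == "locus_tag":
                locus = value
            elif key == "Ontology_term":
                terms.append(value)
        if locus is not None:
            go[locus].update(terms)
    return {locus: sorted(ids) for locus, ids in go.items()}


# Demo on two CDS rows shortened from the GTF in the question.
demo_rows = [
    "\t".join([
        "NZ_RIYS01000060.1", "Protein Homology", "CDS", "3487", "4410", ".", "+", "0",
        'gene_id "D1G28_RS14800"; Ontology_term "GO:0006260"; '
        'Ontology_term "GO:0003677"; locus_tag "D1G28_RS14800";',
    ]),
    "\t".join([
        "NZ_RIYS01000060.1", "Protein Homology", "CDS", "4414", "5214", ".", "+", "0",
        'gene_id "D1G28_RS14805"; locus_tag "D1G28_RS14805";',
    ]),
]
print(gtf_go_map(demo_rows))
# {'D1G28_RS14800': ['GO:0003677', 'GO:0006260'], 'D1G28_RS14805': []}
```

On the real file you would pass an open file handle instead of demo_rows. The resulting table can then be fed to an enrichment tool that accepts a custom gene-to-GO mapping, so no Entrez or Ensembl ID is needed for that route. Genes without any Ontology_term, such as D1G28_RS14805 here, simply end up with an empty list.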