Converting gene_ids from GTF file to common gene symbols
1
0
Entering edit mode
1 day ago
Gus • 0

Hi All,

Long time reader, first-time posting. I have been doing RNA-seq analysis for some time now on non-model species. My pipeline was:

  1. Trim and filter reads with bbduk and trimmomatic
  2. Align to genome with HISAT2
  3. Generate counts with featureCounts
  4. Export counts table and load into R for DEG analysis in EdgeR

I currently have gene count tables that have 1 row for each gene id as it appears in the GTF file (taken from NCBI). The problem is that for one species, Chaenocephalus aceratus, all the gene IDs are really locus tags, and products are listed as hypothetical proteins. I would like to be able to compare expression across all species for common genes. But to do this I need to convert these species-specific gene IDs to more common gene symbols.

I also have a mapping file which contains the protein accession IDs. But this species isn't found on common solutions like David, Entrez, Ensemble, etc.

I also have created a fasta file which contains the genome sequences for all the CDS regions in the GTF file, organized by gene id (locus tags). But I'm unsure how I might use these to identify the genes / create a mapping list for gene ontology analysis, etc.

Any help would be greatly appreciated! I'm posting from my phone, but can provide more specific code snippets / examples as needed.

Thanks!

gene_id GTF symbol • 279 views
ADD COMMENT
2
Entering edit mode

But to do this I need to convert these species-specific gene IDs to more common gene symbols.

Probably not what you want to hear but there are no shortcuts to proper annotation so this is not something you can breeze through.

What you may want to do is to complete your DE analysis with ID's you have. Then take the list of locus ID's that are of interest and then spend some time doing manual annotation (including blast, sequence alignments etc) to see if you can assign gene ID's.

ADD REPLY
0
Entering edit mode

Not trying to breeze through by any means; I've been working on it for quite a while already! But I appreciate the tips. I don't disagree about no shortcuts to proper annotation, but that's a bit outside my current wheelhouse.

Complicating the matter is that these genes have been well-annotated by other researchers, and I have their gene descriptions file as an excel file. The issue is that they used different gene ids than what's made available on NCBI. For instance, gene_id "KUCAC02_015422" (from NCBI .gtf) matches with some other gene_id from their annotation .gff3 file, likely gene_id "CAt00007278", which is PiggyBac transposable element-derived protein. When I blast just the protein accession id for this gene, I see the same thing. The NCBI .gtf file has an attribute "orig_transcript_id" with features like "gnl|WGS:JAMFTG|CAt00007278.1". The last string contains the mapping information to the other annotation file, so there is probably a way I can map the gene names from one file to the other.

What I'm confused by is why, if the annotation has already been completed, is this information not in the NCBI .gtf? Frustrating! And I received the annotation information from those researchers after I ran my pipeline with the file from NCBI, so all my gene names are of the format KUCAC02_015422, making it difficult to map. Worst case, I suppose I can re-run the pipeline on this species to get the other gene names, then complete the DEG analysis, then assign gene_ids from the excel file I was given with the annotation info.

ADD REPLY
1
Entering edit mode

The NCBI .gtf file has an attribute "orig_transcript_id" with features like "gnl|WGS:JAMFTG|CAt00007278.1". The last string contains the mapping information to the other annotation file, so there is probably a way I can map the gene names from one file to the other.

Well there you have it. You can parse that information from the orig_transcript_id - you can pull that out of a gff3 with

cat my.gff3 | grep orig_transcript_id | sed 's/.*gene_id=\([^;]\+\).*orig_transcript_id=\([^;]\+\).*/\1\t\2/g' | tr '|' '\t'
ADD REPLY
0
Entering edit mode

I'll give that a shot. Thanks!

ADD REPLY
0
Entering edit mode

Well, I've successfully mapped my NCBI gene_ids to the other file's gene_ids and annotation info, but I still am lacking common gene symbols to be able to compare expression with other species. I tried doing a search of the descriptions as done in another post (Conversion of full gene description to gene symbols), but I get dozens if not hundreds of hits for each description in many cases, often redundant but sometimes very different. And in other cases, no hits at all...

For example:

        while read i; do 
        echo $i; 
        esearch -db gene -query "$i [TITLE] AND Actinopterygii [ORGN]" < /dev/null | esummary | grep -w "Name"; 
    done < ./description_list_cleaned.txt > gene_symbols.txt

Returns

    ectonucleoside triphosphate diphosphohydrolase 2-like
        <Name>LOC138610486</Name>
        <Name>LOC138633680</Name>
        <Name>entpd2b</Name>
        <Name>LOC136758709</Name>
        <Name>LOC102688617</Name>
        <Name>LOC119891639</Name>
        <Name>LOC129101112</Name>
        <Name>LOC122989871</Name>
.
.
.

        multiple epidermal growth factor-like domains protein 9
            <Name>LOC138622601</Name>
            <Name>LOC138610992</Name>
            <Name>si:dkey-220k22.1</Name>
            <Name>si:ch211-158d24.2</Name>
            <Name>si:ch211-158d24.2</Name>
            <Name>si:ch211-158d24.2</Name>
            <Name>si:ch211-158d24.2</Name>
            <Name>si:ch211-158d24.2</Name>
            <Name>si:dkey-220k22.1</Name>
            <Name>si:ch211-158d24.2</Name>
.
.
.

Clearly entpd2b is what I want for the first one, but as for the others, it's much less clear.

So I'm still left wondering how to take the information I have and simplify to gene symbols that can be more easily compared across species. I'm wondering if a blastx search of all the CDS sequence regions is inevitable to narrow down to a single symbol for each gene.

That said, I appreciate all the help so far!

ADD REPLY
2
Entering edit mode

You can query the protein name in UniProt:

https://rest.uniprot.org/uniprotkb/search?query=protein_name:"multiple epidermal growth factor-like domains protein"&fields=accession,gene_names,protein_name

this gives MEGF[...] as the result. Further down in the list you get an exact match on the fullName, associated with the gene name MEGF9.

{
  "entryType": "UniProtKB reviewed (Swiss-Prot)",
  "primaryAccession": "Q9H1U4",
  "proteinDescription": {
    "recommendedName": {
      "fullName": {
        "value": "Multiple epidermal growth factor-like domains protein 9"
      },
      "shortNames": [
        {
          "value": "Multiple EGF-like domains protein 9"
        }
      ]
    },
    "alternativeNames": [
      {
        "fullName": {
          "value": "Epidermal growth factor-like protein 5"
        },
        "shortNames": [
          {
            "value": "EGF-like protein 5"
          }
        ]
      }
    ],
    "flag": "Precursor"
  },
  "genes": [
    {
      "geneName": {
        "value": "MEGF9"
      },
      "synonyms": [
        {
          "value": "EGFL5"
        },
        {
          "value": "KIAA0818"
        }
      ],
      "orfNames": [
        {
          "value": "UNQ671/PRO1305"
        }
      ]
    }
  ],
  "extraAttributes": {
    "uniParcId": "UPI000045779F"
  }
}
ADD REPLY
0
Entering edit mode

Awesome! Thank you -- I really appreciate all the expertise on this forum

ADD REPLY
1
Entering edit mode
1 day ago
LChart 4.6k

I believe Ensembl uses GenBlast for this purpose - GenBlasting mammalian SwissProt against the reference genome with exon repair turned on to identify genes with cutoffs probably 50% coverage and 50% identity and 5% evalue. You have a gene build already, so you could probably use blastx directly against uniprot/swissprot.

ADD COMMENT

Login before adding your answer.

Traffic: 2420 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6