Question

Converting gene_ids from GTF file to common gene symbols

11 hours ago

Gus • 0

Hi All,

Long time reader, first-time posting. I have been doing RNA-seq analysis for some time now on non-model species. My pipeline was:

1. Trim and filter reads with BBDuk and Trimmomatic
2. Align to the genome with HISAT2
3. Generate counts with featureCounts
4. Export the counts table and load it into R for DEG analysis in edgeR

I currently have gene count tables that have 1 row for each gene id as it appears in the GTF file (taken from NCBI). The problem is that for one species, Chaenocephalus aceratus, all the gene IDs are really locus tags, and products are listed as hypothetical proteins. I would like to be able to compare expression across all species for common genes. But to do this I need to convert these species-specific gene IDs to more common gene symbols.

I also have a mapping file which contains the protein accession IDs. But this species isn't found in common resources like DAVID, Entrez, Ensembl, etc.

I also have created a fasta file which contains the genome sequences for all the CDS regions in the GTF file, organized by gene id (locus tags). But I'm unsure how I might use these to identify the genes / create a mapping list for gene ontology analysis, etc.

Any help would be greatly appreciated! I'm posting from my phone, but can provide more specific code snippets / examples as needed.

Thanks!

gene_id GTF symbol • 117 views

updated 4 hours ago by GenoMax 147k • written 11 hours ago by Gus • 0

"But to do this I need to convert these species-specific gene IDs to more common gene symbols."

Probably not what you want to hear, but there are no shortcuts to proper annotation, so this is not something you can breeze through.

What you may want to do is complete your DE analysis with the IDs you have. Then take the list of locus IDs that are of interest and spend some time doing manual annotation (including BLAST, sequence alignments, etc.) to see if you can assign gene IDs.
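As a rough sketch of the "take the list of locus IDs of interest" step, the snippet below pulls just those entries out of a CDS FASTA so they can be fed to BLAST. It assumes the locus tag is the first whitespace-delimited token of each FASTA header; the function names and the `LOCUS_0001`-style IDs are made up for illustration and are not from the original post.

```python
import io
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_fasta(handle: TextIO, wanted_ids: Iterable[str]) -> Dict[str, str]:
    """Return {locus_tag: sequence} for the requested IDs only.

    Assumption: the locus tag is the first token of the header line.
    """
    wanted = set(wanted_ids)
    picked = {}
    for header, seq in read_fasta(handle):
        seq_id = header.split()[0]
        if seq_id in wanted:
            picked[seq_id] = seq
    return picked


if __name__ == "__main__":
    # Hypothetical two-record FASTA standing in for the real CDS file.
    demo = io.StringIO(
        ">LOCUS_0001 hypothetical protein\nATGAAA\nCCCTAA\n"
        ">LOCUS_0002 hypothetical protein\nATGGGGTAA\n"
    )
    for seq_id, seq in subset_fasta(demo, ["LOCUS_0001"]).items():
        print(f">{seq_id}\n{seq}")
```

With a real file you would open it with `open(path)` instead of `io.StringIO` and pass in the locus IDs that came out of the DE analysis.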

4 hours ago by GenoMax 147k

score 0 · Answer 1 · 2024-11-26

I believe Ensembl uses GenBlast for this purpose: they GenBlast mammalian Swiss-Prot proteins against the reference genome with exon repair turned on to identify genes, with cutoffs of probably 50% coverage, 50% identity, and 5% e-value. Since you already have a gene build, you could probably run blastx with your CDS sequences directly against UniProt/Swiss-Prot.
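If you go the blastx route, here is a minimal sketch of turning the results into a locus tag to gene symbol table. It assumes BLAST+ tabular output written with `-outfmt "6 qseqid sseqid pident length evalue bitscore qcovs stitle"` and a database built from the UniProt Swiss-Prot FASTA, whose headers carry the gene name as `GN=...`. The 50% identity and 50% coverage defaults are borrowed loosely from the cutoffs mentioned above; the `1e-5` e-value is only a placeholder of mine, and the function name and sample rows are hypothetical.

```python
import csv
import io
import re
from typing import Dict, Optional, TextIO

# Column order must match the -outfmt string assumed in the text above.
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "evalue", "bitscore", "qcovs", "stitle"]
GENE_NAME = re.compile(r"\bGN=(\S+)")  # UniProt headers: "... GN=SYMBOL PE=..."


def best_hits(handle: TextIO,
              min_pident: float = 50.0,
              min_qcov: float = 50.0,
              max_evalue: float = 1e-5) -> Dict[str, Dict[str, Optional[str]]]:
    """Keep the highest-bitscore hit per query that passes the cutoffs."""
    best: Dict[str, Dict] = {}
    reader = csv.DictReader(handle, fieldnames=COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if (float(row["pident"]) < min_pident
                or float(row["qcovs"]) < min_qcov
                or float(row["evalue"]) > max_evalue):
            continue
        bitscore = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            match = GENE_NAME.search(row["stitle"] or "")
            best[row["qseqid"]] = {
                "subject": row["sseqid"],
                "symbol": match.group(1) if match else None,
                "bitscore": bitscore,
            }
    return best


if __name__ == "__main__":
    # Hypothetical rows standing in for a real blastx result file.
    demo = io.StringIO(
        "LOCUS_0001\tsp|P00001|AAA_HUMAN\t82.0\t120\t1e-40\t250\t95\t"
        "Protein A OS=Homo sapiens OX=9606 GN=ABC1 PE=1 SV=1\n"
        "LOCUS_0002\tsp|P00002|BBB_MOUSE\t35.0\t80\t1e-3\t40\t30\t"
        "Protein B OS=Mus musculus OX=10090 GN=Xyz2 PE=1 SV=1\n"
    )
    for locus, hit in best_hits(demo).items():
        print(locus, hit["subject"], hit["symbol"], sep="\t")
```

A best hit that passes these cutoffs is only a candidate name, not proof of orthology, so it is still worth checking the genes you care about by hand. The resulting table can be written out and joined to the count table by locus tag.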