Extracting Ensembl or Entrez gene IDs from a gene annotation file for the BINGO app in Cytoscape
3.5 years ago

I am working with Camellia sinensis (the black tea species) and have found a few differentially expressed genes. I have the genome annotation file for this species, but I cannot obtain Ensembl or Entrez gene IDs for these genes. I tried doing it manually and with the online ID-conversion tools PANTHER and DAVID, but neither worked, probably because the species is uncommon. Please help me.

The gene IDs look like this:
TEA_016967, TEA_010081, TEA_002547, TEA_015527, TEA_019823

My genome annotation file is in GTF format, and I cannot extract Ensembl or Entrez IDs from it. Please help. The GTF file looks like this:

SDRB02000004.1 Genbank gene 6018 10396 . + . gene_id "TEA_012962"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "TEA_012962";

SDRB02000004.1 Genbank transcript 6018 10396 . + . gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; gbkey "mRNA"; locus_tag "TEA_012962";orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA";

SDRB02000004.1 Genbank exon 6018 6864 . + . gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "1";

SDRB02000004.1 Genbank exon 7548 7685 . + . gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "2";

SDRB02000004.1 Genbank exon 7802 7923 . + . gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "3";

BINGO analysis RNA-seq R GO • 2.4k views
3.5 years ago
Pratik ★ 1.1k

EDITED POST:

Okay, this should get you started to extract the information you need.

If the contents of your GTF file look like this:

SDRB02000004.1 Genbank transcript 6018 10396 . + . gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; gbkey "mRNA"; locus_tag "TEA_012962";orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA";

SDRB02000004.1 Genbank exon 6018 6864 . + . gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "1";

SDRB02000004.1 Genbank exon 7548 7685 . + . gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "2";

SDRB02000004.1 Genbank exon 7802 7923 . + . gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "3";

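then type the following in a terminal (it extracts the gene_id value from each line, whether the columns are separated by tabs or spaces, and removes the underscore):

 grep -o 'gene_id "[^"]*"' your_gtf_file.gtf | cut -d '"' -f2 | tr -d '_' > yourgenes.txt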

The file yourgenes.txt should then contain:

TEA012962

TEA012962

TEA012962

TEA012962
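Every gene shows up once per transcript and exon line, so the list contains repeats. Here is a minimal sketch of a de-duplicated version. The file names `sample.gtf` and `unique_genes.txt` are placeholders made up for this illustration, and the two sample lines are shortened versions of the ones above, assuming tab-separated columns as in a standard GTF.

```shell
# Write a tiny two-line, tab-separated sample (placeholder data for illustration).
printf 'SDRB02000004.1\tGenbank\texon\t6018\t6864\t.\t+\t.\tgene_id "TEA_012962"; exon_number "1";\n' > sample.gtf
printf 'SDRB02000004.1\tGenbank\texon\t7548\t7685\t.\t+\t.\tgene_id "TEA_012962"; exon_number "2";\n' >> sample.gtf

# Pull out each gene_id value, drop the underscore, and keep one copy of each ID.
grep -o 'gene_id "[^"]*"' sample.gtf | cut -d '"' -f2 | tr -d '_' | sort -u > unique_genes.txt

cat unique_genes.txt
```

With `sort -u` each gene ID is listed only once, which should be easier to paste into a web form.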

Then copy and paste your list from yourgenes.txt into the following page for GO enrichment (don't worry about the ".1" after the gene ID; the website adds it automatically):

http://teacon.wchoda.com/GOEnrichment

Once it's enriched click "Data Export" to download the file.

Then you have to follow this tutorial to get it into BINGO:

https://www.psb.ugent.be/cbd/papers/BiNGO/Customize.html

Good luck and hope this helps!

Please reach out if you need help with the final step of getting it from that GO Enrichment - TeaCoN csv file to BINGO.


Thank you for helping. I am having trouble with the GO enrichment analysis. I have approximately 20 differentially expressed genes that I want to analyze with the BINGO app in Cytoscape, but BINGO does not accept my custom reference annotation file, which is a GTF file. My settings are: the gene set is given as GenBank IDs, the significance level is 0.05, and the reference annotation is the custom GTF file for Camellia sinensis assembly 2. Thank you!


Hey Abhisek,

I think the reference annotation file you are uploading might be wrong. I believe you are uploading something you downloaded directly from here: https://www.ncbi.nlm.nih.gov/assembly/GCA_004153795.2

See a tutorial I wrote regarding the custom annotation file that you have to make:

How to: make Camellia sinensis var. sinensis (black tea) custom annotation files for BINGO Cytoscape

I provided a biological process custom annotation file at the bottom of the tutorial as a Google Drive link. If you want to use the Molecular Function and Cellular Component categories, you will have to repeat the tutorial from the TeaCoN step.

Hope this helps and good luck!


Thank you again for your valuable response, but the first command in the tutorial is not working for me. I changed it to point to my working directory, where the feature table is located. The tutorial command is:

cat ~/Desktop/biostars/GCA_004153795.2_AHAU_CSS_2_feature_table.txt | cut -f17 | tr -d '_' | awk '(NR>1)' | sort | uniq > ~/Desktop/biostars/geneids.txt

In my case I typed:

cat ~/home/abhisek/Documents/workingdir/GCA_004153795.2_AHAU_CSS_2_feature_table.txt | cut -f17 | tr -d '_' | awk '(NR>1)' | sort | uniq > ~/home/abhisek/Documents/workingdir/geneids.txt

The error is the following:

bash: /home/abhisek/home/abhisek/Documents/workingdir/geneids.txt: No such file or directory
cat: /home/abhisek/home/abhisek/Documents/workingdir/GCA_004153795.2_AHAU_CSS_2_feature_table.txt: No such file or directory

Please let me know why it is not working on my Linux PC.


Hi Abhisek, you typed

~/home/abhisek/Documents/workingdir/GCA_004153795.2_AHAU_CSS_2_feature_table.txt

Because you typed ~ in front of a path that already starts with /home/abhisek, the home directory is repeated.

In other words, in your case ~ = /home/abhisek

This is shown in your error:

bash: /home/abhisek/home/abhisek/Documents/workingdir/geneids.txt: No such file or directory
cat: /home/abhisek/home/abhisek/Documents/workingdir/GCA_004153795.2_AHAU_CSS_2_feature_table.txt: No such file or directory

where /home/abhisek appears twice at the start of the file path.

Your solution would be to either remove ~ or remove /home/abhisek because they mean the same thing.
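To illustrate the point about ~, here is a small sketch. The directory names follow the ones in this thread, and the variable names `short`, `full`, and `doubled` are made up for the example.

```shell
# The shell replaces a leading ~ with your home directory ($HOME),
# so these two variables hold the same path.
short=~/Documents/workingdir
full="$HOME/Documents/workingdir"
echo "$short"
echo "$full"

# Writing ~ in front of a path that already contains the home directory
# repeats it, which is the pattern seen in the error message.
# ${HOME#/} is $HOME without its leading slash.
doubled=~/${HOME#/}/Documents/workingdir
echo "$doubled"
```

The first two lines printed are identical, while the third shows the home directory twice.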

Hope this helps!


Hello Pratik Sir, I followed your tutorial, but I do not get the expected result in Cytoscape, so I think there may be a small mistake in the code. Please compare my gene IDs with the annotation file you made: the gene IDs associated with GO IDs in that file do not match the gene IDs I input.

To explain, my input gene IDs are the following: TEA_016967, TEA_010081, TEA_002547, TEA_015527, TEA_019823

But in the tutorial, the gene IDs associated with GO IDs look like this:

TEA002763 = 0000028
TEA006828 = 0000028
TEA006848 = 0000028
TEA008332 = 0000028
TEA020472 = 0000028
TEA024267 = 0000028
TEA027695 = 0000028

That is why my input does not show any result. Thanks, Pratik Sir.


I think you might just simply have to remove the underscores _

You could do that manually or in terminal like so:

cat yourgenes.txt | tr -d "_"

and then copy and paste the output as your input.

So instead of gene names like TEA_016967, TEA_010081, TEA_002547, TEA_015527, TEA_019823

you would do TEA016967, TEA010081, so on and so forth.
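If the IDs are in a comma-separated list rather than a file, a sketch along the following lines may help. The variable names `ids` and `cleaned` are hypothetical, and the list is the one quoted in this thread.

```shell
# Placeholder input: the comma-separated IDs from the question.
ids='TEA_016967, TEA_010081, TEA_002547, TEA_015527, TEA_019823'

# Put one ID per line, then delete the underscores and the spaces.
cleaned=$(printf '%s\n' "$ids" | tr ',' '\n' | tr -d '_ ')

printf '%s\n' "$cleaned"
```

This prints one underscore-free ID per line, ready to copy and paste.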


Respected Sir, you have helped a lot. Meanwhile, I found the actual problem. The common names of the genes (TEA_016967, TEA_010081, TEA_002547, TEA_015527, TEA_019823) are not associated with GO IDs, but TEA016967, TEA010081, and so on are linked to GO. So, based on the annotation data I gave, it would be very helpful if you could write a small script that takes the common names and outputs the GO-linked IDs. Thanks again.


You're welcome. Remember to pay it forward by helping others : )

I think I understand what you want. You want to input genes with the underscore like below into BINGO:

TEA_016967, TEA_010081, etc...

instead of changing them to this for your BINGO input:

TEA016967, TEA010081, etc...

You can do this easily in a basic text editor. Open your gene annotation file and use Replace All (or Find & Replace) to change:

TEA to TEA_
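The same replacement can be sketched in a terminal with sed. The file names `annotation_sample.txt` and `annotation_with_underscores.txt` are placeholders, and the "gene = GO" layout is assumed from the lines quoted earlier in this thread; your real annotation file may contain other lines as well.

```shell
# Placeholder annotation lines in the "gene = GO" layout quoted in the thread.
printf 'TEA002763 = 0000028\nTEA006828 = 0000028\n' > annotation_sample.txt

# Insert an underscore only where TEA is followed directly by a digit,
# so IDs that already contain an underscore are left unchanged.
sed -E 's/TEA([0-9])/TEA_\1/g' annotation_sample.txt > annotation_with_underscores.txt

cat annotation_with_underscores.txt
```

Matching on the following digit is a small safeguard: a plain TEA to TEA_ replacement would also alter any ID that already has an underscore.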
