I have the following GTF file, and I want to extract the GO IDs and associated protein IDs for some specific gene IDs (e.g. NZ_CP072122.1) so that I can run GO enrichment analysis and KEGG mapping analysis. Can anyone help me write some code?
I have differential gene expression data, from which I pulled out the genes that are upregulated and downregulated. I also have the annotation file for this genome and the protein sequences in FASTA format.
Next, I converted the protein IDs into their associated KEGG IDs with BlastKOALA (a conversion tool in KEGG) so that I can map my differentially expressed genes onto pathways. That's why I need to extract the gene ID, GO ID, and associated protein ID from the GTF file.
I know a few languages at a beginner level, e.g. R, bash scripting, and Python.
If anyone can help a little more, please suggest some good blogs/articles/posts on KEGG mapping and GO analysis. Thanks in advance.
NZ_CP072122.1 RefSeq gene 25092 26945 . + . ID=gene-J5P21_RS00130;Name=J5P21_RS00130;gbkey=Gene;gene_biotype=protein_coding;locus_tag=J5P21_RS00130;old_locus_tag=J5P21_00130
NZ_CP072122.1 Protein Homology CDS 25092 26945 . + 0 ID=cds-WP_001278225.1;Parent=gene-J5P21_RS00130;Dbxref=Genbank:WP_001278225.1;Name=WP_001278225.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997422.1;locus_tag=J5P21_RS00130;product=ferrous iron transporter B;protein_id=WP_001278225.1;transl_table=11
NZ_CP072122.1 RefSeq gene 26966 27217 . + . ID=gene-J5P21_RS00135;Name=J5P21_RS00135;gbkey=Gene;gene_biotype=protein_coding;locus_tag=J5P21_RS00135;old_locus_tag=J5P21_00135
NZ_CP072122.1 Protein Homology CDS 26966 27217 . + 0 ID=cds-WP_000942501.1;Parent=gene-J5P21_RS00135;Dbxref=Genbank:WP_000942501.1;Name=WP_000942501.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997423.1;locus_tag=J5P21_RS00135;product=hypothetical protein;protein_id=WP_000942501.1;transl_table=11
NZ_CP072122.1 RefSeq gene 27378 28724 . + . ID=gene-J5P21_RS00140;Name=murD;gbkey=Gene;gene=murD;gene_biotype=protein_coding;locus_tag=J5P21_RS00140;old_locus_tag=J5P21_00140
NZ_CP072122.1 Protein Homology CDS 27378 28724 . + 0 ID=cds-WP_045544631.1;Parent=gene-J5P21_RS00140;Dbxref=Genbank:WP_045544631.1;Name=WP_045544631.1;Ontology_term=GO:0009252,GO:0008764;gbkey=CDS;gene=murD;go_function=UDP-N-acetylmuramoylalanine-D-glutamate ligase activity|0008764||IEA;go_process=peptidoglycan biosynthetic process|0009252||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997424.1;locus_tag=J5P21_RS00140;product=UDP-N-acetylmuramoyl-L-alanine--D-glutamate ligase;protein_id=WP_045544631.1;transl_table=11
NZ_CP072122.1 RefSeq gene 28749 29945 . + . ID=gene-J5P21_RS00145;Name=ftsW;gbkey=Gene;gene=ftsW;gene_biotype=protein_coding;locus_tag=J5P21_RS00145;old_locus_tag=J5P21_00145
NZ_CP072122.1 Protein Homology CDS 28749 29945 . + 0 ID=cds-WP_000907680.1;Parent=gene-J5P21_RS00145;Dbxref=Genbank:WP_000907680.1;Name=WP_000907680.1;Ontology_term=GO:0009252,GO:0051301,GO:0003674,GO:0016020;gbkey=CDS;gene=ftsW;go_component=membrane|0016020||IEA;go_function=molecular_function|0003674||IEA;go_process=peptidoglycan biosynthetic process|0009252||IEA,cell division|0051301||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997425.1;locus_tag=J5P21_RS00145;product=putative lipid II flippase FtsW;protein_id=WP_000907680.1;transl_table=11
NZ_CP072122.1 RefSeq gene 29995 30927 . - . ID=gene-J5P21_RS00150;Name=gluQRS;gbkey=Gene;gene=gluQRS;gene_biotype=protein_coding;locus_tag=J5P21_RS00150;old_locus_tag=J5P21_00150
NZ_CP072122.1 Protein Homology CDS 29995 30927 . - 0 ID=cds-WP_000216745.1;Parent=gene-J5P21_RS00150;Dbxref=Genbank:WP_000216745.1;Name=WP_000216745.1;Ontology_term=GO:0043039,GO:0004812;gbkey=CDS;gene=gluQRS;go_function=aminoacyl-tRNA ligase activity|0004812||IEA;go_process=tRNA aminoacylation|0043039||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997426.1;locus_tag=J5P21_RS00150;product=tRNA glutamyl-Q(34) synthetase GluQRS;protein_id=WP_000216745.1;transl_table=11
NZ_CP072122.1 RefSeq gene 30930 31466 . - . ID=gene-J5P21_RS00155;Name=dksA;gbkey=Gene;gene=dksA;gene_biotype=protein_coding;locus_tag=J5P21_RS00155;old_locus_tag=J5P21_00155
What have you tried so far?
Thanks for your response. I have differential gene expression data, from which I pulled out the genes that are upregulated and downregulated. I also have the annotation file for this genome and the protein sequences in FASTA format.
Next, I converted the protein IDs into their associated KEGG IDs with BlastKOALA (a conversion tool in KEGG) so that I can map my differentially expressed genes onto pathways. That's why I need to extract the gene ID, GO ID, and associated protein ID from the GTF file.
I know a few languages at a beginner level, e.g. R, bash scripting, and Python.
If anyone can help a little more, please suggest some good blogs/articles/posts on KEGG mapping and GO analysis. Thanks in advance.
abhisek061, why did you delete the post?
About three days had passed without anyone responding to my query, and my post had turned red, so I thought no one could see it. That's why I created a new post and deleted the old one.
I've deleted your other post. Please do not open multiple posts for the same topic.
What coding languages would you prefer? There are methods for extracting the gene and protein IDs, but KEGG and GO analysis are entirely different efforts. This is similar to "I have a key, can you provide me with a car and show me how to drive?" Are there KEGG and GO resources for this organism? You can't do GO enrichment analysis with a single gtf file on its own.
Sir, please check my response at the top of this post, where I tried to explain what I want to do.
Have you tried reading the GTF file with Python and gtfparse? That should give you a data structure with a column for each of the attributes. But if your GTF file is non-standard or problematic, you could simply use Python to read it in and parse the 9th field for what you need.
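As a rough illustration of the "parse the 9th field yourself" route, here is a minimal sketch using only the Python standard library. It rests on a few assumptions that are not confirmed anywhere in this thread: that the real file is tab-separated (the pasted sample shows spaces, which may just be how the post was rendered), that the attributes use the `key=value;key=value` style seen in the sample (which looks more like GFF3 than strict GTF), and that `locus_tag`, `protein_id`, and `Ontology_term` are the attributes you want. The function names `parse_attributes` and `extract_cds_annotations` are made up for this example.

```python
import io


def parse_attributes(field):
    """Split a 'key=value;key=value' attribute string into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            # Split on the first '=' only, so values may contain '='.
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def extract_cds_annotations(handle):
    """Yield (locus_tag, protein_id, [GO IDs]) for each CDS row.

    Assumes nine tab-separated columns with the attributes in column 9.
    CDS rows without an Ontology_term attribute yield an empty GO list.
    """
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # keep only well-formed CDS rows
        attrs = parse_attributes(cols[8])
        go_ids = [g for g in attrs.get("Ontology_term", "").split(",") if g]
        yield attrs.get("locus_tag", ""), attrs.get("protein_id", ""), go_ids


# One CDS row from the question (attributes shortened), rebuilt with tabs.
sample = "\t".join([
    "NZ_CP072122.1", "Protein Homology", "CDS", "27378", "28724", ".", "+", "0",
    "ID=cds-WP_045544631.1;Parent=gene-J5P21_RS00140;"
    "Ontology_term=GO:0009252,GO:0008764;locus_tag=J5P21_RS00140;"
    "protein_id=WP_045544631.1",
]) + "\n"

for locus, protein, gos in extract_cds_annotations(io.StringIO(sample)):
    print(locus, protein, ",".join(gos), sep="\t")
```

For a real file, you would pass an open file handle instead of the `io.StringIO` sample and write the rows to a TSV. That gives a gene-to-protein-to-GO table you can filter by your differentially expressed locus tags, but the enrichment test itself still needs a separate tool and a background gene set.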