How to: make custom Camellia sinensis var. sinensis (black tea) annotation files for BINGO Cytoscape
BINGO wants a format like this for custom annotation files:
(species=Saccharomyces cerevisiae)(type=Biological Process)(curator=GO)
YAL001C = 0006384
YAL002W = 0045324
YAL002W = 0045324
YAL003W = 0006414
YAL004W = 0000004
YAL005C = 0006616
YAL005C = 0006457
YAL005C = 0000060
YAL007C = 0006888
YAL008W = 0000004
...
(See https://www.psb.ugent.be/cbd/papers/BiNGO/Customize.html)
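Before loading a custom file into BINGO, it can be worth checking that every line matches the layout above. Here is a rough sketch in shell. The function name check_bingo_annotation is made up for illustration, and the check assumes that GO numbers are always seven digits with the "GO:" prefix removed, as in the example; BINGO itself may be more lenient than this.

```shell
# check_bingo_annotation FILE   (hypothetical helper, not part of BINGO)
# Prints every line that does not match the expected layout and
# returns non-zero if at least one such line is found.
check_bingo_annotation() {
  awk '
    NR == 1 {
      # header: (species=...)(type=...)(curator=...)
      if ($0 !~ /^\(species=[^)]+\)\(type=[^)]+\)\(curator=[^)]+\)$/) {
        print "line 1: bad header: " $0
        bad = 1
      }
      next
    }
    # body: <gene id> = <seven-digit GO number>  (seven digits is an assumption)
    $0 !~ /^[^ =]+ = [0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
      print "line " NR ": " $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Usage would be something like: check_bingo_annotation teaBPannotation_final.txt && echo "looks OK" (the file name here is only a placeholder).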
BEGINNING OF TUTORIAL:
- Go to the assembly page: https://www.ncbi.nlm.nih.gov/assembly/GCA_004153795.2
- Download and extract the Feature_table file from GenBank (see image below)
- Now, in the terminal, parse out all the tea gene IDs (the directory I am working in is ~/Desktop/biostars/):
cut -f17 ~/Desktop/biostars/GCA_004153795.2_AHAU_CSS_2_feature_table.txt | tr -d '_' | awk 'NR>1' | sort -u > ~/Desktop/biostars/geneids.txt
- Next, go to http://teacon.wchoda.com/GOEnrichment and paste in the contents of geneids.txt. Select Biological Process (you have to repeat this with the same gene IDs for Molecular Function and Cellular Component to make three separate files) and use these settings:
pvalue cutoff: 1
padjustmethod: FDR (this might not matter, but just in case)
qvaluecutoff: 1000
(We just want all the GO annotations; that's the purpose of these settings!)
- Click 'Submit' and on the next page after it loads click 'Export Data'
- Now, in R, use the following script to reshape the data into the format that BINGO wants:
# read in the .csv downloaded from TeaCoN
teaBP <- read.csv("~/Desktop/biostars/GO Enrichmnet - TeaCoN.csv", header = TRUE)
# null out unnecessary columns
teaBP$Description <- NULL
teaBP$GeneRatio <- NULL
teaBP$BgRatio <- NULL
teaBP$pvalue <- NULL
teaBP$p.adjust <- NULL
teaBP$qvalue <- NULL
teaBP$Count <- NULL
# split up genes in bunched value columns to individual columns
install.packages('splitstackshape')  # only needed once
library('splitstackshape')
teaBPsplit <- cSplit(teaBP, "geneID", sep = " ")
# wide to long (2:388 is the column range for this dataset; adjust it to match your own split table)
install.packages('tidyr')  # only needed once
library(tidyr)
teaBPsplit.long <- pivot_longer(teaBPsplit, 2:388, names_to = "colnames", values_to = "genes")
# remove NAs
teaBPsplit.long.noNA <- teaBPsplit.long[!is.na(teaBPsplit.long$genes), ]
# remove unnecessary column
teaBPsplit.long.noNA$colnames <- NULL
# write your file
write.table(teaBPsplit.long.noNA, file = '~/Desktop/biostars/teaBPannotation.txt', col.names = FALSE, row.names = FALSE)
- In the terminal, strip the quotes and rearrange the columns so the gene ID comes first:
cat teaBPannotation.txt | tr -d '"' | awk -F " " ' NR>1 { print $3 " " $2 }' > teaBPannotation_clean.txt
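- Lastly, in TextEdit (or any plain-text editor), replace every occurrence of ".1 GO:" with " = " (an equals sign with a space on each side), so that each line matches the "gene ID = GO number" layout shown at the top.
- Don't forget to add this header as the very first line of the file:
(species=Camellia sinensis)(type=Biological Process)(curator=GO)
- Then select the file in BINGO.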
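The find-and-replace and the header line can also be done in the terminal instead of a text editor. Below is a minimal sketch, not the method used above: the function name finalize_bingo_annotation is made up for illustration, and it assumes that every line of the input looks like "GENEID.1 GO:0006412", i.e. a gene ID ending in ".1", one space, then the GO term.

```shell
# finalize_bingo_annotation IN OUT [TYPE]   (hypothetical helper)
# IN   : file with lines like "GENEID.1 GO:0006412" (assumed shape)
# OUT  : BINGO annotation file to write
# TYPE : GO category for the header (defaults to Biological Process)
finalize_bingo_annotation() {
  in_file=$1
  out_file=$2
  go_type=${3:-Biological Process}
  {
    # header line that BINGO expects at the very top of the file
    printf '(species=Camellia sinensis)(type=%s)(curator=GO)\n' "$go_type"
    # same substitution as the text-editor step: ".1 GO:" becomes " = "
    sed 's/\.1 GO:/ = /' "$in_file"
  } > "$out_file"
}
```

For example, finalize_bingo_annotation teaBPannotation_clean.txt teaBPannotation_final.txt would write the Biological Process file (the output file name is only a placeholder); passing 'Molecular Function' or 'Cellular Component' as a third argument changes the type in the header. Lines that do not contain ".1 GO:" are passed through unchanged, so it is worth eyeballing the result.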
NOTE: if you want a similar file for Molecular Function or Cellular Component, repeat everything from the TeaCoN step, selecting your desired GO category instead, and change the header line of the final file accordingly: (species=Camellia sinensis)(type=CHANGE TO GO CATEGORY)(curator=GO)
Here is the Biological Process annotation file that was generated from this tutorial: https://drive.google.com/file/d/1Gv6M6N_T1e00fTqd4qprNQJgdED_YSt4/view?usp=sharing