Hi everyone, I'm new to Biostars and signed up specifically to ask about this issue. I'm at a loss on how to proceed with my RNA-seq pipeline: after annotating novel transcripts, I found that many of them appear to belong to genes that are already annotated in my GFF file. I am working with a non-model fish species that has only a draft genome and a draft annotation.
My differential expression pipeline so far has been: STAR (with public GFF) -> StringTie + StringTie merge -> 2nd run of STAR (with merged GFF) -> StringTie -> prepDE.py (to get counts table) -> DESeq2
In parallel, I've been trying to annotate the novel transcripts assembled from the STAR alignments so I can run GO and KEGG analysis on differentially expressed genes. To annotate them, I used gffread on the merged GFF, submitted the novel transcript nucleotide sequences to ORFfinder, ran BLASTX against all fish RefSeq protein sequences, and then submitted the ORFfinder protein sequences for transcripts with BLASTX e-values < 1E-5 to eggNOG, TrEMBL, and several other annotation databases.
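For concreteness, the e-value filter in that step can be sketched roughly as below. This assumes the BLASTX results are in tabular format (-outfmt 6, where column 11 is the e-value and column 12 is the bit score). The rows and IDs are made up for illustration, and keeping only the top bit-score hit per transcript is an assumption of the sketch, not something the pipeline requires.

```python
import csv
import io

# Hypothetical BLASTX tabular output (-outfmt 6); all IDs and values are made up.
BLAST_TSV = (
    "MSTRG.1.1\tXP_000001.1\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-120\t400\n"
    "MSTRG.1.1\tXP_000002.1\t60.0\t150\t60\t2\t1\t450\t5\t154\t1e-30\t120\n"
    "MSTRG.2.1\tXP_000003.1\t35.0\t40\t26\t1\t10\t129\t3\t42\t0.5\t25\n"
)


def best_hits(handle, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for hits with evalue < max_evalue.

    When a query has several passing hits, the one with the highest
    bit score is kept.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= max_evalue:
            continue  # fails the e-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


hits = best_hits(io.StringIO(BLAST_TSV))
for query, (subject, evalue, bitscore) in hits.items():
    print(query, subject, evalue, bitscore)
```

With a real results file, the same function could be called on an open file handle instead of the in-memory string.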
After looking at the annotations, I'm realizing that many of these novel transcripts, assembled from the STAR alignments to the draft genome, annotate as genes that are already present in the original unmerged GFF file. This seems to be the result of poor annotation of the genome (missed exons, etc.), since many of them BLAST with nearly 100% sequence identity to these same genes in sister species.
For the genes that appear as duplicate entries due to incomplete annotation of the draft genome, how do I handle these results for differential expression analysis and GO/KEGG analysis? If the actual read counts for a single gene are split among multiple annotations, is there a way to combine them, using either FPKM or counts, so that I'm working with more accurate expression values? Then, when I get to the enrichment step, I also don't want duplicate entries for a single gene in the background or test set of transcripts.
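To make the question concrete, here is a minimal sketch of the kind of merge I have in mind. The gene IDs, sample counts, and the novel-to-reference mapping are all hypothetical. I'm assuming raw counts can simply be summed per gene before DESeq2 (which expects raw counts) as long as the fragments don't overlap and double-count reads. I'm less sure the same works for FPKM, since each value is normalized by its own feature length.

```python
# Hypothetical raw count table: gene ID -> counts for [sample_A, sample_B].
counts = {
    "gene-abc1": [100, 80],
    "MSTRG.10": [40, 20],   # novel locus that annotates as gene-abc1
    "MSTRG.11": [5, 7],     # novel locus with no match to an annotated gene
    "gene-xyz2": [30, 30],
}

# Hypothetical mapping built from the annotation results:
# novel locus -> reference gene it appears to be part of.
novel_to_ref = {"MSTRG.10": "gene-abc1"}


def collapse_counts(counts, id_map):
    """Sum per-sample raw counts of all IDs that map to the same gene."""
    merged = {}
    for gene_id, row in counts.items():
        target = id_map.get(gene_id, gene_id)  # unmapped IDs keep their own ID
        if target in merged:
            merged[target] = [a + b for a, b in zip(merged[target], row)]
        else:
            merged[target] = list(row)
    return merged


merged = collapse_counts(counts, novel_to_ref)

# One entry per gene, usable as a de-duplicated background set for enrichment.
background = sorted(merged)

for gene_id in background:
    print(gene_id, merged[gene_id])
```

The merged table would then go into DESeq2 in place of the original counts table, and the de-duplicated ID list would serve as the background for GO/KEGG.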
Thank you in advance for your help.