Dear all, I have a question and would like help answering it in terms of the technical process (from mapping to quantification). I recently annotated the genome of a eukaryotic species. After combining three methods using EVidenceModeler, the annotation of protein-coding genes yielded a total of 30,000 genes. I was then asked to report how many of those genes are supported by raw RNA-seq data. Note that the RNA-seq data were generated years ago from different tissues and from different individuals than the one I assembled and annotated, but all from the same species.
By mapping the raw reads to the CDS sequences of the assembled genome and quantifying read abundance for each CDS, I think I can answer this question. But I am still unsure about this approach. Is there another approach or a standardized method for answering this question?
I really appreciate any help you can provide.
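To make the counting step of the approach described above concrete, here is a minimal Python sketch. It assumes the mapping and quantification have already been run and have produced a two-column, tab-separated table of CDS or gene ID and read count. That file layout, the function names, and the threshold of 10 reads are all hypothetical choices for illustration, not a standard.

```python
import csv
import io


def read_counts(handle):
    """Yield (gene_id, read_count) pairs from a tab-separated table.

    Assumed (hypothetical) layout: column 1 = gene/CDS ID, column 2 = count.
    Lines starting with '#' and empty lines are skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        yield row[0], float(row[1])


def count_supported_genes(count_rows, min_reads=10):
    """Return (supported_ids, all_ids) for a chosen read-count threshold."""
    supported = set()
    total = set()
    for gene_id, n_reads in count_rows:
        total.add(gene_id)
        if n_reads >= min_reads:
            supported.add(gene_id)
    return supported, total


# Toy data standing in for a real quantification table.
example = "gene1\t120\ngene2\t0\ngene3\t9\ngene4\t35\n"
supported, total = count_supported_genes(
    read_counts(io.StringIO(example)), min_reads=10
)
print(f"{len(supported)} of {len(total)} genes have >= 10 reads")
```

The threshold is a judgment call: a gene with zero reads may simply not be expressed in the sampled tissues, so the supported count is a lower bound on correct predictions rather than a measure of annotation error.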
Thanks, GenoMax. From a bioinformatics perspective, does my approach seem logical for broadly answering this request?
I assume you are referring to using the old RNA-seq data? You should align the reads to the entire genome and then see where the reads align and how well they support your gene predictions. Again, a negative result would not mean the prediction is wrong, but if reads align to parts of the genome where you did not predict a coding sequence, then you will need to check your predictions in that region.
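As a rough illustration of the genome-based check, the Python sketch below compares aligned read intervals with predicted gene intervals. It assumes the alignment coordinates and gene coordinates have already been extracted into simple tuples (in practice they would come from an alignment file and the annotation file). The tuple layout, function names, and toy coordinates are hypothetical, and the linear scan is only suitable for small examples.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def classify(genes, alignments):
    """Count reads overlapping each predicted gene and collect reads outside all genes.

    genes:      list of (gene_id, chrom, start, end)
    alignments: list of (chrom, start, end)
    """
    by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, gene_id))

    support = {gene_id: 0 for gene_id, _, _, _ in genes}
    outside = []
    for chrom, start, end in alignments:
        hit = False
        for g_start, g_end, gene_id in by_chrom.get(chrom, []):
            if overlaps(start, end, g_start, g_end):
                support[gene_id] += 1
                hit = True
        if not hit:
            # Read aligns where no gene was predicted: a region to re-check.
            outside.append((chrom, start, end))
    return support, outside


# Toy coordinates standing in for real predictions and alignments.
genes = [("g1", "chr1", 100, 500), ("g2", "chr1", 900, 1200), ("g3", "chr2", 50, 300)]
alignments = [("chr1", 120, 220), ("chr1", 450, 550), ("chr1", 600, 700), ("chr2", 400, 500)]

support, outside = classify(genes, alignments)
supported_genes = [g for g, n in support.items() if n > 0]
print("supported genes:", supported_genes)
print("reads outside predictions:", outside)
```

Genes with zero overlapping reads are not necessarily wrong, as noted above. The `outside` list is the more actionable output, since it points to regions where a prediction may be missing.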