Hello! I'm currently working on RNA-seq analysis. So far, I've used HISAT2 for alignment and HTSeq (htseq-count) for counting, and I have obtained a count file with gene IDs and their corresponding counts.
For this process, I've used reference files from Gencode:
- the FASTA file for the genome sequence, primary assembly (GRCh38) region: PRI
- the GTF file for comprehensive gene annotation, region: PRI.
However, my final result includes both coding and non-coding regions.
I would like to create separate files for just coding regions or just non-coding regions. Should I use different reference files for this annotation? If so, could you recommend the appropriate FASTA and GTF files for this purpose?
Thank you!
I initially conducted DEG analysis using raw data without distinguishing between coding and non-coding regions.
However, when examining the volcano plot, I noticed that many lncRNAs were significantly affected by the treatment conditions, which obscured the genes I am most interested in.
Therefore, I am considering reanalyzing the data using only the protein-coding regions to generate a DEG list and proceed with further downstream analyses. Is this approach statistically or analytically problematic?
Thank you!
Genes are tested independently in a DEG analysis. I don't understand what you mean by "obscured the genes I am most interested in": testing a larger number of genes (ncRNAs included) affects only the adjusted P-values, because the multiple-testing correction is then applied over more P-values.
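To illustrate the point about adjusted P-values, here is a minimal sketch. It assumes the Benjamini-Hochberg correction (the thread does not say which DE tool or correction was used), and the P-values are made-up numbers chosen only for illustration. The function name `bh_adjust` is hypothetical, not part of any package.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw P-values (assumption, for illustration only).
coding_p = [0.001, 0.02, 0.04]
noncoding_p = [0.5, 0.6, 0.8, 0.9]

# Correction over the coding genes alone vs. over all tested genes.
adj_coding_only = bh_adjust(coding_p)
adj_all = bh_adjust(coding_p + noncoding_p)

for raw, alone, together in zip(coding_p, adj_coding_only, adj_all):
    print(f"raw={raw:.3f}  padj(coding only)={alone:.4f}  padj(all genes)={together:.4f}")
```

The raw P-value of each gene is the same in both runs; only the adjusted values change with the number of P-values entering the correction.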
As stated before, given that the experiment also sequenced non-poly-A transcripts, you can't proceed without considering them. Just subset the resulting DE table into "coding" and "non-coding", if your study requires it, then proceed with visualization and so on.
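A rough sketch of that subsetting step, assuming the DE table has a `gene_id` column whose IDs match the `gene_id` attribute of the same GENCODE GTF used for counting, and that the GTF carries a `gene_type` attribute on its `gene` records (GENCODE does; Ensembl GTFs call it `gene_biotype`). The function names, the toy gene IDs and the toy DE rows are all hypothetical. Treating everything that is not `protein_coding` as "non-coding" is a simplification you may want to refine.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def gene_types_from_gtf(lines):
    """Map gene_id -> gene_type from the 'gene' records of a GENCODE-style GTF."""
    types = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only gene-level records
        gid = GENE_ID_RE.search(fields[8])
        gtype = GENE_TYPE_RE.search(fields[8])
        if gid and gtype:
            types[gid.group(1)] = gtype.group(1)
    return types


def split_de_table(rows, types):
    """Split DE rows into coding / non_coding / unknown (ID absent from the GTF)."""
    groups = {"coding": [], "non_coding": [], "unknown": []}
    for row in rows:
        gtype = types.get(row["gene_id"])
        if gtype is None:
            groups["unknown"].append(row)
        elif gtype == "protein_coding":
            groups["coding"].append(row)
        else:
            groups["non_coding"].append(row)
    return groups


def toy_gene_line(gene_id, gene_type):
    """Build one toy GTF 'gene' record (stand-in for reading the real file)."""
    attrs = f'gene_id "{gene_id}"; gene_type "{gene_type}"; gene_name "{gene_id}";'
    return "\t".join(["chr1", "HAVANA", "gene", "1", "1000", ".", "+", ".", attrs])


# Toy annotation and DE results (made-up IDs and values).
gtf_lines = [
    "##description: toy example",
    toy_gene_line("TOY_GENE_A.1", "protein_coding"),
    toy_gene_line("TOY_GENE_B.1", "lncRNA"),
]
de_rows = [
    {"gene_id": "TOY_GENE_A.1", "log2FoldChange": 1.8, "padj": 0.01},
    {"gene_id": "TOY_GENE_B.1", "log2FoldChange": -2.3, "padj": 0.002},
    {"gene_id": "TOY_GENE_C.1", "log2FoldChange": 0.4, "padj": 0.6},
]

groups = split_de_table(de_rows, gene_types_from_gtf(gtf_lines))
for name, rows in groups.items():
    print(name, [r["gene_id"] for r in rows])
```

With a real file, you would pass an open file handle for the GTF instead of `gtf_lines`. Note that the `padj` values are carried over unchanged from the full analysis, which is the point of subsetting after testing rather than before.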