Question

RNA-seq analysis > coding region vs. non-coding region

0

4 months ago

maplewj ▴ 20

Hello! I'm currently working on an RNA-seq analysis. So far, I've used HISAT2 for alignment and HTSeq for counting, and I have obtained a count file with gene IDs and their corresponding counts.

For this process, I've used reference files from Gencode:

the FASTA file for the genome sequence, primary assembly (GRCh38) region: PRI
the GTF file for comprehensive gene annotation, region: PRI.

However, my final result includes both coding and non-coding regions.

I would like to create separate files for just coding regions or just non-coding regions. Should I use different reference files for this annotation? If so, could you recommend the appropriate FASTA and GTF files for this purpose?

Thank you!

non-coding coding RNA-seq • 560 views

ADD COMMENT • link updated 4 months ago by Shred ★ 1.6k • written 4 months ago by maplewj ▴ 20

Ram · Answer 1 · 2024-08-01

1

4 months ago

Shred ★ 1.6k

What is the biological question?

You're analyzing a dataset generated with a protocol that isn't designed to profile only protein-coding genes (I suppose, given the results). Restricting the alignment space to the protein-coding portion of the transcriptome is a bad choice.

If you want to do a differential expression analysis, proceed with the htseq-count output for all transcribed regions, since the full set of counts is used to estimate library size, dispersions and other parameters. Run the DE analysis and then use the results for your purpose.

You can obtain the list of protein-coding genes in various ways, depending on the reference annotation you're using. If it's Ensembl, use BioMart.
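For a GENCODE GTF like the one mentioned in the question, the biotype can also be read straight from the annotation. The sketch below is only one possible approach: it assumes the standard nine-column GTF layout and the `gene_type` attribute that GENCODE uses (Ensembl GTFs name the same field `gene_biotype`). The function name, the toy gene IDs and the file name in the comment are made up for illustration.

```python
import re

# Matches quoted GTF attributes such as: gene_id "X"; gene_type "protein_coding";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def gene_types_from_lines(lines):
    """Return {gene_id: biotype} from the 'gene' records of GTF lines."""
    gene_types = {}
    for line in lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue                      # keep only gene-level records
        attrs = dict(ATTR_RE.findall(fields[8]))
        # GENCODE: "gene_type"; Ensembl: "gene_biotype"
        biotype = attrs.get("gene_type") or attrs.get("gene_biotype")
        if "gene_id" in attrs and biotype:
            gene_types[attrs["gene_id"]] = biotype
    return gene_types

# Toy input with hypothetical IDs; for a real file one could pass an open handle,
# e.g. gzip.open("annotation.gtf.gz", "rt") (hypothetical file name).
toy_gtf = [
    "##toy example\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "GENE_A.1"; gene_type "protein_coding"; level 2;\n',
    'chr1\tHAVANA\tgene\t200\t300\t.\t-\t.\tgene_id "GENE_B.1"; gene_type "lncRNA"; level 2;\n',
]
biotypes = gene_types_from_lines(toy_gtf)
protein_coding_ids = sorted(g for g, t in biotypes.items() if t == "protein_coding")
print(protein_coding_ids)
```

The gene IDs in the resulting dictionary should match those in the htseq-count output, provided the same GTF was used for counting.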

ADD COMMENT • link updated 4 months ago by Ram 44k • written 4 months ago by Shred ★ 1.6k

0

I initially conducted DEG analysis using raw data without distinguishing between coding and non-coding regions.

However, when examining the volcano plot, I noticed that many lncRNAs were significantly affected by the treatment conditions, which obscured the genes I am most interested in.

Therefore, I am considering reanalyzing the data using only the protein-coding regions to generate a DEG list and proceed with further downstream analyses. Is this approach statistically or analytically problematic?

Thank you!

ADD REPLY • link 4 months ago by maplewj ▴ 20

0

Genes are tested independently in the DEG analysis. I don't understand what you mean by "obscured the genes I am most interested in": testing a larger number of genes (including ncRNAs) affects only the adjusted P-values, because the multiple-testing correction is applied across more P-values.
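To make the point about adjusted P-values concrete, here is a small sketch. It assumes a Benjamini-Hochberg correction, which is a common choice but is not named anywhere in this thread, and the P-values are invented purely for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values (not from any real dataset).
coding_p = [0.001, 0.01, 0.03]      # protein-coding genes
noncoding_p = [0.2, 0.5, 0.8]       # lncRNAs

alone = bh_adjust(coding_p)
together = bh_adjust(coding_p + noncoding_p)[:len(coding_p)]

for raw, a, t in zip(coding_p, alone, together):
    print(f"raw={raw:.3f}  adjusted (coding only)={a:.3f}  adjusted (all genes)={t:.3f}")
```

The raw P-value of each gene is identical in both runs; only the adjusted value moves, and how much it moves depends on the P-values of the other genes in the test set.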

As stated before, given that the experiment also sequenced non-poly-A transcripts, you shouldn't proceed without considering them. Just subset the resulting DE table into "coding" and "non-coding", if your study requires it, then proceed with visualization and so on.
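The subsetting step could look roughly like the sketch below. It assumes the DE results are available as rows keyed by `gene_id` and that a gene-to-biotype mapping has been built from the annotation. The column names, gene IDs and values are hypothetical, and everything that is not `protein_coding` (lncRNAs, pseudogenes, genes missing from the mapping) is lumped into the second group, which may be too coarse for some studies.

```python
def split_by_biotype(de_rows, gene_types):
    """Split DE result rows into (protein-coding, everything else)."""
    coding, other = [], []
    for row in de_rows:
        if gene_types.get(row["gene_id"]) == "protein_coding":
            coding.append(row)
        else:
            # lncRNAs, pseudogenes, and IDs absent from the mapping land here
            other.append(row)
    return coding, other

# Hypothetical DE table computed on ALL genes, so the adjusted p-values
# already reflect the full test set.
de_table = [
    {"gene_id": "GENE_A", "log2FoldChange": 2.1, "padj": 0.001},
    {"gene_id": "GENE_B", "log2FoldChange": -3.4, "padj": 0.0001},
    {"gene_id": "GENE_C", "log2FoldChange": 0.2, "padj": 0.7},
]
# Hypothetical gene-to-biotype mapping taken from the annotation.
gene_types = {"GENE_A": "protein_coding", "GENE_B": "lncRNA", "GENE_C": "protein_coding"}

coding_rows, other_rows = split_by_biotype(de_table, gene_types)
print([r["gene_id"] for r in coding_rows])
print([r["gene_id"] for r in other_rows])
```

Each subset can then be plotted separately, for example as a volcano plot of the protein-coding rows only, without re-running the DE analysis.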

ADD REPLY • link 4 months ago by Shred ★ 1.6k