Hi all, I'm new to RNA-seq analysis and confused about gene expression profiling. How do I obtain an overview of the expression profile, such as the total number of expressed protein-coding genes, non-coding genes and pseudogenes? I tried the workflow from the article "Toward a Reference Gene Catalog of Human Primary Monocytes" (https://doi.org/10.1089/omi.2016.0124):
- FASTQC
- Trimmomatic
- HISAT
- StringTie
- Cuffnorm (The FPKM >0.1 threshold was used to determine expressed transcripts)
- Cuffmerge
- Cuffdiff
The article also reports that, by applying an FPKM >0.1 threshold, the authors identified a total of 20,371 genes and 82,996 transcripts expressed in their monocyte datasets.
The part that confuses me is how to apply the FPKM >0.1 threshold, and which file I should apply it to (the cuffnorm output file gene.fpkm_table, or the transcript.gtf file)? And how did they determine the number of protein-coding genes, non-coding genes and pseudogenes among these 20,371 genes?
Many articles report the total number of genes and transcripts in their datasets, but I am really confused about how they obtain these numbers.
I would really appreciate some help understanding this. Thank you.
I don't want to distract from the main answers (since I think everybody needs to do some testing on their own data, meaning you wouldn't lock down the workflow ahead of time). However, in terms of the FPKM threshold, these may be relevant:
http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html
http://cdwscience.blogspot.com/2019/02/variance-stabilization-and-pseudocounts.html
You could also require genes to exceed a given FPKM threshold in a minimum number of samples, which is closer to what I have here (although that is much messier to look at, and one of my main points is that I think you need to build your own templates to make sure you understand everything that you are doing).
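As a rough illustration of that idea (not the article's actual method, which I can't confirm), here is a minimal Python sketch. It assumes a tab-delimited FPKM table with the gene ID in the first column and one column per sample, which is how I would expect a cuffnorm gene-level table to look. The function names, the toy table and the `biotype_by_gene` lookup are hypothetical; in practice the biotype would come from your annotation (for example the `gene_biotype` attribute in an Ensembl GTF).

```python
import csv
import io
from collections import Counter


def expressed_genes(table_text, fpkm_cutoff=0.1, min_samples=1):
    """Return gene IDs with FPKM > fpkm_cutoff in at least min_samples samples.

    Assumes a tab-delimited table: a header row, gene ID in column 1,
    then one FPKM column per sample.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    next(reader)  # skip the header row
    kept = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        gene_id = row[0]
        values = [float(v) for v in row[1:]]
        n_above = sum(v > fpkm_cutoff for v in values)
        if n_above >= min_samples:
            kept.append(gene_id)
    return kept


def count_biotypes(gene_ids, biotype_by_gene):
    """Tally the expressed genes per biotype (hypothetical lookup dict)."""
    return Counter(biotype_by_gene.get(g, "unknown") for g in gene_ids)


# Toy data, invented purely for illustration.
toy_table = (
    "tracking_id\tS1_0\tS2_0\tS3_0\n"
    "geneA\t0.0\t0.05\t0.2\n"
    "geneB\t5.0\t3.2\t4.1\n"
    "geneC\t0.0\t0.0\t0.0\n"
    "geneD\t0.15\t0.3\t0.0\n"
)
toy_biotypes = {
    "geneA": "protein_coding",
    "geneB": "protein_coding",
    "geneC": "pseudogene",
    "geneD": "lncRNA",
}

any_sample = expressed_genes(toy_table, fpkm_cutoff=0.1, min_samples=1)
two_samples = expressed_genes(toy_table, fpkm_cutoff=0.1, min_samples=2)
biotype_counts = count_biotypes(any_sample, toy_biotypes)

print(len(any_sample), "genes pass in at least 1 sample")
print(len(two_samples), "genes pass in at least 2 samples")
print(dict(biotype_counts))
```

With a real table you would read the file contents into `table_text` and try a few values of `min_samples`; the number of "expressed" genes can change quite a bit depending on that choice, which is part of why I think you need to test this yourself.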