I am a newbie and come from a wet-lab background. Recently, we started doing RNA-Seq. Originally, our bioinformatics core was going to handle the analysis, but then that person went on sabbatical, so I started using Galaxy to analyze our data.
My PI has set parameters (based on the literature) to apply before proceeding with GO terms. One of the conditions is to include only genes with a CV less than or equal to 0.5. Can I do this in Galaxy? If not, could someone please tell me how I could do so manually?
I went through TopHat, Cufflinks, Cuffcompare, and Cuffdiff based on a colleague's recommendation. I also have a separate workflow of htseq-count followed by DESeq2.
CV calculations are necessary if you want to select stable, consistently expressed genes from your RNA-seq datasets. The calculation is straightforward and involves only the standard deviation and the mean: CV = SD/Mean. The CV tells you the extent of variability in each gene's expression relative to its mean. Your PI is telling you to include the genes that are stably expressed across replicates/experiments, i.e. those with a low CV (≤ 0.5).
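As a minimal sketch of the formula above, the following Python computes CV = SD/Mean per gene and keeps the genes with CV ≤ 0.5. The gene names and values are made up for illustration, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the thread does not say which SD is meant.

```python
import statistics

def coefficient_of_variation(values):
    """Return SD / mean for one gene, or None if the mean is zero."""
    mean = statistics.mean(values)
    if mean == 0:
        return None  # CV is undefined when the mean is zero
    return statistics.stdev(values) / mean  # sample SD (n - 1 denominator)

# Hypothetical normalized expression values across four replicates
expression = {
    "geneA": [100.0, 110.0, 95.0, 105.0],  # stable
    "geneB": [5.0, 60.0, 0.0, 120.0],      # highly variable
    "geneC": [0.0, 0.0, 0.0, 0.0],         # not expressed
}

cv_by_gene = {g: coefficient_of_variation(v) for g, v in expression.items()}

# Keep genes with a defined CV of at most 0.5
stable_genes = [g for g, cv in cv_by_gene.items() if cv is not None and cv <= 0.5]
print(stable_genes)
```

Genes with a mean of zero are skipped here rather than counted as stable; whether that is the right choice depends on how your PI wants unexpressed genes handled.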
I am not sure whether Galaxy can do basic statistical calculations on tabular data. To calculate CV, you can use a database such as psql or a spreadsheet such as Excel. You can calculate CV on the htseq-count raw data and then proceed to the DESeq package. Most gene expression packages calculate a dispersion, which accounts for CV.
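One way to do this manually outside Galaxy is sketched below in Python, using only the standard library. It assumes a tab-separated matrix with a header row, gene IDs in the first column, and one column of htseq-count raw counts per sample; that layout, the file name in the comment, and the sample names are hypothetical. Because raw counts scale with sequencing depth, the sketch first converts them to counts per million (CPM), which is an added assumption rather than something stated in this thread.

```python
import csv
import io
import statistics

# Hypothetical merged htseq-count table (tab-separated); in practice,
# replace io.StringIO(...) with open("counts.tsv", newline="")
raw_table = io.StringIO(
    "gene\ts1\ts2\ts3\n"
    "geneA\t100\t200\t150\n"
    "geneB\t10\t300\t5\n"
    "geneC\t890\t1500\t1345\n"
    "__no_feature\t50\t70\t60\n"
)

reader = csv.reader(raw_table, delimiter="\t")
samples = next(reader)[1:]
counts = {
    row[0]: [int(x) for x in row[1:]]
    for row in reader
    if not row[0].startswith("__")  # skip htseq-count summary rows
}

# Library size = total counted reads per sample
library_sizes = [sum(c[i] for c in counts.values()) for i in range(len(samples))]

def cv_from_counts(gene_counts):
    """CV of CPM-normalized counts; None if the gene has zero mean."""
    cpm = [c * 1e6 / size for c, size in zip(gene_counts, library_sizes)]
    mean = statistics.mean(cpm)
    return None if mean == 0 else statistics.stdev(cpm) / mean

cv_by_gene = {gene: cv_from_counts(c) for gene, c in counts.items()}
stable = [g for g, cv in cv_by_gene.items() if cv is not None and cv <= 0.5]
print(stable)
```

The CV filter here is only for selecting genes; DESeq2 itself expects the unnormalized htseq-count values as input.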
Thank you. I'll calculate it with the htseq-count data. Is it also acceptable to calculate the stdev and mean from the Cufflinks FPKM? This is for my own understanding and for further explanation to my PI.
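I want to extract unstable/inconsistently expressed genes from gene expression data. I used CV, but I got this unusual result in terms of the CV value range: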
> summary(CV)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.04753  0.12946  0.16494  0.20181  0.22925 15.00777
I think CV should not be more than 1; please correct me if I am wrong. Also, how can I retain the genes that show a high amount of variation in expression level? Any ideas?
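On the range question: CV is the ratio of SD to mean, so it has no upper bound of 1 and exceeds 1 whenever the SD is larger than the mean. The small Python sketch below, with made-up values, illustrates this and shows one possible way to retain the most variable genes by ranking on CV. The cutoff of 1.0 is purely illustrative, and the sample SD (n − 1 denominator) is assumed.

```python
import statistics

def cv(values):
    """Sample SD divided by the mean (assumes a non-zero mean)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical expression values across four samples
genes = {
    "steady":  [50.0, 52.0, 48.0, 50.0],
    "bursty":  [0.0, 0.0, 0.0, 400.0],  # expressed in one sample only
    "varying": [10.0, 40.0, 5.0, 25.0],
}

cv_values = {name: cv(vals) for name, vals in genes.items()}

# A gene detected in only one of n samples has CV = sqrt(n) with the
# sample SD (here sqrt(4) = 2), so values above 1 are expected.
ranked = sorted(cv_values, key=cv_values.get, reverse=True)

# Retain genes whose CV exceeds an illustrative cutoff
high_variation = [g for g in ranked if cv_values[g] > 1.0]
print(ranked)
print(high_variation)
```

Very large CV values often come from genes with a mean close to zero, so it may be worth filtering out lowly expressed genes before ranking.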
Thank you for replying, Kevin. I am trying to learn bioinformatics for myself and our lab. It is definitely an essential skill to have. After obtaining the raw counts for my RNA-Seq samples from Kallisto, can I then determine differentially expressed genes with DESeq2? Could I use DESeq2 through Galaxy after I obtain the counts from Kallisto? Thanks!
I hope that a tool like Galaxy accepts Kallisto-derived counts, or at least a custom matrix of counts. However, if the HTSeq option is already built into Galaxy, then you should stick with HTSeq. As far as I recall, you'll therefore have to align the reads to produce a BAM file, from which htseq-count derives the counts (Kallisto and other modern tools don't require a BAM alignment).
Yes, I did need the BAM files for htseq-count. As there will be more RNA-seq data coming, I would like to know quicker methods of quantification. In the near future I'll find out whether Galaxy accepts the Kallisto counts.
The tutorial has helped greatly.
Yes, you can also calculate CV from FPKM. FPKM is also a normalized count.