Coefficient of variation
7.2 years ago
nicoles ▴ 10

I am a newbie and I come from a background of wet-lab experience. Recently, we have started doing RNA-Seq. Originally, our bioinformatics core was going to handle the analysis, but then that person went on sabbatical, so I started using Galaxy to analyze our data. My PI has set parameters (based on the literature) to apply before proceeding with GO terms. One of the conditions is to include only genes with a CV of less than or equal to 0.5. Can I do this in Galaxy? If not, could someone please tell me how I could do so manually?

I went through TopHat, Cufflinks, Cuffcompare and Cuffdiff based on a colleague's recommendation. I also have a separate workflow of htseq-count followed by DESeq2.

Any help will be greatly appreciated.

Thanks!

RNA-Seq Galaxy Coefficient of variation. • 9.6k views
7.2 years ago
Renesh ★ 2.2k

CV calculations are useful if you want to select stably and consistently expressed genes from your RNA-seq dataset. The calculation is straightforward: CV = SD / mean, computed for each gene across samples. The CV tells you how variable each gene's expression is relative to its mean. Your PI is asking you to include only the genes that are stably expressed across replicates/experiments, i.e. those with a low CV (≤ 0.5).

I am not sure whether Galaxy can do basic statistical calculations on tabular data. To calculate the CV, you can use a database such as PostgreSQL (psql) or a spreadsheet such as Excel. You can calculate the CV on the htseq-count raw counts and then proceed to the DESeq package. Most gene expression packages estimate a dispersion, which is related to the CV.
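If R is an option, the same calculation can be sketched in a few lines of base R. This is only a minimal illustration, not a recommended pipeline: the `counts` matrix and its gene names are made up, and it assumes one row per gene and one column per replicate. Genes with a mean of 0 are left as `NA` rather than dividing by zero.

```r
# Hypothetical toy matrix: rows are genes, columns are replicate samples.
counts <- matrix(
  c(100, 110,  90, 105,   # geneA: stable across replicates
     10, 200,  50,   5,   # geneB: highly variable
      0,   0,   0,   0),  # geneC: not expressed (mean 0, CV undefined)
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("rep", 1:4))
)

gene_mean <- rowMeans(counts)
gene_sd   <- apply(counts, 1, sd)

# CV = SD / mean; set to NA where the mean is 0 to avoid dividing by zero.
cv <- ifelse(gene_mean > 0, gene_sd / gene_mean, NA_real_)

# Keep genes with CV <= 0.5 (the threshold from the question).
stable <- counts[!is.na(cv) & cv <= 0.5, , drop = FALSE]
print(round(cv, 3))
print(rownames(stable))
```

With real data you would read the htseq-count (or FPKM) table into a matrix in place of the toy `counts` object.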

Thank you. I'll calculate it with the htseq-count output. Is it also acceptable to calculate the SD and mean from the Cufflinks FPKM values? I ask for my own understanding and so that I can explain it further to my PI.

Yes, you can also calculate the CV from FPKM. FPKM is a normalized expression value.

I want to extract unstable/inconsistently expressed genes from gene expression data, and I calculated the CV as follows:

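# Per-gene standard deviation across samples (rows are genes)
SD <- apply(eset_HTA20, 1, sd)
# Log-normal CV formula; it assumes SD was computed on natural-log values
CV <- sqrt(exp(SD^2) - 1)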

but I got this unusual result in terms of CV value range:

> summary(CV)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.04753  0.12946  0.16494  0.20181  0.22925 15.00777

I think the CV should not be more than 1; please correct me if I am wrong. Also, how can I retain the genes that show a high amount of variation in expression level? Any ideas?
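For what it is worth, a CV has no upper bound of 1: it exceeds 1 whenever the SD is larger than the mean. The formula `sqrt(exp(SD^2) - 1)` gives the CV of log-normal data when SD is measured on natural-log values. The sketch below assumes, hypothetically, that the matrix holds log2 values (the thread does not say what scale `eset_HTA20` is on), in which case the SD would first need converting to natural-log units. The `log2_expr` matrix and the 0.5 cutoff are made up for illustration.

```r
# Hypothetical log2 expression matrix: rows are genes, columns are samples.
log2_expr <- rbind(
  gene1 = c( 8.0,  8.1, 7.9,  8.2, 8.0,  7.8),  # stable
  gene2 = c( 4.0, 12.0, 5.0, 11.0, 6.0, 13.0),  # highly variable
  gene3 = c(10.0, 10.3, 9.8, 10.1, 9.9, 10.2)   # stable
)
colnames(log2_expr) <- paste0("s", 1:6)

sd_log2 <- apply(log2_expr, 1, sd)
sd_ln   <- sd_log2 * log(2)        # convert SD from log2 to natural-log units
cv      <- sqrt(exp(sd_ln^2) - 1)  # log-normal CV on the original (linear) scale

# Retain the variable genes, here those with CV above an illustrative cutoff.
variable <- log2_expr[cv > 0.5, , drop = FALSE]
print(rownames(variable))
```

Sorting `cv` in decreasing order and taking the top N genes is an alternative to a fixed cutoff.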

7.2 years ago
nicoles ▴ 10

Thank you for replying, Kevin. I am trying to learn bioinformatics for myself and our lab. It is definitely an essential skill to have. Once I have obtained the raw counts for my RNA-Seq samples from Kallisto, can I then determine differentially expressed genes with DESeq2? Could I use DESeq2 through Galaxy after I obtain the counts from Kallisto? Thanks!

I would hope that a tool like Galaxy accepts Kallisto-derived counts, or at least a custom matrix of counts. However, if the HTSeq option is already built into Galaxy, then you should stick with HTSeq. As far as I recall, that means you will have to align the reads to produce a BAM file, from which HTSeq counts the reads (Kallisto and other modern tools don't require a BAM alignment).

There is a great tutorial here for RNA-seq and Galaxy, which you may have already seen: https://galaxyproject.org/tutorials/rb_rnaseq/

Yes, I did need the BAM files for htseq-count. As there will be more RNA-seq data coming, I would like to learn quicker methods of quantification. In the near future I'll find out whether Galaxy accepts the Kallisto counts. The tutorial has helped greatly.
