Hi,
I'm a PhD student very new to bioinformatics and I'm getting really confused about the best way to do differential gene expression analysis on TCGA data.
First, I planned to use the htseq-counts downloaded from xenabrowser by transforming them back and rounding to integers: round(((2^x) - 1), 0). Is rounding the counts acceptable practice for the input to DESeq2? I've read this thread, "A: Normalisation of RNAseq data from UCSC Xena Browser", and I guess it should be OK, but I remember stumbling on another thread where the conclusion was different (sorry, but I can't find it now), hence I started wondering whether it's acceptable after all.
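To make the question concrete, here is a minimal sketch of the back-transformation I have in mind, assuming the Xena values are stored as log2(count + 1). The toy matrix and its gene and sample names are made up for illustration and are not real TCGA data.

```r
# Toy matrix standing in for Xena values, assumed to be log2(count + 1).
# Gene and sample names are hypothetical.
true_counts <- matrix(
  c(0, 5, 120, 3, 48, 1000),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)
xena_log2 <- log2(true_counts + 1)

# Undo the transform and round away floating-point noise.
back_transformed <- 2^xena_log2 - 1
counts <- round(back_transformed)

# Guard against tiny negative values from numerical error,
# then store the matrix as integers while keeping its dimnames.
counts[counts < 0] <- 0
storage.mode(counts) <- "integer"

# Largest change introduced by rounding; it cannot exceed 0.5.
max_shift <- max(abs(counts - back_transformed))

print(counts)
print(max_shift)
```

An integer matrix like `counts` is the kind of object I would then pass as `countData` to `DESeqDataSetFromMatrix()`, together with the sample table and design.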
Another option I considered was to use tximport to read the transcript RSEM expected counts from the TOIL project, but there is no information about the transcript/effective length and I don't know how I can get it. There are also RSEM expected counts at the gene level, but I still can't use them without knowing the transcript length:
tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE) :
  all(c(abundanceCol, countsCol, lengthCol) %in% names(raw)) is not TRUE
In addition: Warning message:
Unnamed col_types should have the same length as col_names. Using smaller of the two.
Is it possible to obtain the transcript/effective lengths based on ENST ids? Or can you only do it with raw data? If it's not possible, then is it acceptable to use htseq-counts as described above? What's the best practice for DEG analysis of the publicly available TCGA data?
I'm sorry if these are stupid questions, but I've read numerous threads and couldn't come to any conclusion. Thank you for your help!
I cannot find the thread from support.bioconductor.org right now, but if memory serves, the DESeq2 developer at some point stated that rounding floats to integers is OK, since the difference between a float and its nearest integer is tiny while true biological changes between groups are expected to be much larger, so rounding has basically no influence. Therefore, converting back to the normal scale, rounding, and putting the result into DESeq2 should be OK.

Thank you! I believe it's this one: https://support.bioconductor.org/p/105964/ I wasn't sure, though, whether it's also fine for htseq-counts.