Entering edit mode
4.5 years ago
StartR
▴
30
Hi, I have downloaded the HT seq counts data (gene expression data) from TCGA for BRCA. I have files in *.rsem.genes.results format
when I run command:
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
It is giving me this error:
reading in files with read_tsv
1 Error in computeRsemGeneLevel(files, importer, geneIdCol, abundanceCol, :
all(c(geneIdCol, abundanceCol, lengthCol) %in% names(raw)) is not TRUE
In addition: Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two
when I read one file through read_tsv, it looks like:
> tmp <- read_tsv(files[1])
Parsed with column specification:
cols(
gene_id = col_character(),
raw_count = col_double(),
scaled_estimate = col_double(),
transcript_id = col_character()
)
> tmp
# A tibble: 20,531 x 4
gene_id raw_count scaled_estimate transcript_id
<chr> <dbl> <dbl> <chr>
1 ?|100130426 0 0 uc011lsn.1
2 ?|100133144 65.5 0.00000178 uc010unu.1,uc010uoa.1
3 ?|100134869 32.5 0.000000639 uc002bgz.2,uc002bic.2
4 ?|10357 504. 0.0000316 uc010zzl.1
5 ?|10431 5298 0.000150 uc001jiu.2,uc010qhg.1
6 ?|136542 0 0 uc011krn.1
7 ?|155060 564 0.00000604 uc003wfr.3,uc003wft.3,uc003wfu.2,uc011kup.1
8 ?|26823 1 0.0000000788 uc011mlh.1
9 ?|280660 0 0 uc010nib.1
10 ?|317712 0 0 uc010ihw.1
# ... with 20,521 more rows
Is there a way tximport will deal with this kind of TCGA data?
Indeed, tximport looks for these columns: https://github.com/mikelove/tximport/blob/master/R/tximport.R#L394-L406
yes, I agree, but I got to process this data I got from TCGA - BRCA (HTseq counts) project, to make a matrix of TPM.
there are genes in the tmp obj starting from row 30 :
Can we make a matrix from this kind of data through tximport? I need a TPM matrix from TCGA BRCA data set, and I can get it if I will get txi$abundance ..
...but, wait a second, you have HT-seq counts data, so, why were you trying to import it as RSEM? - they are different.
If you have HT-seq counts, then import them to
DESeq2 via DESeqDataSetFromMatrix()
, or take a look here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#htseqHi, thanks, it makes sense, but I do not want to detect differentially expressed genes - I want to convert a matrix with counts (either raw or normalized) to TPM matrix, So count matrix that contains all samples in columns and all genes in rows can be converted into TPM matrix?
Before we become completely confused here, which data have you downloaded? It has been implied that you have both HT-seq and RSEM data, but these are produced from different methods. Only the HT-seq files will contain a raw count, whereas RSEM files will contain an estimated count. If you literally point me to a file that you downloaded, I will be in a better position to assist.
Generally, though, to calculate TPM, you need to know the gene CDS length, which you currently do not have; however, it can be obtained in different ways.
yes, i am sorry and thank you for being patient with me! this data is also new for me, so I followed these steps to download the data from TCGA using ""TCGAbiolinks" package in R, although I have download the whole data set, I am making it simpler here, to download for a few samples, so that it will be easier for you to see this data and get a grip:
If you will run these commands, you will see the files are created in dir "BRCA_test", each folder contains files with extension *.rsem.genes.results
The file looks like the one I mentioned in my original query, containing gene_id, raw_counts, scaled_estimates, transcript_id.
Now, I am able to get a dds object from DESeq2 package for this data using:
I used "raw_counts" to generate "BRCA_mat"
Now my question is from raw count matrix, can I get TPM matrix.
I am aware that I will need featureLength, meanFragmentLength to calculate TPM - but given the data I get from TCGA, I do not have this data on length.
So is it possible to get TPM or even FPKM matrix from raw count martix?
Even if I will get FPKM, I will convert it to TPM. Thanks.
I see - thanks. You can calculate TPM from the BRCA_mat object; however, you will still require the gene lengths. The TPM calculation is elaborated ere: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained.
DESeq2 does not produce TPM data. However, you could easily use DESeq2 to produce variance-stabilised or regularised log data, which are 'better' than TPM, in my opinion.
And even if i do this: dds <- estimateSizeFactors(dds) counts(dds, normalized=TRUE)
Will these normalized counts be equivalent to TPM?
I think not, This is just dividing each column of
counts(dds) by sizeFactors(dds)
but it will not normalize for gene length,
so I want TPM matrix from raw counts matrix. Please help. Thanks in advance.