Greetings! I hope everyone is doing well.
I have a question about duplicate Ensembl IDs in RNA-Seq data. I am using DESeq2 to analyze raw counts from a dataset in the GEO database. I imported the dataset with read.table, not tximport.
From my basic understanding of the RNA-Seq workflow, to prepare the data for DESeq2 (DESeqDataSetFromMatrix), the row names of the count data should be the gene/transcript identifiers (e.g. gene name or Ensembl gene ID). However, when I try to set the Ensembl IDs as row names, like this:
rownames(data_sharna)<-data_sharna$gene_id
I get the following error:
Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
When I check for duplicates with
sum(duplicated(data_sharna$gene_id))
I find 30 duplicated entries in my gene_id column (Ensembl ID).
I went ahead and removed the duplicates, keeping only the first occurrence of each ID:
data_sharna<-data_sharna[!duplicated(data_sharna$gene_id),]
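For context, here is a minimal sketch of what that step does, next to one alternative I have seen (summing the rows that share an ID). It runs on a made-up toy data frame, not on my real data: the IDs, sample names, and counts are all hypothetical, and I am not sure which option (if either) is appropriate here.

```r
# Toy stand-in for data_sharna; IDs and counts are made up for illustration.
toy <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B", "ENSG_A", "ENSG_C"),
  sample1 = c(10, 5, 3, 0),
  sample2 = c(20, 7, 1, 2)
)

# Show every row whose gene_id occurs more than once
# (all copies, not only the later ones that duplicated() flags).
dup_ids  <- unique(toy$gene_id[duplicated(toy$gene_id)])
dup_rows <- toy[toy$gene_id %in% dup_ids, ]
print(dup_rows)

# Option 1: keep the first occurrence of each ID (what I did above).
keep_first <- toy[!duplicated(toy$gene_id), ]
rownames(keep_first) <- keep_first$gene_id

# Option 2: sum the counts of all rows that share an ID.
# rowsum() uses the group labels as row names of the result.
summed <- rowsum(toy[, c("sample1", "sample2")], group = toy$gene_id)
print(summed)
```

With option 1 the counts in the later copies are silently dropped, while option 2 keeps them, so the two can give different values for the duplicated IDs.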
But now my question is: was it correct to do this? The data I am using is from the GEO database, and the description of how the raw counts were prepared reads as follows:
Raw sequencing data was demultiplexed by bcl2fastq v.2.20. Raw reads obtained from RNA-Seq were aligned to the transcriptome using STAR (version 2.5.0) (Dobin A et al., 2013) / RSEM (version 1.2.25) (Li B and Dewey CN, 2011) with default parameters, using a custom human GRCh38 transcriptome reference downloaded from http://www.gencodegenes.org, containing all protein coding and long non-coding RNA genes based on human GENCODE version 33 annotation.
Based on this description, my understanding is that, since the reads were aligned to the transcriptome, what is provided is transcript counts rather than gene counts, and that I should therefore use tximport to import the data. Is that correct?
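One quick check I could run on the table itself is sketched below, again on a made-up toy data frame (the IDs and values are hypothetical). It relies on two things I believe are true: human Ensembl gene IDs start with "ENSG" and transcript IDs with "ENST", and RSEM reports expected counts that can be non-integer, whereas DESeqDataSetFromMatrix expects integer counts.

```r
# Toy stand-in for the imported table; IDs and values are made up.
toy <- data.frame(
  gene_id = c("ENSG00000000001.1", "ENSG00000000002.3", "ENSG00000000003.2"),
  sample1 = c(10.00, 5.37, 0.00),
  sample2 = c(20.50, 7.00, 2.00)
)

# Tabulate the ID prefixes: "ENSG" suggests gene-level rows,
# "ENST" suggests transcript-level rows.
id_type <- table(substr(toy$gene_id, 1, 4))
print(id_type)

# Check whether any value is non-integer, as RSEM expected counts can be.
counts <- as.matrix(toy[, c("sample1", "sample2")])
has_fractional <- any(counts != round(counts))
print(has_fractional)
```

This does not tell me how the table was produced, only what kind of identifiers and values it contains.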
Thank you very much in advance for your help! Best, Ridha