Hi all,
I'm trying to run a differential gene expression analysis where I want the report to contain ensemble_transcript_id as well.
direc<- 'E:/counts_sep/MSC/'
countFile<-paste(direc,'htseq.all.counts.txt',sep='')
print("Loading count data")
counts <- read.delim(file=countFile,header=TRUE,sep="\t")
dim(counts)
mart = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
#get the ensemble id you want to match
gene_anno_filtered = rownames(counts)
#now get the mart
dat = getBM(
values = gene_anno_filtered,
filters = c("ensembl_gene_id"),
attributes = c("ensembl_gene_id", "external_gene_name", "description", "ensembl_transcript_id", "ensembl_peptide_id"),
mart = mart
)
#now match the dat and your data
idx_gene = match(rownames(counts), dat$ensembl_gene_id)
counts$external_gene_name=dat$external_gene_name[idx_gene]
counts$Description=dat$description[idx_gene]
counts$ensembl_transcript_id =dat$ensembl_transcript_id[idx_gene]
counts$ensembl_peptide_id =dat$ensembl_peptide_id[idx_gene]
#getting data out
geneData <- data.frame(counts[,1:5])
print ("Gene data from counts table")
colnames(geneData)[4] <- "width"
print(head(geneData))
#OUTPUT EXAMPLE
# Chr Start End width Strand
#ENSMUSG00000000001 chr3 108107280 108146146 3262 -
#ENSMUSG00000000003 chrX 77837901 77853623 1599 -
#ENSMUSG00000000028 chr16 18780447 18811987 4722 -
#ENSMUSG00000000031 chr7 142575529 142578143 6343 -
#ENSMUSG00000000037 chrX 161117193 161258213 23080 +
#ENSMUSG00000000049 chr11 108343354 108414396 2662 +
countData<-data.frame(counts[,6:length(colnames(counts))])
print ("Counts data from counts table")
# Require at least 25% of samples to have count > 25
quant <- apply(countData,1,quantile,0.75)
keep <- which((quant >= 4) == 1)
countData <- countData[keep,]
dim(countData)
method <-'htseq'
the line:
quant <- apply(countData,1,quantile,0.75)
gives the following error:
Error in quantile.default(newX[, i], ...) :
missing values and NaN's not allowed if 'na.rm' is FALSE
When I execute this line :
countData<-data.frame(counts[,6:length(colnames(counts))])
and check >countData, it shows several attributes are NA.
MSC329_RU486 MSC334_CTR MSC336_TPA external_gene_name Description ensembl_transcript_id ensembl_peptide_id
ENSMUSG00000000001.4 3889 5379 5080 NA NA NA NA
ENSMUSG00000000003.15 0 0 0 NA NA NA NA
ENSMUSG00000000028.14 79 67 50 NA NA NA NA
ENSMUSG00000000031.15 17 0 14 NA NA NA NA
ENSMUSG00000000037.16 18 10 8 NA NA NA NA
Can someone please help me fix this?
By the looks of your code, I suspect you are trying to adapt an edgeR tutorial. You should store your data on a DGEList object, which is a list-based class with different slots for each king of data. This way, the counts are stored separately from the metadata, and it is easier to manipulate both:
Hi h.mon, thanks so much for the feedback.
the line :
returns:
Error in .isAllZero(counts) : counts must be positive finite values
is the error because I have nonnumerical rows and columns values? I'll do some more research and try to troubleshoot this. Thanks again!
WARNING
Please be aware that you are arbitrarily adding a transcript_id to each of your genes. Each gene can have multiple transcripts associated with it and in your case you are simply appending the transcript_id associated with the first occurrence of the gene_id in your biomaRt dataframe using match. Only for genes with a single transcript can you be sure that your association is correct.
Hi benformatic,
thanks for pointing this out. This basically voids everything that I'm trying to do and in light of this information, I should look for a way to run differential expression of 'transcripts' instead of genes. I'm not sure how to do that though. I was hoping that my output would contain multiple entries of the same gene matched with all the transcript ids.
I understand this is an old post, but want to point out for any current readers that Salmon is a transcript aware aligner that accomplishes this: https://salmon.readthedocs.io/en/latest/salmon.html