Hello, I have gene count and sample attributes file from gtex v8 portal. I want to first extract brain tissue gene count data only by using the samID coming from sample attributes.txt file. Once, I have gene count coming from brain cortex tissue, I want to normallize it using deseq2 package. This is what I have done:
GTEx_Analysis_v8_Annotations_SampleAttributesDS <- read.delim("GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt")
sample_attributes <- select(GTEx_Analysis_v8_Annotations_SampleAttributesDS,SAMPID,SMTS,SMTSD,SMAFRZE)
sample_attributes_braindata <- sample_attributes %>% filter(sample_attributes$SMTS == "Brain" & sample_attributes$SMAFRZE == "RNASEQ")
sample_attributes_braindata_cortex = sample_attributes_braindata[sample_attributes_braindata$SMTSD == "Brain - Cortex",]
dim(sample_attributes_braindata_cortex)
library(tidyverse)
GTEx_Analysis_gene_reads <- read_table2("GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct")
GTEx_Analysis_gene_reads <- data.frame(GTEx_Analysis_gene_reads)
(GTEx_Analysis_gene_reads)[1:5,1:5]
dim(GTEx_Analysis_gene_reads)
gtex_gene_name = GTEx_Analysis_gene_reads$Name
gtex_gene_description = GTEx_Analysis_gene_reads$Description
sample_attributes_braindata_cortex$SAMPID = gsub(pattern = "-", replacement = ".", x = sample_attributes_braindata_cortex$SAMPID, fixed = TRUE)
index= colnames(GTEx_Analysis_gene_reads) %in% sample_attributes_braindata_cortex$SAMPID
sum(index)
GTEx_Analysis_gene_reads_cortex = GTEx_Analysis_gene_reads[,index]
rownames(GTEx_Analysis_gene_reads_cortex) = gtex_gene_name
#write.table(GTEx_Analysis_gene_reads_cortex, file = "GTEx_Analysis_gene_reads_cortex_unnorm.txt", sep="\t", quote=FALSE, col.names=NA)
#GTEx_Analysis_gene_reads_cortex = read.delim("GTEx_Analysis_gene_reads_cortex_unnorm.txt",row.names = 1)
sample_info<-data.frame(condition=colnames(GTEx_Analysis_gene_reads_cortex), row.names=colnames(GTEx_Analysis_gene_reads_cortex))
library(DESeq2)
ds<-DESeqDataSetFromMatrix(countData=GTEx_Analysis_gene_reads_cortex, colData=sample_info, design= ~condition)
keep_genes<-rowSums(counts(ds))>0
ds<-ds[keep_genes,]
ds<-estimateSizeFactors(ds)
normalized_counts<-counts(ds, normalized=TRUE)
write.table(normalized_counts,file="GTEx_Analysis_gene_reads_cortex_normalized.txt", sep= "\t", quote = F, row.names = F)
But the normalized gene count that I get from this is not normalized. There is some error in this. I don't have any condition in this case. I have used condtion here as colnames which represent the sampleID or GTEX-1117F. Should I use condition as gene names to normalize it? I am not familiar with GTEx data that much and need some guidance. Thank you.
Hello, I do include covariates later. At first I just want to normalize the data and then I generate covariates file using peer tool and then regress it out from normalized gene count to generate residual expression
So i should remove sample_info line and then just use this?
That would give you library-normalized counts; yes.