Question

Rna-Seq Data Normalization With Spike-In Using Deseq

5

Entering edit mode

11.4 years ago

ivivek_ngs ★ 5.2k

Hi,

I need some inputs in normalizing the RNA-Seq data with spike-ins and using the DESeq to retrieve differentially expressed genes from the samples. I have a condition where I have 7 samples out of which 4 samples are of peripheries that give tumor and 4 are centers of tumor. I want to normalize the raw fragment counts(which you use in DESeq) with spike-in and then compute the DEGs from it. my samples data set looks like

head(m)
            Sample_118p.0 Sample_132p2.0 Sample_91p.0 Sample_118rz.0 Sample_132rz1.0 Sample_132rz2.0 Sample_91rz.0
XLOC_000001          1534           2603         1764           1057            2889            3830          1684
XLOC_000002           175            304          208            144             428             367           222
XLOC_000003            80            195          109            916            2515            2314          1082
XLOC_000004            49             66           54             51             127             219            94
XLOC_000005             0              0            0              0               0               0             0
XLOC_000006             0              1            0              0               0               0             0

spike-in data set

head(sp)
           Sample_118p.0 Sample_132p2.0 Sample_91p.0 Sample_118rz.0 Sample_132rz1.0 Sample_132rz2.0 Sample_91rz.0
ERCC-00009            49             66           54             51             127             219            94
ERCC-00025             9              7            6              5              14              21             8
ERCC-00031             0              0            0              0               1               1             0
ERCC-00034             1              3            2              0               6               6             4
ERCC-00035             5              7            7              9              32              38            21
ERCC-00042            43             78           56             73             202             199            98

I am using the spike ins sub category B which have equal concentrations so that the consistency is maintained

Now I want to use this in DESeq.

So what is the best possible way to implement this normalization on my RNA-Seq data and create the Newcountdata set object and then estimate size factors and then the dispersion (per-gene variance) to get the Differentially expressed genes from there. Does anybody have any idea about this? It will be good if anyone has used such scenarios can give me some idea about this problem?

differential-expression r • 13k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 11.4 years ago by ivivek_ngs ★ 5.2k

2

Entering edit mode

I'm assuming that you want to use the spike-ins simply for the size normalization, rather than estimating dispersion, correct? If so, you can actually manually set the size factors.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 11.4 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks , I have been able to normalize my RNA-Seq data with spike-ins and then used it to estimateDispersions to calculate the per gene variation and then use the negative binomial test to find the DEGs, but owing to the high complexity in my data set I cannot consider the result of DESeq as they fail the multiple testing correction and just on the basis of uncorrected p-val I don't see using those genes as there is another problem where I can see the read counts for some of my comparison is 0 so the mean is also 0 and hence the fold change despite of being statistically significant , it cant be considered. Have anyone of you faced such situations? I have already tried RankProd and Cuffdiff( not good results downstream). I am now trying DESeq, don't know what to do next.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 11.4 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

I'm not sure how highly complex your dataset is, it sounds fairly straight-forward. The presence of 0 counts isn't that uncommon, though you're most likely to see those when the counts overall are quite low, so they'll generally have crappy p-values. Without seeing enough of your data or any plots, it's rather difficult to give you any advice on how to proceed. In general, DESeq can deal with the 0-count scenario, but as you mentioned, the fold-change is not always the best metric to go by.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 11.4 years ago by Devon Ryan 105k

0

Entering edit mode

What have you tried so far? Where are you getting stuck in the analysis?

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 11.4 years ago by Josh Herr 5.8k

Ram · Answer 1 · 2014-07-29

For normalizing my RNA-seq with Sipke-In I did the Following:

Merged my reference FASTA and GTF files with Spike-In FASTA and GTF file then mapped my files with TopHat or STAR. Then:

TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff

Then I divided my gene FPKMs with the sum of the total Spike-In FPKMs in the Cuffdiff output "gene_exp.diff" file for both the control and the treatment columns, let's call it finalized FPKMs. Then subtracted the finlized FPKM of the treatment from the control to see which lines are up or down regulated and then calculated a fold change for each line.

Now, I want to use the DEseq R package since I with cuffdiff I can do a DE analysis with finalized FPKMs. I am thinking of running DEseq in this way:

TopHat -> HTseq -> merge read_count_tables -> DEseq

But for normalizing by the Spike-In I run DEseq by the Spike-In Size factors and not the experiment size factors.

I think I am in the right path, if not, can anybody let me know why?

My codes were:

library(DESeq)
countsTable <- read.delim("merged_counts_SpikeIn.txt",header=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head( countsTable)
conds <- factor( c( "untreated", "untreated", "treated", "treated", "treated" ) )
spikeincds <- newCountDataSet( countsTable, conds )
spikeincds <- estimateSizeFactors( spikeincds )
sizeFactors( spikeincds )

countsTable <- read.delim("merged_counts.txt",header=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head( countsTable)
conds <- factor( c( "untreated", "untreated", "treated", "treated", "treated" ) )
cds <- newCountDataSet( countsTable, conds )
sizeFactors( cds ) = sizeFactors( spikeincds )
sizeFactors( cds )
head(counts(cds,normalized=TRUE))
cds <- estimateDispersions( cds )
res = nbinomTest( cds, "untreated", "treated" )
head(res)
write.table((res),file="outputFile_SikeIn_Normolized.txt",sep="\t")

Ram · Answer 2 · 2013-09-13

0

Entering edit mode

11.3 years ago

Jonathanjacobs ▴ 280

How are you normalizing your data to ERCC spike in data? normalize.loess?

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 11.3 years ago by Jonathanjacobs ▴ 280

0

Entering edit mode

normalize.loess is from the affy package, so it'd be pretty odd to use it with DESeq. The normal way to do this would be to use the spike-in data to calculate the size factor.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 11.3 years ago by Devon Ryan 105k

0

Entering edit mode

Apologies - I was thinking RPKM, not count based data.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 11.3 years ago by Jonathanjacobs ▴ 280

0

Entering edit mode

dpryan79 -- Could you elaborate on how you use spike-in data to calculate the size factor? I currently have four conditions (WT vs. EX1 vs. EX2 vs. EX3). What I have been trying is to create the classic MA plot with only the ERCC spike-ins. So, y-axis is log2(EX1/WT) and x-axis is 0.5*log2(EX1*WT). Then I use loess regression to fit the spike-in data and use that loess line to normalize the counts for the entire data set. Is there a more straightforward way to do this?

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 11.2 years ago by jep • 0

Ram · Answer 3 · 2013-09-14

Thanks I have already identified it and implemented the method. But I am having trouble in edgeR. I want to see my DEGs and compare it with DESEq. Now the manual seems to be a bit complez for me. I have a problem where I would like to understand how to use the same data set that I used in DESeq and then apply the edgeR normalizing the data with spike-ins and then test for DE in edgeR. I am not being able to formulate the script. Can any one help . I am giving some sample data sets below.

I have started using edgeR, I need some input about the method.

I have a data set with read counts from two different patients ( lets say it looks like the below)

head(m)
            Sample_118z.0 Sample_132z.0 Sample_118p.0 Sample_132p2.0
XLOC_000001           626          3516          1534           2603
XLOC_000002            82           342           175            304
XLOC_000003           361          2000            80            195
XLOC_000004            30           143            49             66
XLOC_000005             0             0             0              0
XLOC_000006             0             0             0              1

A sample of the data file

I also have the spike ins data which should be used to normalized this counts

head(sp)
           Sample_118z.0 Sample_132z.0 Sample_118p.0 Sample_132p2.0
ERCC-00009            30           143            49             66
ERCC-00025             2            13             9              7
ERCC-00031             0             0             0              0
ERCC-00034             1             9             1              3
ERCC-00035            11            35             5              7
ERCC-00042            37           186            43             78

I have used the DESeq earlier with the same experimental conditions and normalized the read counts with spike ins data and then ran the n.binom test for extract DEGs , but I cannot figure out how to proceed the same in case of the edgeR. Can anyone help me?