I'm running DESeq2 on small RNA sequencing data. I constructed a .csv containing the raw count of each unique small RNA sequence across all my datasets. In the past I've successfully used DESeq2 on similar data, but this time the file is bigger: my CSV is >2 GB. I'm running into memory errors using 64-bit RStudio on a Windows machine with 64 GB of RAM.
This is all I'm trying to do right now:
library(DESeq2)
# Raw counts: one row per unique small RNA sequence, one column per sample
sRNA <- read.csv("deseq_input_all.txt", header = TRUE, row.names = 1)
# Sample metadata, including the "group" column used in the design
coldata <- read.csv("deseq2/coldata.csv", header = TRUE, row.names = 1)
dds <- DESeqDataSetFromMatrix(countData = sRNA, colData = coldata, design = ~ group)
dds <- DESeq(dds)
However, at the DESeq() stage RStudio maxes out the memory and stops. Am I doing something silly with the code above, or is my data simply too big for this analysis? Would it be worthwhile running it on a Linux machine, or running R separately from RStudio?
Any tips or advice would be appreciated.
EDIT: There are 34 million rows of data.
> dim(sRNA)
[1] 34760467       21
I'd tend to agree with @Carlo Yague's first point... 2 GB of raw counts seems odd to me. Can you show the output of
dim(sRNA)
Unless you have thousands of samples, the counts table should not be 2 GB.
There are 34 million rows (unique sequences) in the count table.
I have a feeling that you "counted" unique reads in a FASTQ file. That's not going to be useful for you. Align those reads to a genome, generate counts on the resulting alignments with featureCounts or htseq-count, and then use those counts. You'll suddenly find that you only have a few tens of thousands of rows, which makes rather more biological sense.
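For example, here's a minimal sketch of that align-then-count workflow done entirely in R via Rsubread (featureCounts' home package). The file names below (genome.fa, annotation.gtf, the sample FASTQs) are placeholders for your own data, and the alignment settings will likely need tuning for small RNA reads:

library(Rsubread)

# Build an index of the reference genome (only needs to be done once)
buildindex(basename = "genome_index", reference = "genome.fa")

# Align each sample's reads to the genome, producing one BAM per sample
fastqs <- c("sample1.fastq.gz", "sample2.fastq.gz")
bams   <- sub("\\.fastq\\.gz$", ".bam", fastqs)
align(index = "genome_index", readfile1 = fastqs, output_file = bams)

# Summarise the alignments to gene-level counts using a GTF annotation
fc <- featureCounts(files = bams,
                    annot.ext = "annotation.gtf",
                    isGTFAnnotationFile = TRUE,
                    GTF.attrType = "gene_id")

# fc$counts is a genes-by-samples integer matrix: tens of thousands of rows
# rather than 34 million, which is no problem for DESeq2
dim(fc$counts)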
Exactly. Just to clarify, the counts table should have samples as columns and genes as rows, so 20k-50k rows depending on annotation (for human/mouse).
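Continuing the hypothetical sketch above, a gene-level matrix like fc$counts drops straight into the original DESeq2 call (here coldata.csv is assumed to have one row per sample, in the same order as the columns of the count matrix, with a "group" column):

library(DESeq2)

# Sample metadata: rows must correspond to the columns of the count matrix
coldata <- read.csv("deseq2/coldata.csv", header = TRUE, row.names = 1)

# Same DESeq2 call as before, but on ~20k-50k genes instead of 34M sequences
dds <- DESeqDataSetFromMatrix(countData = fc$counts,
                              colData   = coldata,
                              design    = ~ group)
dds <- DESeq(dds)
res <- results(dds)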
Correct. For small RNAs this number of "genes" might be a bit different, but that's the gist. BTW, if you're mostly interested in a single type of small RNA, there are dedicated programs for most of them (e.g., miRDeep).