How to properly analyze many RDS files in Bioconductor
1
0
Entering edit mode
4.3 years ago
arjuncsubra ▴ 20

Link to the previous post for context: 0 Features for COVID-19 ExpressionSet through GEOQuery

Hi all,

I am analyzing a COVID-19 dataset (GSE150728) and my aim for this dataset is to create clustering maps.

For this dataset, I was instructed to download the RDS files present in the GSE150728_RAW.tar file from the GEO website in the previous post (instead of using GEOquery).

After doing this (and untaring the files), I decided to bind these separate files to create a single object for my later analyses.

library(Biobase)
library(SingleCellExperiment)
library(raster)

files <- list.files(path = "C:/Users/arjun/GSE150728", pattern = "\\.rds$", full.names = TRUE) #stacks all the .rds files in working folder
stack <- do.call("rbind", lapply(files, readRDS))

I decided to convert the stack object into a SingleCellExperiment object and tried running T-SNE analyses on the new SCE.

library(SingleCellExperiment)
library(tidyverse)
library(Seurat)

sce <- SingleCellExperiment(stack)

library(scran)
library(scRNAseq)
library(DropletUtils)
library(Matrix)

RunTSNE(sce)

However, I am met with the following error:

00:01:59 UMAP embedding parameters a = 0.9922 b = 1.112
00:01:59 Read 13 rows and found 3 numeric columns
Error in uwot(X = X, n_neighbors = n_neighbors, n_components = n_components,  : 
  n_neighbors must be smaller than the dataset size

After this setback, I decided to examine the contents of one of the .rds files in my working folder (using readRDS) and here is what I got:

$exon
34463 x 17057 sparse Matrix of class "dgCMatrix"
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]

KLK11          1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
CCDC159        1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
C7orf50        1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-1437A8.4  1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
EMP2           1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
IFI27L2        1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . ......
ANTXR1         1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-586K12.10 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

 ..............................
 ........suppressing 16991 columns and 34448 rows in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]

RNU6-1128P    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
SNORD116-23   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RNU4-38P      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-527L4.2  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-49K24.4  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-134J21.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-344P13.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

$intron
23682 x 17057 sparse Matrix of class "dgCMatrix"
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]

RP11-505K9.4 1 . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-718G2.5 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
C3orf22      1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
PIGZ         1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
C19orf18     1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . ......
SLC7A9       1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
C19orf57     1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-45K10.2 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

 ..............................
 ........suppressing 16991 columns and 23667 rows in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]

ABHD11       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RGL2         . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
EMD          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RP11-37E23.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
CTC-493L21.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
COX6A2       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
AL928768.3   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

$spanning
20176 x 17057 sparse Matrix of class "dgCMatrix"
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]

STX4    1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
GPAM    1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
RPL3    1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
NDUFAF1 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
MAPK1   1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
CYP24A1 . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
NUDT2   . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
C1orf52 . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

 ..............................
 ........suppressing 16991 columns and 20161 rows in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................
   [[ suppressing 66 column names ‘TTGCTAAGCAGT’, ‘AACGACGGGTCT’, ‘AACGGGGGCGAG’ ... ]]

ZNF784       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
SPATA19      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
KIAA0101     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
WBP5         . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
PTCHD3P1     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
TMED2        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
CTD-2589M5.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

What is going wrong here in my process? Also, could someone explain what exactly is in these .rds files and how they relate to creating clustering maps?

RNA-Seq Bioconductor R rna-seq • 3.6k views
ADD COMMENT
1
Entering edit mode

For standard scRNA-seq analysis, splitting intro intron, exon, and spanning is probably not necessary. You can combine the three matrices into one count matrix. Also, you shouldn't be row-binding the matrices for different samples. I'm not familiar with scran, but you likely want to make separate single cell experiments, and then merge the experiments together.

ADD REPLY
1
Entering edit mode

I would not combine them. I would subset specifically for the exonic reads.

stack <- do.call("rbind", lapply(files, function(x){ readRDS(x)$exon}))
ADD REPLY
1
Entering edit mode

It's been show that intronic reads in scRNA-seq predominantly come from nascent RNA, and thus represent legitimately sequenced transcripts. If you exclude introns you could be masking certain interesting phenomenon, such as increasing/decreasing transcription rate for a gene, or cell types with increased intron retention. See the literature for RNA velocity for more information.

ADD REPLY
1
Entering edit mode

This is not correct. You should only include intronic reads if you are interested in them. Counting the intronic reads will inflate your estimates of "steady-state" mRNA - which is the focus of most transcriptomic studies.

ADD REPLY
0
Entering edit mode

I have tried your command and it results in the following error message:

Error: Matrices must have same number of columns in rbind2(argl[[i]], r)

I also tried converting the first .rds object into a SingleCellExperiment using SingleCellExperiment(). I get the following error message:

Error in method(object) : all assays must have the same nrow and ncol

What is causing this? Are the matrices uneven, and if so, how do I fix it practically? Also, how do I combine the three count matrices to make one?

ADD REPLY
1
Entering edit mode
4.3 years ago

You cannot merge these files in this way. The RDS files in this case do not include equal numbers of rows or columns.

If you want to merge all the files you will need to cbind them after trimming them down to all have the same row.names.

I assume you want a giant matrix with genes as rows and cells as columns.

I suggest you do this:

## merging all exonic reads into a gene x cell matrix
goodnames <- Reduce(intersect,lapply(files, function(x) row.names(readRDS(x)$exon)))
stack <- do.call("cbind", lapply(files, function(x) readRDS(x)$exon[goodnames,]))

If you really want junction spanning reads you could the code below this - but I suggest you use the code above for just the exons because in the below snippet you lose many genes.

## merging all exonic reads (including junction mapping reads) into a gene x cell matrix
## I DO NOT RECOMMEND THIS as you lose 25% of your genes
goodnames <- Reduce(intersect,lapply(files, function(x){
  x <- readRDS(x)
  x <- intersect(row.names(x$exon),row.names(x$spanning))
  return(x)
}))

stack <- do.call("cbind", lapply(files, function(x){
  x <- readRDS(x)
  x <- x$exon[goodnames,] + x$spanning[goodnames,]
  return(x)
}))
ADD COMMENT
0
Entering edit mode

It works, thank you so much!

ADD REPLY

Login before adding your answer.

Traffic: 2459 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6