Trouble quantifying scRNAseq experiment
0
0
Entering edit mode
4.6 years ago

Hi,

I am very new at bioinformatics and have quite limited trainings. I am trying to look at a publically available data set (SRP182816, Drosophila testis single cell RNAseq from Witt et al 2019 eLife), but it seems like I can't quite get the correct quantification.

I downloaded the data from the SRA site. I then used all-transcript fasta file from flybase (dmel-all-transcript-r6.33.fasta.gz) to create an index with salmon. I followed up with quantification of the experiment with salmon alevin.

This is how I called salmon.

salmon alevin -l ISR -1 517_novogene_S1_L008_R1_001.fastq.gz -2 517_novogene_S1_L008_R2_001.fastq.gz --chromium -i Dmel_Ref_flybase_salmon/ -o SalmonAlevin_out --tgMap t2g.tsv

Salmon had mapping rate of 39.7% and gave 3001 cells (the authors sequenced 5000 cells). When I take a look at the PCA plot for the quantified gene by cell table, I had a horseshoe shaped PCA plot (quite literally look like a nike logo but a bit more symmetrical). I am under impression that this indicates something is wrong, but I cannot figure out what to do at this moment.

Here is the PCA plot: https://postimg.cc/hQs1C6Vh

The PCA plot is made with Seurat with the following steps.

library(Seurat)
library(tximport)
files <- file.path("SalmonAlevin_out_ensembl/alevin/quants_mat.gz")
txi <- tximport(files, type="alevin")
testis <- CreateSeuratObject(counts = txi$counts , min.cells = 3, min.features = 200, project = "testis scRNAseq")
testis <- NormalizeData(testis, normalization.method = "LogNormalize", scale.factor = 10000)
testis <- FindVariableFeatures(testis)
testis <- ScaleData(testis, features = all.genes)
testis <- RunPCA(testis, features = VariableFeatures(object = testis))
DimPlot(testis, reduction = "pca")

Can someone more knowledgeable please let me know what may cause this and how would I fix it?

Thank you, Chien

RNA-Seq • 1.7k views
ADD COMMENT
0
Entering edit mode

How did you create the PCA? Please show code and the image. Did you normalize data? If so how? Be aware that scRNA-seq is among the more difficult-to-analyse types of data. I suggest (unless you already do so) a guided tutorial such as the ones on the Seurat website.

ADD REPLY
0
Entering edit mode

Hi, thank you for the suggestion. I did follow the Seurat website and used their package to generate the PCA plot. I used default parameter for all the steps, but I did not add mt or rrna data and did not remove cells based on these. Here is the PCA plot: https://postimg.cc/hQs1C6Vh

I edited the post to add the codes.

ADD REPLY
0
Entering edit mode

Did you make sure the library is "ISR" ? Also are you using right annotations ? 10x data is 3' enriched, so if your GTF doesn't include 3' UTR coordinates, lot of data will be unquantifiable. Also Could you run UMAP and plot UMAP ?

testis <- FindNeighbors(testis , dims = 1:10)
testis <- FindClusters(testis , resolution = 0.5)
testis <- RunUMAP(testis , dims = 1:10)
DimPlot(testis , reduction = "umap")
ADD REPLY
0
Entering edit mode

OP uses salmon (alevin is the scRNA-seq submodule of it) to directly quantify against the transcriptome, so this should be ok. ISR is indeed recommended for 10x data in this context.

ADD REPLY
0
Entering edit mode

Ok. I meant to check if that the transcriptome was built using a right GTF.

ADD REPLY
0
Entering edit mode

Hi, for transcriptome, I used fasta file from flybase. The umap look like this:https://postimg.cc/RJZFdYQS

ADD REPLY
0
Entering edit mode

I suggest you switch from log-normalization to scTransform, also implemented in Seurat. It is a much more sophisticated normalization method.

ADD REPLY
0
Entering edit mode

The authors say in the Methods section that they used CellRanger to process their sequencing reads. Is there any particular reason (bseides speed) why you are using Salmon? As far as I know it's a standard practice to use CellRanger software when working with 10X data, and I wouldn't always expect Salmon to reproduce CellRanger results exactly

ADD REPLY
0
Entering edit mode

Hi, Thanks for pointing it out. I had tried CellRanger with dmel-all-r6.32.gtf and the dna.toplevel.fa from ensembl to make the reference. However, the quantification coming from CellRanger gave weird non-sensible results in my case (majority of the cells have a single feature with nCount_RNA of 1). It also gave me 50003 cells which is a lot more than the authors submitted for sequencing.

Here is the the code:

cellranger mkref --genome=Dmel_genome --fasta=Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa --genes=dmel-all-r6.32.gtf --memgb=32

and the count:

cellranger count --id=CellRanger_count --transcriptome=Dmel_genome --fastqs='path to fastq' --sample='SRR9705086'

I previously used fasterq-dump from SRA toolkit to download the fastq files, but I just found that the fastq files can also be accessed from https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP182816. Is there any difference comparing the two method of downloading the fastq files?

Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 2476 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6