Hi! My name is Rafa and I am a beginner in the world of scRNA-seq. I've been looking at workflows like https://scrnaseq-course.cog.sanger.ac.uk/website/index.html or https://broadinstitute.github.io/2019_scWorkshop/index.html#course-overview and I do not understand how the SCE object is created from the STARsolo alignment output.
I'm using the https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/neurons_900 dataset for practice because it doesn't take up much memory and keeps wait times short. I'm analyzing it with STARsolo, using the following command:
STAR --genomeDir /home/victor/Escritorio/Curso_Single_Cell/indices/STAR \
  --runThreadN 16 \
  --readFilesIn neurons_900_fastqs/neurons_900_S1_L001_R2_001.fastq,neurons_900_fastqs/neurons_900_S1_L002_R2_001.fastq \
    neurons_900_fastqs/neurons_900_S1_L001_R1_001.fastq,neurons_900_fastqs/neurons_900_S1_L002_R1_001.fastq \
  --soloType CB_UMI_Simple \
  --soloCBwhitelist /home/victor/Escritorio/Curso_Single_Cell/whitelist/737K-august-2016.txt \
  --outFileNamePrefix results/STAR/
After that, STARsolo returns raw and filtered output directories, each containing the matrix, barcodes and genes/features files. But when I load these three files and create an SCE object, the counts in the assay do not look correct.
> dir.name <- "/home/victor/Escritorio/Curso_Single_Cell/results/STAR/Solo.out/Gene/raw"
> list.files(dir.name)
[1] "barcodes.tsv" "genes.tsv" "matrix.mtx"
> sce <- DropletUtils::read10xCounts(dir.name, col.names = TRUE)
> sce
class: SingleCellExperiment
dim: 55487 737280
metadata(1): Samples
assays(1): counts
rownames(55487): ENSMUSG00000102693 ENSMUSG00000064842 ... ENSMUSG00000096730 ENSMUSG00000095742
rowData names(3): ID Symbol NA
colnames(737280): AAACCTGAGAAACCAT AAACCTGAGAAACCGC ... TTTGTCATCTTTAGTC TTTGTCATCTTTCCTC
colData names(2): Sample Barcode
reducedDimNames(0):
spikeNames(0):
altExpNames(0):
> summary(assay(sce, "counts"))
55487 x 737280 sparse Matrix of class "dgCMatrix", with 5113008 entries
i j x
1 2681 1 1
2 26019 1 1
3 30593 1 1
4 30624 1 1
5 30756 1 1
6 36144 1 1
7 38875 1 1
8 53732 1 1
9 46321 3 1
10 55399 5 1
11 4333 6 1
12 7768 6 1
13 10051 6 1
14 15470 6 1
15 25255 6 1
16 32249 6 1
17 33914 6 1
18 37100 6 1
19 40026 6 1
20 40180 6 1
21 41019 6 1
22 49661 6 1
23 49669 6 1
24 18081 7 1
25 16776 9 1
26 54018 11 1
27 272 12 1
28 9832 12 1
29 13560 12 1
30 14856 12 1
31 15490 12 1
32 18592 12 1
33 23950 12 1
34 25910 12 1
35 28138 12 1
36 28177 12 1
37 35881 12 1
38 36144 12 1
39 36692 12 1
40 37663 12 1
41 38459 12 1
42 39978 12 1
43 40156 12 1
44 41019 12 1
45 41030 12 1
46 43773 12 1
47 46411 12 2
48 48427 12 1
49 49388 12 1
50 49409 12 1
51 49414 12 2
52 50650 12 1
53 33914 14 1
... etc
I don't know why this is happening. Could it be because I still need to count the reads per gene? I thought that STARsolo performs both the mapping and the counting. If this is the reason, what should I do?
Thanks a lot!! :)
And what should the rownames be?
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.
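The raw matrix above has 737280 columns, which matches the number of barcodes in the 737K-august-2016 whitelist, so it seems likely that most columns are barcodes with few or no counts rather than real cells. One way to check this outside R is to count how many barcodes actually appear in matrix.mtx. The following is only a minimal sketch, assuming the file is a plain-text Matrix Market coordinate file whose data lines are "gene_index barcode_index count"; summarise_mtx is a hypothetical helper name, not part of STAR, and the tiny matrix is made-up data.

```shell
#!/bin/sh
# Hypothetical helper: summarise a Matrix Market count matrix (genes x barcodes).
# Prints the total number of barcodes, the number of barcodes with at least
# one count, and the sum of all counts, as tab-separated name/value pairs.
summarise_mtx() {
    awk '
        /^%/ { next }                 # skip the banner and comment lines
        !seen_size {                  # first data line: n_genes n_barcodes n_entries
            n_barcodes = $2
            seen_size = 1
            next
        }
        $3 > 0 {                      # one non-zero entry: gene, barcode, count
            if (!($2 in nonempty)) { nonempty[$2] = 1; n_nonempty++ }
            total += $3
        }
        END {
            printf "barcodes_total\t%d\n", n_barcodes
            printf "barcodes_nonempty\t%d\n", n_nonempty
            printf "counts_total\t%d\n", total
        }
    ' "$1"
}

# Example on a made-up 3-gene x 4-barcode matrix; barcode 2 has no counts.
example_mtx=$(mktemp)
cat > "$example_mtx" <<'EOF'
%%MatrixMarket matrix coordinate integer general
%
3 4 4
1 1 2
3 1 1
2 3 5
1 4 1
EOF
summarise_mtx "$example_mtx"
rm -f "$example_mtx"
```

On the real data this would be called as summarise_mtx results/STAR/Solo.out/Gene/raw/matrix.mtx. If barcodes_nonempty turns out to be far below barcodes_total, the sparse-looking summary is what a raw (unfiltered) matrix would be expected to look like, and the matrix in the filtered directory may be the more useful one to load.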