Question

RNA velocity similar mtx for spliced and unspliced region (dropseq)

0

Entering edit mode

3.5 years ago

akilabioinfo ▴ 10

Hello All,

The RNA velocity pre-built index is it only compatible with 10 X? I have tried to do kb count using pre-built index for dropseq but I have noticed that the same number of barcodes (counts_unfiltered) align with spliced and unspliced regions, which are more unlikely.

Table 1: Number of barcodes aligned with spliced and unspliced region

Samples||Technique||Barcodes_Spliced||Barcodes_Unspliced
Sample1||Dropseq||976||976
Sample2||Dropseq||1638||1638

I have created an index again with the recent human genome build using kb ref to rectify this. I have used 500 GB and increased to 1TB memory of my cluster. But I have seen the output index files are in kb size? I don’t understand why it is so

Please help me!

Kb count command using Prebuilt index:

kb count \
  --h5ad \
  -i /home/ref_index/custom_index/prebuilt_index.idx -g /home/ref_index/custom_index/t2g.txt \
  -x DROPSEQ \
  -o /home/count/lammanno/$val \
  -c1 /home/ref_index/custom_index/spliced.txt \
  -c2 /home/ref_index/custom_index/unspliced.txt \
  --workflow=lamanno $(find -type f -name '*.fastq.gz'| sort)

Kb ref command for human genome

kb ref \
  -i ref_custom_index.idx \
  -g t2g_custom.txt \
  -f1 cdna_custom.fa \
  -f2 intron_custom.fa \
  -c1 spliced_custom.txt \
  -c2 unspliced_custom.txt \
  --workflow=lamanno \
  -n 8 \
  /home/ref_index/Homo_sapiens.GRCh38.dna.toplevel.fa.gz /home/ref_index/GRch38.gtf.gz

Dropseq RNAvelocity Kallisto-bustools • 2.1k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 3.5 years ago by akilabioinfo ▴ 10

score 1 · Answer 1 · 2022-01-28

1

Entering edit mode

3.5 years ago

dsull ★ 7.6k

The pre-built index should be compatible with any technology. There's nothing technology-specific about the index.

For kb ref, remove the -n 8 (this feature has a few issues).

Also, I don't recommend using $(find -type f -name '*.fastq.gz'| sort) -- the fastq files need to be supplied in a certain order (for dropseq, the first file contains the barcode+UMI and the second file contains the biological read).

How are you getting the "number of barcodes aligned with spliced and unspliced region"? To be honest, I'm not exactly sure what you mean by that. Are you looking at a particular gene?

Finally, I need more QC information -- how many reads are mapped? Also, load the matrix into Seurat/Scanpy and do some preliminary QC analysis. Also, run kb with --verbose so I can get some debugging information.

ADD COMMENT • link 3.5 years ago by dsull ★ 7.6k

0

Entering edit mode

Thanks for your reply.

If I remove the -n 8 flag, I am also I am getting the same issue. The output index file size is very small.
For 10X data, I have 6 fastq files, so to acquire fastq, i have written like that automatically. But for my dropseq files, if i gave fastq files as you have suggested now.

The number of barcodes --> Kallisto bus tools (Kb count) provides counts unfiltered folder in that we can see spliced_barcodes.txt and unspliced_barcodes.txt. In that, if i count and see the barcodes, both are the same. Also, If read spliced.mtx and unspliced.mtx through MMread and write the shape of the matrix, both are the same. But in the case of 10X, i am getting more barcodes in un spliced_barcodes.txt. Is it right? I am confused

Thanks, Akila

Please check the output of kb count files

ADD REPLY • link 3.5 years ago by akilabioinfo ▴ 10

0

Entering edit mode

How large is your ref_custom_index.idx compared to prebuilt_index.idx? The command seems to work on my end... Can you send me the download links to the fa.gz and gtf.gz files so I can diagnose? The idx file should be several gigabytes. Maybe also show me what happens when you run kallisto inspect ref_custom_index.idx

As for your barcodes, dropseq is different when it comes to barcodes (i.e. there's no predefined whitelist). In 10x, a whitelist is used and maybe some barcodes in the whitelist didn't have any spliced reads mapped. In dropseq, the barcodes are inferred from the reads themselves (i.e. the most common sequences that appear in the fastq files are used as barcodes) so it's less likely that there'd be a discrepancy between between spliced vs. unspliced.

That's my guess but again, please load your matrices into seurat/scanpy and QC it from there.

ADD REPLY • link 3.5 years ago by dsull ★ 7.6k

0

Entering edit mode

Thanks for your help.

I have downloaded my GTF file again and rerun the kbref command. Now index file is ~43GB. I have used the following version

curl -O ftp://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/ Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

curl -O ftp://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/ Homo_sapiens.GRCh38.105.gtf.gz

I am a new bee to single-cell analysis, I have some more doubts. Please clarify

Using this index, i have quantified using kb count and noticed in the output file

adata.h5ad file is the intersection of spliced and unspliced counts. Is it right?
adata.h5ad doesn't contain layers for spliced and unspliced counts. Do I need to create from spliced.mtx and unspliced.mtx file?

adata=anndata.AnnData(“/home/sample1/counts_unfilered/adata.h5ad”)

adata.layers[“spliced”]=mmread(“/home/sample1/counts_unfiltered/spliced.mtx”) adata.layers[“unspliced”]=mmread(“/home/sample1/counts_unfiltered/unspliced.mtx”)

but these matrix shapes are different so that I am unable to merge the matrix as layers

spliced matrix shape:(265225, 61541) unspliced matrix shape:(235018, 61541) Intersection(adata.h5ad) shape:(151978, 61541)

spliced.mtx is similar to count matrix what we obtain if we use transcriptome.idx? So can I proceed with preprocessing, clustering and DEG with this count file? But how will I do RNA velocity without unspliced counts?
Or else should I stick with the intersection count matrix of spliced and unspliced data for DEG and proceed with RNA velocity analysis. If so, Then how to merge splice and unspliced layers into adata?

Please help me to clarify!!!

Thanks Akila

ADD REPLY • link 3.5 years ago by akilabioinfo ▴ 10

0

Entering edit mode

The adata file produced by kb count should already contain layers for spliced and unspliced if you ran the --workflow lamanno.

In any case, you can use the function import_matrix_as_anndata(matrix_path, barcodes_path, genes_path) to create your adata.

E.g.:

adata = import_matrix_as_anndata(spliced_matrix_path, spliced_barcodes_path, spliced_genes_path)
adata.layers["spliced"] = adata.X
ndata = import_matrix_as_anndata(unspliced_matrix_path, unspliced_barcodes_path, unspliced_genes_path)
adata.layers["unspliced"] = ndata.X

ADD REPLY • link 3.5 years ago by dsull ★ 7.6k