Hello All,
The RNA velocity pre-built index is it only compatible with 10 X? I have tried to do kb count using pre-built index for dropseq but I have noticed that the same number of barcodes (counts_unfiltered) align with spliced and unspliced regions, which are more unlikely.
Table 1: Number of barcodes aligned with spliced and unspliced region
Samples||Technique||Barcodes_Spliced||Barcodes_Unspliced
Sample1||Dropseq||976||976
Sample2||Dropseq||1638||1638
I have created an index again with the recent human genome build using kb ref to rectify this. I have used 500 GB and increased to 1TB memory of my cluster. But I have seen the output index files are in kb size? I don’t understand why it is so
Please help me!
Kb count command using Prebuilt index:
kb count \
--h5ad \
-i /home/ref_index/custom_index/prebuilt_index.idx -g /home/ref_index/custom_index/t2g.txt \
-x DROPSEQ \
-o /home/count/lammanno/$val \
-c1 /home/ref_index/custom_index/spliced.txt \
-c2 /home/ref_index/custom_index/unspliced.txt \
--workflow=lamanno $(find -type f -name '*.fastq.gz'| sort)
Kb ref command for human genome
kb ref \
-i ref_custom_index.idx \
-g t2g_custom.txt \
-f1 cdna_custom.fa \
-f2 intron_custom.fa \
-c1 spliced_custom.txt \
-c2 unspliced_custom.txt \
--workflow=lamanno \
-n 8 \
/home/ref_index/Homo_sapiens.GRCh38.dna.toplevel.fa.gz /home/ref_index/GRch38.gtf.gz
Thanks for your reply.
If I remove the -n 8 flag, I am also I am getting the same issue. The output index file size is very small.
For 10X data, I have 6 fastq files, so to acquire fastq, i have written like that automatically. But for my dropseq files, if i gave fastq files as you have suggested now.
The number of barcodes --> Kallisto bus tools (Kb count) provides counts unfiltered folder in that we can see spliced_barcodes.txt and unspliced_barcodes.txt. In that, if i count and see the barcodes, both are the same. Also, If read spliced.mtx and unspliced.mtx through MMread and write the shape of the matrix, both are the same. But in the case of 10X, i am getting more barcodes in un spliced_barcodes.txt. Is it right? I am confused
Thanks, Akila
How large is your ref_custom_index.idx compared to prebuilt_index.idx? The command seems to work on my end... Can you send me the download links to the fa.gz and gtf.gz files so I can diagnose? The idx file should be several gigabytes. Maybe also show me what happens when you run
kallisto inspect ref_custom_index.idx
As for your barcodes, dropseq is different when it comes to barcodes (i.e. there's no predefined whitelist). In 10x, a whitelist is used and maybe some barcodes in the whitelist didn't have any spliced reads mapped. In dropseq, the barcodes are inferred from the reads themselves (i.e. the most common sequences that appear in the fastq files are used as barcodes) so it's less likely that there'd be a discrepancy between between spliced vs. unspliced.
That's my guess but again, please load your matrices into seurat/scanpy and QC it from there.
Thanks for your help.
I have downloaded my GTF file again and rerun the kbref command. Now index file is ~43GB. I have used the following version
curl -O ftp://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/ Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
curl -O ftp://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/ Homo_sapiens.GRCh38.105.gtf.gz
I am a new bee to single-cell analysis, I have some more doubts. Please clarify
Using this index, i have quantified using kb count and noticed in the output file
adata.h5ad file is the intersection of spliced and unspliced counts. Is it right?
adata.h5ad doesn't contain layers for spliced and unspliced counts. Do I need to create from spliced.mtx and unspliced.mtx file?
adata=anndata.AnnData(“/home/sample1/counts_unfilered/adata.h5ad”)
adata.layers[“spliced”]=mmread(“/home/sample1/counts_unfiltered/spliced.mtx”)
adata.layers[“unspliced”]=mmread(“/home/sample1/counts_unfiltered/unspliced.mtx”)
but these matrix shapes are different so that I am unable to merge the matrix as layers
spliced matrix shape:(265225, 61541) unspliced matrix shape:(235018, 61541) Intersection(adata.h5ad) shape:(151978, 61541)
Please help me to clarify!!!
Thanks Akila
The adata file produced by kb count should already contain layers for spliced and unspliced if you ran the --workflow lamanno.
In any case, you can use the function
import_matrix_as_anndata(matrix_path, barcodes_path, genes_path)
to create your adata.E.g.: