Kallisto Bustools Single Cell Velocity Issue: capture and count not working as expected
1
0
Entering edit mode
4.1 years ago
ericrkofman ▴ 20

Hi,

I have been walking through the tutorials at https://www.kallistobus.tools/velocity_tutorial.html and https://www.kallistobus.tools/velocity_index_tutorial.html, and have run into a few issues including new versions of bustools using slightly altered commands than what the tutorial specifies.

Even after untangling some of these issues and getting to a point where I think this should be working, I'm still running into trouble with actually getting count to output the data I would expect.

For reference, I first ran:

kallisto bus -i ../velocity_index/cDNA_introns.idx -o bus_output_06/ -x 10xv2 -t 4 TYR4/TYR4_0_1_HC3HMDSXY/bamtofastq_S1_L004_R1_001.fastq.gz TYR4/TYR4_0_1_HC3HMDSXY/bamtofastq_S1_L004_R2_001.fastq.gz

This produces the output.bus, matrix.ec and transcripts.txt files in the bus_output_06 folders. Then, we have to run

bustools correct -w ../10xv2_whitelist.txt -p output.bus | bustools sort -o output.correct.sort.bus -t 4 -

This yields the output.correct.sort.bus files. Using that, then I run:

bustools capture -s -o cDNA_capture.bus -c cDNA_transcripts.to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

bustools capture -s -o introns_capture.bus -c introns_transcripts.to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

Followed by the count step:

bustools count -o u -g cDNA_introns_t2g.txt -e matrix.ec -t transcripts.txt --genecounts cDNA_capture.bus 
bustools count -o s -g cDNA_introns_t2g.txt -e matrix.ec -t transcripts.txt --genecounts introns_capture.bus

But I get empty outputs:

-rw-r--r-- 1 ekofman ---- 607K Oct  7 07:51 s.barcodes.txt
-rw-r--r-- 1 ekofman ----    0 Oct  7 07:51 s.genes.txt
-rw-r--r-- 1 ekofman ----  114 Oct  7 07:51 s.mtx
drwxr-xr-x 2 ekofman ----    2 Oct  6 11:33 tmp
-rw-r--r-- 1 ekofman ----  52M Oct  5 17:01 transcripts.txt
-rw-r--r-- 1 ekofman ---- 618K Oct 19 12:25 u.barcodes.txt
-rw-r--r-- 1 ekofman ----    0 Oct 19 12:25 u.genes.txt
-rw-r--r-- 1 ekofman ----  114 Oct 19 12:25 u.mtx

So! What could be going wrong? My first thought was maybe the transcripts are mismatched, from the velocity_index_tutorial page.

So let's check:

My cDNA_introns_t2g.txt file looks like this, with 2244276 lines:¶

ENST00000237247 ENSG00000118473 SGIP1

ENST00000237247 ENSG00000118473 SGIP1

ENST00000237247 ENSG00000118473 SGIP1

My matrix.ec looks like this, with 12216809 lines:¶

0 0

1 1

2 2

3 3

4 4

My transcripts.txt file looks like this:

ENST00000237247.1

ENST00000237247.2

ENST00000237247.3

ENST00000237247.4

My cDNA_capture.bus (converted to text, 1663579 lines) looks like this:

AAACCTGAGTTAAGTG CCGAGCTCAA 2467401 1

AAACCTGAGTTAAGTG CGCTTTAGTA 11871347 1

AAACCTGAGTTAAGTG CGGCACGAGG 11781147 1

AAACCTGAGTTAAGTG CGGGTTTCAC 590261 1

I honestly have no clue what is wrong. I intended to use transcripts without version IDs, and I think that the increasing number after the decimal place is supposed to be there, added during the index-making stage as specified by the tutorial. I feel like I'm very close to breaking through on this and finally getting results, but the tutorial being out of date and hard to debug makes it tough! It would be great to hear from anybody else who maybe has dealt with similar issues and might know what to do about this. Thanks!

kallisto bustools single-cell velocity rna • 2.2k views
ADD COMMENT
1
Entering edit mode

Hmm, something's off with your cDNA_introns_t2g.txt file.

If you look at the example cDNA_introns.t2g.txt.gz file supplied in the tutorial, it is formatted with version numbers AND has special designations for intronic regions (e.g. ENST00000417324.21-I).

Can you show how you're building the index?

I'd recommend following the notebook tutorials here: https://www.kallistobus.tools/tutorials

ADD REPLY
0
Entering edit mode

@dsull sorry for the delay

I actually followed along exactly with the index-building tutorial, except that I had to alter a few steps:

I actually had to use

./t2g.py --use_version < gencode.v19.annotation.gtf > tr2g.txt

I also had to change the introns command...

cat introns.fa | awk '/^>/ {print $0}' | tr "_" " " | awk '{print $1}' | cut -c2- > introns_transcripts.txt
cat introns_transcripts.txt | awk '{print $0"."NR"-I"}' > introns_transcripts.to_capture.txt

also this command was different (re-headering the cDNA fasta)

awk -v var=1 'FNR==NR{a[NR]=$0;next}{ if ($0~/^>/) {print a[var], var++} else {print $0}}' cDNA_fasta_header.txt cDNA.fa > cDNA.correct_header.fa
ADD REPLY
0
Entering edit mode

In the tutorial, they specify that the cDNA_introns_t2g.txt file should be generated as follows:

cat cDNA_t2g.txt introns_t2g.txt > cDNA_introns_t2g.txt

But based on the tutorial, it doesn't seem like introns_t2g.txt is supposed to have the "-I" labels...

But you're right that their example file does have the "-I" labels. That's annoying...

ADD REPLY
1
Entering edit mode
4.1 years ago
ericrkofman ▴ 20

Yes @dsull was right. The tutorial is not correct, and you have to make sure that the cDNA_introns_t2g.txt file looks like this, where the first section of the file refers to cDNA transcripts, and the second half refers to intronic transcripts and thus has the "-I" label affixed:

ENST00000237247.1       ENSG00000118473 SGIP1
ENST00000237247.2       ENSG00000118473 SGIP1
ENST00000237247.3       ENSG00000118473 SGIP1
ENST00000237247.4       ENSG00000118473 SGIP1
ENST00000237247.5       ENSG00000118473 SGIP1
.....
ENST00000423888.999051-I        ENSG00000184319 AC002055.4
ENST00000423888.999052-I        ENSG00000184319 AC002055.4
ENST00000480246.999053-I        ENSG00000184319 AC002055.4
ENST00000480246.999054-I        ENSG00000184319 AC002055.4

Hopefully this helps anybody else who runs into this! And hopefully that tutorial gets fixed!!! Plz

ADD COMMENT
0
Entering edit mode

Thanks! Here's something that should work for generating the index (this is what worked on my end). It involves installing/upgrading kb-python's development release.

pip install git+https://github.com/pachterlab/kb_python@devel --upgrade
kb ref -i mouse_velocity_index.idx -g mouse_velocity_t2g.txt -f1 mouse_velocity_cdna.fa -f2 mouse_velocity_intron.fa -c1 mouse_velocity_cdna_t2c.txt -c2 mouse_velocity_intron_t2c.txt --workflow lamanno \
Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
Mus_musculus.GRCm38.101.gtf.gz
cat mouse_velocity_cdna.fa mouse_velocity_intron.fa > mouse_cDNA_introns.fa
kallisto index -i mouse_cDNA_introns.idx -k 31 mouse_cDNA_introns.fa
ADD REPLY

Login before adding your answer.

Traffic: 2189 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6