Hi,
I have been walking through the tutorials at https://www.kallistobus.tools/velocity_tutorial.html and https://www.kallistobus.tools/velocity_index_tutorial.html, and have run into a few issues including new versions of bustools using slightly altered commands than what the tutorial specifies.
Even after untangling some of these issues and getting to a point where I think this should be working, I'm still running into trouble with actually getting count to output the data I would expect.
For reference, I first ran:
kallisto bus -i ../velocity_index/cDNA_introns.idx -o bus_output_06/ -x 10xv2 -t 4 TYR4/TYR4_0_1_HC3HMDSXY/bamtofastq_S1_L004_R1_001.fastq.gz TYR4/TYR4_0_1_HC3HMDSXY/bamtofastq_S1_L004_R2_001.fastq.gz
This produces the output.bus, matrix.ec and transcripts.txt files in the bus_output_06 folders. Then, we have to run
bustools correct -w ../10xv2_whitelist.txt -p output.bus | bustools sort -o output.correct.sort.bus -t 4 -
This yields the output.correct.sort.bus files. Using that, then I run:
bustools capture -s -o cDNA_capture.bus -c cDNA_transcripts.to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus
bustools capture -s -o introns_capture.bus -c introns_transcripts.to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus
Followed by the count step:
bustools count -o u -g cDNA_introns_t2g.txt -e matrix.ec -t transcripts.txt --genecounts cDNA_capture.bus
bustools count -o s -g cDNA_introns_t2g.txt -e matrix.ec -t transcripts.txt --genecounts introns_capture.bus
But I get empty outputs:
-rw-r--r-- 1 ekofman ---- 607K Oct 7 07:51 s.barcodes.txt
-rw-r--r-- 1 ekofman ---- 0 Oct 7 07:51 s.genes.txt
-rw-r--r-- 1 ekofman ---- 114 Oct 7 07:51 s.mtx
drwxr-xr-x 2 ekofman ---- 2 Oct 6 11:33 tmp
-rw-r--r-- 1 ekofman ---- 52M Oct 5 17:01 transcripts.txt
-rw-r--r-- 1 ekofman ---- 618K Oct 19 12:25 u.barcodes.txt
-rw-r--r-- 1 ekofman ---- 0 Oct 19 12:25 u.genes.txt
-rw-r--r-- 1 ekofman ---- 114 Oct 19 12:25 u.mtx
So! What could be going wrong? My first thought was maybe the transcripts are mismatched, from the velocity_index_tutorial page.
So let's check:
My cDNA_introns_t2g.txt file looks like this, with 2244276 lines:¶
ENST00000237247 ENSG00000118473 SGIP1
ENST00000237247 ENSG00000118473 SGIP1
ENST00000237247 ENSG00000118473 SGIP1
My matrix.ec looks like this, with 12216809 lines:¶
0 0
1 1
2 2
3 3
4 4
My transcripts.txt file looks like this:
ENST00000237247.1
ENST00000237247.2
ENST00000237247.3
ENST00000237247.4
My cDNA_capture.bus (converted to text, 1663579 lines) looks like this:
AAACCTGAGTTAAGTG CCGAGCTCAA 2467401 1
AAACCTGAGTTAAGTG CGCTTTAGTA 11871347 1
AAACCTGAGTTAAGTG CGGCACGAGG 11781147 1
AAACCTGAGTTAAGTG CGGGTTTCAC 590261 1
I honestly have no clue what is wrong. I intended to use transcripts without version IDs, and I think that the increasing number after the decimal place is supposed to be there, added during the index-making stage as specified by the tutorial. I feel like I'm very close to breaking through on this and finally getting results, but the tutorial being out of date and hard to debug makes it tough! It would be great to hear from anybody else who maybe has dealt with similar issues and might know what to do about this. Thanks!
Hmm, something's off with your cDNA_introns_t2g.txt file.
If you look at the example cDNA_introns.t2g.txt.gz file supplied in the tutorial, it is formatted with version numbers AND has special designations for intronic regions (e.g. ENST00000417324.21-I).
Can you show how you're building the index?
I'd recommend following the notebook tutorials here: https://www.kallistobus.tools/tutorials
@dsull sorry for the delay
I actually followed along exactly with the index-building tutorial, except that I had to alter a few steps:
I actually had to use
I also had to change the introns command...
also this command was different (re-headering the cDNA fasta)
In the tutorial, they specify that the cDNA_introns_t2g.txt file should be generated as follows:
But based on the tutorial, it doesn't seem like introns_t2g.txt is supposed to have the "-I" labels...
But you're right that their example file does have the "-I" labels. That's annoying...