Hi,
I need to analyze a project about aberrant splicing in mRNA-seq (3 cases vs 3 controls). We are interested in detecting aberrant transcription events like intron retention, exon skipping, and alternative 5'/3' splice sites.
I've been recently learning about pseudoalignment tools for transcript quantification like kallisto and salmon. However, I couldn't find anywhere information about their usefulness for non-annotated isoforms and aberrant splicing events.
I've found a couple of tools able to analyze aberrant splicing: rMATS (https://rnaseq-mats.sourceforge.net/) and FRASER (https://www.nature.com/articles/s41467-020-20573-7).
Analyses
What we would like to analyze is:
- Differentially Expressed Genes
- Differentially Expressed Transcripts
- Differentially Expressed Exons
- Aberrant splicing
Tools:
I know how to analyze DE Genes with STAR+htseq-count+DESeq2, and I am learning how to do it with kallisto+DESeq2 and kallisto+swish.
I'm learning how to do DE Transcripts with kallisto+swish.
I know hot to analyze DE Exons with STAR + htseq-count + DEXSeq. Can DEXSeq do exon analysis using kallisto's input?
Aberrant splicing: from what I read from the manuals, rMATS uses either fastq or aligned bams (STAR), while FRASER uses bams as input.
UMIs
The sequencing that is currently being generated includes UMIs for the bulk RNA-seq. I know how to process them for alignments. However, all I find about UMIs in kallisto only mentions single-cell RNA-seq, and only on single-paired reads.
Robustness
As you see, we have quite a few research questions and analyses. For the sake of robustness, I would like to use the same input files and "compatible"/"equivalent" software whenever it is possible, instead of starting from scratch on each step.
Questions
We are in a hurry because of a competing paper and I am trying to prepare the analyses ahead of time to anticipate pitfalls and minimize delays before we get the sequencing data. I have a few doubts and questions that I would appreciate if someone could address:
- What would you recommend as tools to use for each of the analyses?
- Do you have a better recommendation for the aberrant splicing analysis?
- Is kallisto much better than STAR+htseq-count for gene-level quantification? (I suspect yes)
- Can kallisto work with UMIs and paired-end reads (R1, R2, R3.fastq.gz) ? (I don't see such thing in the manual)
- Will the DEG results be significantly different when run with/without UMIs?
- What would be the "less bad" way to compare DE-genes with DE-transcripts? STAR (with UMIs)+htseq+DESeq2 (genes) vs kallisto(no UMIs)+swish (transcripts) or kallisto(no UMIs)+swish for both genes&transcripts ?
- Can Kallisto run exon-level analysis? If not, what is "less bad", DESeq2 (genes, with UMIs) vs swisch (tx, no UMIs) vs DEXSeq (exon, with UMIs), or swish (genes+tx, noUMIs) vs DEXSeq (exon, with UMIs) ?
- If I cannot use UMIs on each analysis, should I just ignore them and not use them?
Do you have any extra suggestion on how to improve and speed up this analysis?
Thanks in advance,
Txema
I'll be honest: Bulk RNAseq with UMIs is very uncommon and has not been benchmarked (really by any tool). There are many considerations for processing such data.
It is definitely very possible to run kallisto with UMIs on bulk data (it's not documented) but it'll require some work and a few options to consider. Happy to talk more about it. That said, I'm not sure how concordant your results will be between UMI-based quantification and non-UMI-based quantification. I tried this with smart-seq3 (which has UMI-based reads as well as non-UMI reads) and observed poor concordance (though a lot of downstream results were pretty similar, albeit less so than I wanted). It might actually be better to NOT use the UMIs because paired-end bulk RNAseq based quantification (without UMIs) is, albeit not perfect, well-established in the field. Your data actually is really interesting (paired-end bulk with UMIs) and I'd really love to take a closer look myself once you publish it.
You can also check whether UMIs really make a difference (if a ton of reads in your dataset have the same UMI sequence, then there's some amplification bias). It's (like with all of RNAseq) very dataset/library dependent.
Apologies for this sort of nonanswer but the tl;dr is that your analysis is more complicated than one might think. If anything, I'd try out a couple of different tools and a few different options for each tool and check the agreement, and also using different tools for different purposes is not a bad idea either.
UMIs in bulk RNA-seq are used differently than in single-cell RNA-seq. (I only learned this last week after asking here in a different post).
In bulk, UMIs are added after fragmentation but before amplification. Thus, each fragment of the same original molecule will carry a different UMI barcode. However, they work wonders to remove PCR duplicates. I have processed data from a few projects with them.
These are the sum of counts in all genes from htseq-count:
The blue-green-gray-black samples are (poly-A enriched) mRNA-seq with target sequencing ~20M. The red-orange-pink are total RNA-seq (rRNA depleted) target ~60M reads.
I was mistaken in my original post. For this project (aberrant splicing) they used a different sequencing facility without UMIs.
However, my point/question stands for the many other projects I am currently working with.
Super interesting -- I've seen that sort of library prep for a few single-cell technologies as well. Makes sense because full-length-spanning reads cannot be 5'/3' end-tagged.
I suppose we should call these unique fragment identifiers (UFIs) instead of UMIs. Nonetheless, I think the UMI counting procedure used by kallisto and other tools should work just fine, but keeping in mind that longer genes will get more counts because they generate more fragments (at least amplification bias will be removed though).
Keep me posted on this, but kallisto can definitely handle this type of data (bustools can do the UMI collapsing and kallisto can leverage the final UMI count matrix to do its traditional quantification with effective length normalization) and this data would be super interesting to look at.
Feel free to reach out to me for this type of data -- I am one of the main contributors to the kallisto project (twitter handle in bio) and we can certainly figure out an optimal workflow to easily handle this type of data. :)
Hi,
I'm finally dealing with this type of data: UMI (or Unique Fragment Identifier) bulk RNA. I am using kallisto-bus, but I am having trouble to make it run.
My input samples come split into 1-5 lanes/libraries/fastq files. Each of these blobs comes in 3 fastq.gz files: R1 (first read of pair), R2(UMI), R3 (second read of pair). I used zcat + umitools to collapse all the libraries of my samples into a single "fastq" per sample, and using UMItools to add the UMI to the read header. This way I end up having only a pair of R1 / R2 fastq.gz files per sample.
I tried running kb with:
However, it gives this error message:
I cannot find anywhere in the documentation of kallisto what are those 4 files supposed to be. Should I be using 2 separate files for the reads and the UMIs? In which order should they be?
Thanks in advance
First thing: kallisto cannot interface with UMI-tools (kallisto doesn't use read headers and therefore you're stuck with "the kallisto way" of handling UMIs). I'm a bit busy with other developments to kallisto right now, but I might consider adding UMI-tools interfacing with kallisto, especially if it's a desired feature. For now, if you really want to use UMI-tools, you should take those read header sequences and put them back into a separate FASTQ file. If you have 3 FASTQ files, that's fine -- kallisto can handle that.
OK, now comes the tricky part if you want to give it a try (I just tried it and it seems to work on a minimal example; I'll document it once I streamline the workflow a bit):
Install kb_python (the "D" version)
Install kallisto (the "D" version) from source:
Recreate the index using the "D" version of kb-python and of kallisto (you need to supply genome.fasta and genome.gtf):
Now, run the quantification (where R1 = first read in pair, R2 = second read in pair, R3 = 8-bp UMI)
Your results will be in output_dir/counts_unfiltered/. Now let's use the EM algorithm to get estimated counts:
Your output of interest will be the file: final_output_dir/matrix.abundance.tpm.mtx (this contains your estimated TPMs as determined by the probablistic EM algorithm). The second column (starting from the third line) of the mtx file contains your transcript numbers with their associated counts (shown in the third column). The transcript numbers correspond to the line numbers (one-indexed) in final_output_dir/transcripts.txt.
You can also get gene-level abundances the same way (look at final_output_dir/matrix.abundance.gene.tpm.mtx) and then look at the gene line numbers in final_output_dir/genes.txt.
Yikes! I will try to streamline this workflow soon! Still a work in progress but I think the output will be consistent -- just need to make it user-friendly. Happy to explain any of the commands that were run or any of the files that were generated. If you notice anything funky, message me.
If it ends up working and you want to describe the methods in the manuscript, just write out the code above and say: "we used kb_python commit number: 73c62d77e894a39f2fd99370878ce895984186b8, and kallisto (version 0.49.0) commit 51a22e92e9c81203c53a596a8e3f95ae85e77865 and bustools (version 0.42.0)" so others can reproduce the results.
Thanks!
I finally went around running all this. It worked smoothly.
Could you guide me how to build "traditional" abundance.tsv files parting from "matrix.abundance.tpm.mtx" files so I can load the data with tximport? Is there a built-in method or should I manually build them from the output files?
Thanks!
Edit: is TPM even a valid metric after UMI deduplication? Why should I normalize by total number of reads when the PCR-amplification-bias "should" be gone? Is this just to compare between samples when merging them and then I should skip the normalization steps in DESeq2? Also, is the non-tpm file length-corrected? Why don't they give integer counts?
Glad to hear!
Hmm, you need to manually build abundance.tsv files -- I don't have code written for that yet. Alternatively, you can just load things into a data frame and use tximport (type="none") by manually specifying the columns and data -- that's probably the easiest route to take. You have the estimated counts matrix, you have the TPM matrix, you have the transcript+gene names, and you have the transcript lengths (by looking at the cdna.fa file and counting the number of bp's for each transcript).
So... the "matrix.abundance.mtx" file is equivalent to the "est_counts" column in the abundance.tsv file, no?
In the very end, I am interested in comparing the gene counts obtained with this method vs using a log1p transformation on the (DESeq2) normalized counts (which originated from htseq-count).
Can kallisto's estimated counts be (more or less) directly compared with that other measurement? Or are both kallisto's estimated counts and TPM corrected by transcript lenght? (or effective tx length).
Thanks!
Correct, the matrix.abundance.mtx is the estimated counts in abundance.tsv.
kallisto's TPM is corrected by effective transcript length.
kallisto's estimated counts are not length-corrected (but length correction doesn't matter for DESeq2).
tximport is still the best method for gene-level analysis so if you can feed the counts into tximport, that would be best (at least for gene-level analysis).
I can try introducing a new commit into kallisto that recapitulates the abundance.tsv structure from the mtx files sometime this week or next week to make it more usable with tools like tximport/swish/sleuth/etc..
Hi! In the latest commit of kallisto-D, I added an options to quant-tcc:
Add
--matrix-to-files
if you want everything outputted into the abundance TSV format expected by other programs.Note: I haven't tested this yet (taking several days off right now) -- so if you give this a try and if you encounter any bugs, let me know and I'll fix it as soon as I get back.
Thanks! I went back to this project last Friday and, by pure coincidence, I saw you added it on a commit just 10h before I checked that.
It works wonders, thanks!
Hi, sorry to bother you again.
I see now that this last version is not generating bootstrapping files when using "-b 30", neither in .h5 nor in plaintext. Even when the output messages say that it will do it:
Is this a bug or is this feature not implemented yet?
Hello again!
In order for it to be outputted as plaintext, supply the
--plaintext
option when runningquant-tcc
.In order for it to be outputted as HDF5, you need to recompile kallisto with HDF5 support:
-DUSE_HDF5=ON
i.e.
With the
--plaintext
option, you get abundance TSV files and a bunch of bs_abundance TSV files (for the bootstrap).Without the --plaintext option, you still get an abundance TSV file but the HDF5 file will only be generated if kallisto was compiled with
-DUSE_HDF5=ON
. Note that all bootstraps are incorporated into a single HDF5 file.Nonetheless, I've updated the Github just now to print out more clear messages.
Hi, thanks for your swift response.
Unfortunately I am still having trouble with this. I downloaded and compiled the latest version with HDF5 support. However, I am getting the exact same output file structure whenever I try to use --plaintext or not:
Command:
Output messages:
Output directory:
As you can see, there is no .h5, nor bs_abundance files in neither of the cases.
I tried to run the same when compiling with/without
-DUSE_HDF5=ON
and also using the previous commit of the code. In all cases I get the exact same output file structure. No .h5, no bs_abundanceAm I running this with the wrong command? is
-b 30
not enough to trigger bootstrapping?You need to run
quant-tcc
with the--matrix-to-files
optionOh gosh
I glanced over the original message where you first stated that
Thank you so much!
I ran
quant-tcc
, Using eithermatrix.abundance.mtx
ormatrix.abundance.tpm.mtx
Can we load it as a normal count matrix and create a seurat object for further filtering, scaling, normalization etc? Which matrix file would give better accuracy on the transcript expression ? I would like to evaluate gene isoforms expression across specific cell clusters. Do I still need to process first with tximport ?