How should I make kallisto indexes?
8 months ago
bioinfo ▴ 150

Hello,

I work with bulk RNA-seq data, but I have been using an older version of kallisto. I want to update both the kallisto version and the indexes.

I used to make the indexes like this:

kallisto index -i transcripts.idx transcripts.fasta.gz

as shown at this link: https://pachterlab.github.io/kallisto/starting.html

However, I found the indexes here https://github.com/pachterlab/kallisto-transcriptome-indices which were made like this:

kb ref --workflow=standard -i index.idx -g t2g.txt -f1 cdna.fa \
  --include-attribute gene_biotype:protein_coding \
  --include-attribute gene_biotype:lncRNA \
  --include-attribute gene_biotype:lincRNA \
  --include-attribute gene_biotype:antisense \
  --include-attribute gene_biotype:IG_LV_gene \
  --include-attribute gene_biotype:IG_V_gene \
  --include-attribute gene_biotype:IG_V_pseudogene \
  --include-attribute gene_biotype:IG_D_gene \
  --include-attribute gene_biotype:IG_J_gene \
  --include-attribute gene_biotype:IG_J_pseudogene \
  --include-attribute gene_biotype:IG_C_gene \
  --include-attribute gene_biotype:IG_C_pseudogene \
  --include-attribute gene_biotype:TR_V_gene \
  --include-attribute gene_biotype:TR_V_pseudogene \
  --include-attribute gene_biotype:TR_D_gene \
  --include-attribute gene_biotype:TR_J_gene \
  --include-attribute gene_biotype:TR_J_pseudogene \
  --include-attribute gene_biotype:TR_C_gene \
  genome.fa.gz genome.gtf.gz

Which is the most appropriate way to make a reference for bulk data with kallisto version >=0.50.1? Also, if I use the newer versions from the kallisto website, would it be more appropriate to assign gene names using the t2g.txt file instead of biomart?

Thank you


As an additional note, all of those --include-attribute flags are there to restrict the transcriptome to precisely the items that Cell Ranger (10x Genomics) includes in its STAR index. Those items seem to be a good choice whether the data are bulk, single-cell, or single-nucleus, and whether they come from 10x or some other vendor.
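The long run of --include-attribute flags in the question can be generated from a list of biotypes instead of being typed by hand. The following is a minimal sketch, assuming a POSIX shell. It only prints the resulting command (a dry run) and does not call kb. The biotypes and file names are copied from the command quoted in the question; the variable names are illustrative.

```shell
# Biotypes copied from the kb ref command quoted in the question.
biotypes="protein_coding lncRNA lincRNA antisense
IG_LV_gene IG_V_gene IG_V_pseudogene IG_D_gene IG_J_gene IG_J_pseudogene
IG_C_gene IG_C_pseudogene
TR_V_gene TR_V_pseudogene TR_D_gene TR_J_gene TR_J_pseudogene TR_C_gene"

# Build one "--include-attribute gene_biotype:<type>" pair per biotype.
# $biotypes is left unquoted on purpose so the shell splits it into words.
include_args=""
for b in $biotypes; do
  include_args="$include_args --include-attribute gene_biotype:$b"
done

# Assemble the full command; the file names are the placeholders from the question.
cmd="kb ref --workflow=standard -i index.idx -g t2g.txt -f1 cdna.fa$include_args genome.fa.gz genome.gtf.gz"

# Dry run: print the command instead of executing it.
echo "$cmd"
```

To build the index for real, run the printed command, or call kb ref directly with $include_args left unquoted so that the flags are split into separate words.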

7 months ago
dsull ★ 7.0k

I recommend the kb ref method for two reasons:

  1. The index created will yield more accurate mapping for datasets that have a substantial amount of non-exonic reads (like total RNA-seq).

  2. kb ref can directly give you a reference transcriptome from a genome FASTA and GTF and allow you to exclude regions that you don't want (like pseudogenes). You don't really have as much control over this if you simply download the ENSEMBL cDNA FASTA.

Of note, kb ref calls kallisto index directly under the hood, but conveniently wraps the features mentioned above. In fact, you could recapitulate the exact results of kb ref via kallisto index if you'd like.

The t2g.txt file is an output file generated automatically by kb ref -- there's nothing to supply. (Indeed, the generation of the t2g.txt file is an additional advantage of using kb ref).
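As a rough illustration of using t2g.txt in place of a biomart lookup, the sketch below attaches gene IDs and gene names to transcript-level estimates and sums them per gene. It runs on toy data and makes two assumptions that should be checked against real files: that t2g.txt is tab-separated with the transcript ID, gene ID, and gene name in the first three columns, and that the quantification table has the kallisto quant abundance.tsv layout (target_id, length, eff_length, est_counts, tpm). All IDs and numbers are made up.

```shell
workdir=$(mktemp -d)

# Toy t2g.txt: transcript ID, gene ID, gene name (assumed column layout).
printf 'TX1.1\tGENE1.1\tAAA\nTX2.1\tGENE1.1\tAAA\nTX3.1\tGENE2.1\tBBB\n' > "$workdir/t2g.txt"

# Toy abundance.tsv in the kallisto quant layout.
printf 'target_id\tlength\teff_length\test_counts\ttpm\n' > "$workdir/abundance.tsv"
printf 'TX1.1\t1000\t800\t10\t5\nTX2.1\t2000\t1800\t30\t7\nTX3.1\t1500\t1300\t4\t1\n' >> "$workdir/abundance.tsv"

# Map each transcript to its gene, then sum est_counts and tpm per gene.
awk -F'\t' -v OFS='\t' '
  NR == FNR { gene[$1] = $2; name[$2] = $3; next }   # first file: t2g.txt
  FNR == 1  { next }                                  # skip the abundance.tsv header
  ($1 in gene) { counts[gene[$1]] += $4; tpm[gene[$1]] += $5 }
  END { for (g in counts) print g, name[g], counts[g], tpm[g] }
' "$workdir/t2g.txt" "$workdir/abundance.tsv" | sort > "$workdir/gene_abundance.tsv"

# Columns: gene ID, gene name, summed est_counts, summed tpm.
cat "$workdir/gene_abundance.tsv"
```

On the toy input, the two transcripts of GENE1.1 collapse into a single row with 40 counts. Transcripts that are missing from t2g.txt are silently dropped here, which may or may not be what you want for real data.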

The updated kallisto documentation is here: https://www.biorxiv.org/content/10.1101/2023.11.21.568164v2.full.pdf

I'm still in the process of converting the documentation into an actual website and a better-typeset PDF (but I am still getting feedback from reviewers at the moment).

I'm happy to discuss the improvements of kb ref (or, in general, using the kb-python framework) in more detail in the comments if you'd like.


Thank you. I will read the documentation you attached. Are you going to slowly phase out the commands other than kb ref? Also, is the t2g.txt file created using the t2g.py script from https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/ensembl-96/t2g.py, or has something changed? Thank you so much for your help.


We can't phase them out per se, because kallisto index is still the engine responsible for creating the index under the hood (and therefore will always be available to use). However, moving forward, the recommendation for RNA-seq will be to use kb-python for everything (unless you have a special use case that requires using kallisto directly). If you can use kb-python, please use it.

That t2g.py is not what's responsible for giving you the t2g.txt file. If you look at ref.py in the kb-python repo, that's where the relevant code is.

7 months ago
Bajaj • 0

Hi, you can still make index files with kallisto index -i transcripts.index transcripts.fa (after extracting the .gz file). I have tried it and it works perfectly fine.
