Question

How should I make kallisto indexes?

8 months ago

bioinfo ▴ 150

Hello,

I work with bulk RNA-seq data, but I have been using an older version of kallisto. I want to update both the kallisto version and the indexes.

I used to make the indexes like this

kallisto index -i transcripts.idx transcripts.fasta.gz

as shown in this link: https://pachterlab.github.io/kallisto/starting.html

However, I found the indexes here https://github.com/pachterlab/kallisto-transcriptome-indices which were made like this:

kb ref --workflow=standard -i index.idx -g t2g.txt -f1 cdna.fa \
  --include-attribute gene_biotype:protein_coding \
  --include-attribute gene_biotype:lncRNA \
  --include-attribute gene_biotype:lincRNA \
  --include-attribute gene_biotype:antisense \
  --include-attribute gene_biotype:IG_LV_gene \
  --include-attribute gene_biotype:IG_V_gene \
  --include-attribute gene_biotype:IG_V_pseudogene \
  --include-attribute gene_biotype:IG_D_gene \
  --include-attribute gene_biotype:IG_J_gene \
  --include-attribute gene_biotype:IG_J_pseudogene \
  --include-attribute gene_biotype:IG_C_gene \
  --include-attribute gene_biotype:IG_C_pseudogene \
  --include-attribute gene_biotype:TR_V_gene \
  --include-attribute gene_biotype:TR_V_pseudogene \
  --include-attribute gene_biotype:TR_D_gene \
  --include-attribute gene_biotype:TR_J_gene \
  --include-attribute gene_biotype:TR_J_pseudogene \
  --include-attribute gene_biotype:TR_C_gene \
  genome.fa.gz genome.gtf.gz

Which is the most appropriate way to make a reference for bulk data with kallisto version >=0.50.1? Also, if I use the newer versions from the kallisto website, would it be more appropriate to assign gene names using the t2g.txt file instead of using biomart?

Thank you

kallisto • 1.2k views

ADD COMMENT • link updated 7 months ago by dsull ★ 7.0k • written 8 months ago by bioinfo ▴ 150

As an additional note, all of those --include-attribute flags were basically there to restrict the transcriptome to precisely the items that Cell Ranger (10x Genomics) includes in its STAR index. Those items seem to be good choices whether the data are bulk, single-cell, or single-nucleus, and whether they come from 10x or some other vendor.

ADD REPLY • link 7 months ago by dsull ★ 7.0k

7 months ago

Bajaj • 0

Hi, you can still use kallisto index -i transcripts.index transcripts.fa (after extracting the .gz file) to make index files. I have tried it and it works perfectly fine.

ADD COMMENT • link updated 7 months ago by GenoMax 148k • written 7 months ago by Bajaj • 0

score 1 · Accepted Answer · 2024-04-29

I recommend the kb ref method for two reasons:

1. The index created will yield more accurate mapping for datasets that have a substantial amount of non-exonic reads (like total RNA-seq).
2. kb ref can directly give you a reference transcriptome from a genome FASTA and GTF and allow you to exclude regions that you don't want (like pseudogenes). You don't really have as much control over this if you simply download the ENSEMBL cDNA FASTA.
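
As a rough sketch of the second point, the long run of --include-attribute flags in the question can be generated from a plain list of biotypes instead of being typed out by hand. The flags, biotypes, and file names below are copied from the command quoted in the question; the function name build_kb_ref_cmd and the variable biotypes are hypothetical. The function only prints the command, so you can inspect it (and trim the biotype list) before running anything.

```sh
# Biotypes to keep, copied from the kb ref command quoted in the question.
biotypes="protein_coding lncRNA lincRNA antisense
IG_LV_gene IG_V_gene IG_V_pseudogene IG_D_gene IG_J_gene IG_J_pseudogene
IG_C_gene IG_C_pseudogene
TR_V_gene TR_V_pseudogene TR_D_gene TR_J_gene TR_J_pseudogene TR_C_gene"

# Hypothetical helper: print (do not run) a kb ref command.
#   $1 = genome FASTA, $2 = genome GTF
build_kb_ref_cmd() {
  cmd="kb ref --workflow=standard -i index.idx -g t2g.txt -f1 cdna.fa"
  # Unquoted on purpose: split the list on whitespace, one flag per biotype.
  for b in $biotypes; do
    cmd="$cmd --include-attribute gene_biotype:$b"
  done
  printf '%s\n' "$cmd $1 $2"
}

build_kb_ref_cmd genome.fa.gz genome.gtf.gz
```

Dropping an entry from the list (for example the pseudogene biotypes) is then a one-word change rather than an edit to a long command line.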

Of note, kb ref calls kallisto index directly under the hood, but conveniently wraps the features mentioned above. In fact, you could recapitulate the exact results of kb ref via kallisto index if you'd like.

The t2g.txt file is an output file generated automatically by kb ref -- there's nothing to supply. (Indeed, the generation of the t2g.txt file is an additional advantage of using kb ref).
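
To illustrate how the t2g.txt output can stand in for a biomart lookup, here is a minimal sketch that sums transcript-level TPM from kallisto's abundance.tsv up to gene names. It assumes t2g.txt is tab-separated with the transcript ID in column 1, the gene ID in column 2, and the gene name in column 3, and that those transcript IDs match the target_id column of abundance.tsv exactly; please check both against your own files. The function name sum_tpm_by_gene is hypothetical.

```sh
# Hypothetical helper: sum TPM per gene name.
#   $1 = t2g.txt written by kb ref
#        (assumed layout: col 1 transcript ID, col 2 gene ID, col 3 gene name)
#   $2 = abundance.tsv from kallisto quant
#        (target_id, length, eff_length, est_counts, tpm)
sum_tpm_by_gene() {
  awk -F'\t' '
    # First file: remember a gene label for each transcript,
    # falling back to the gene ID when the gene name is empty.
    NR == FNR { gene[$1] = ($3 != "" ? $3 : $2); next }
    # Second file: skip the header line.
    FNR == 1 { next }
    # Add this transcript TPM (column 5) to its gene total.
    ($1 in gene) { tpm[gene[$1]] += $5 }
    END { for (g in tpm) printf "%s\t%.2f\n", g, tpm[g] }
  ' "$1" "$2" | LC_ALL=C sort
}
```

A call such as sum_tpm_by_gene t2g.txt abundance.tsv > gene_tpm.tsv would then give one line per gene. Transcripts that are missing from t2g.txt are silently skipped, so it is worth comparing line counts if the totals look low.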

The updated kallisto documentation is here: https://www.biorxiv.org/content/10.1101/2023.11.21.568164v2.full.pdf

I'm still in the process of converting the documentation to an actual website and a better-typeset PDF (but am still getting feedback from reviewers at the moment).

I'm happy to discuss the improvements of kb ref (or, in general, using the kb-python framework) in more detail in the comments if you'd like.