Hello,
I work with bulk RNA-seq data, but I have been using an older version of kallisto. I want to update both the kallisto version and the indexes.
I used to make the indexes like this:
kallisto index -i transcripts.idx transcripts.fasta.gz
as shown here: https://pachterlab.github.io/kallisto/starting.html
However, I found the indexes at https://github.com/pachterlab/kallisto-transcriptome-indices, which were made like this:
kb ref --workflow=standard -i index.idx -g t2g.txt -f1 cdna.fa \
--include-attribute gene_biotype:protein_coding \
--include-attribute gene_biotype:lncRNA \
--include-attribute gene_biotype:lincRNA \
--include-attribute gene_biotype:antisense \
--include-attribute gene_biotype:IG_LV_gene \
--include-attribute gene_biotype:IG_V_gene \
--include-attribute gene_biotype:IG_V_pseudogene \
--include-attribute gene_biotype:IG_D_gene \
--include-attribute gene_biotype:IG_J_gene \
--include-attribute gene_biotype:IG_J_pseudogene \
--include-attribute gene_biotype:IG_C_gene \
--include-attribute gene_biotype:IG_C_pseudogene \
--include-attribute gene_biotype:TR_V_gene \
--include-attribute gene_biotype:TR_V_pseudogene \
--include-attribute gene_biotype:TR_D_gene \
--include-attribute gene_biotype:TR_J_gene \
--include-attribute gene_biotype:TR_J_pseudogene \
--include-attribute gene_biotype:TR_C_gene \
genome.fa.gz genome.gtf.gz
Which is the most appropriate way to make a reference for bulk data with kallisto version >=0.50.1? Also, if I use the newer versions from the kallisto website, would it be more appropriate to assign gene names using the t2g.txt file instead of BioMart?
Thank you
As an additional note, all those --include-attribute options are there to restrict the transcriptome to precisely the biotypes that Cell Ranger (10x Genomics) includes in its STAR index. Those biotypes seem to be a good choice whether the data are bulk, single-cell, or single-nucleus, and whether they come from 10x or some other vendor.
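To see what such a biotype filter would keep or drop in a given annotation, a rough sketch like the following may help. It does not call kb or kallisto; it only tallies "gene" records by their gene_biotype attribute using awk. The toy GTF, the gene_line helper, and the shortened keep list are made up for illustration, and the attribute name is assumed to be gene_biotype, as in the kb ref command above (some annotations use a different attribute name, such as gene_type).

```shell
#!/bin/sh
set -eu

# Hypothetical toy annotation; in practice point "gtf" at a decompressed genome.gtf.
workdir="$(mktemp -d)"
gtf="$workdir/toy.gtf"

# Helper (illustrative only): write one GTF "gene" record with the given ID and biotype.
gene_line() {
  printf 'chr1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "%s"; gene_biotype "%s";\n' "$1" "$2"
}

{
  gene_line G1 protein_coding
  gene_line G2 lncRNA
  gene_line G3 rRNA
  gene_line G4 protein_coding
  gene_line G5 TR_C_gene
  gene_line G6 processed_pseudogene
} > "$gtf"

# Shortened keep list for the toy example; the full list is the set of
# --include-attribute values in the kb ref command.
keep="protein_coding lncRNA lincRNA antisense TR_C_gene"

# Tally gene records by biotype and mark each biotype as kept or dropped.
summary="$(awk -F'\t' -v keep="$keep" '
  BEGIN { n = split(keep, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
  $3 == "gene" && match($9, /gene_biotype "[^"]+"/) {
    # Strip the leading 14 characters (gene_biotype, space, quote) and the closing quote.
    bt = substr($9, RSTART + 14, RLENGTH - 15)
    status = (bt in ok) ? "kept" : "dropped"
    count[status "\t" bt]++
  }
  END { for (k in count) print k "\t" count[k] }
' "$gtf" | LC_ALL=C sort)"

printf '%s\n' "$summary"
rm -rf "$workdir"
```

Each output line has three tab-separated fields: kept or dropped, the biotype, and the number of genes. Running this on the real GTF before building the index gives a quick sense of how much of the annotation the filter removes.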