Question

indexing the drosophila genome for kallisto

0

4 months ago

gogeni5529 ▴ 50

I want to index the drosophila genome to run kallisto (v. 0.50).

I was wondering which fasta file I should use, when downloading it from the Ensembl FTP site - the dna.toplevel fasta or the cdna.all fasta?

The command then would be e.g. kallisto index -i Dme.BDGP6.46.idx Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz if taking the dna file.

Is that correct?

index kallisto • 497 views

ADD COMMENT • link 4 months ago by gogeni5529 ▴ 50

0

4 months ago

dsull ★ 6.9k

You should use the cdna one.

kallisto index works on the reference TRANSCRIPTOME (i.e. cdna), not genome. The toplevel dna file is the genome so that's not what you want.
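For the plain kallisto route, a minimal sketch of the index call might look like the following. The cDNA file name is an assumption that follows the Ensembl naming pattern mentioned in the question (cdna.all), and the index name is simply the one proposed there; adjust both to whatever you actually downloaded.

```shell
# Assumed file names: the cDNA FASTA follows the Ensembl "cdna.all" pattern
# from the question; the index name is the one the question proposed.
CDNA="Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz"
INDEX="Dme.BDGP6.46.idx"

# Only run kallisto if it is installed and the FASTA is present;
# otherwise just print the command that would be run.
if command -v kallisto >/dev/null 2>&1 && [ -f "$CDNA" ]; then
  kallisto index -i "$INDEX" "$CDNA"
else
  echo "would run: kallisto index -i $INDEX $CDNA"
fi
```

The only change relative to the command in the question is the input file: the transcriptome (cDNA) FASTA instead of the dna.toplevel genome FASTA.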

ADD COMMENT • link 4 months ago by dsull ★ 6.9k

0

But if you look at the examples given by the people who created the example indices, they also took the dna and not the cdna to create the index in version 0.50.

see kallisto-transcriptome-indices

The files they are using are all dna primary assembly files, no cdna.

ADD REPLY • link 4 months ago by gogeni5529 ▴ 50

1

I was the one who created those indices and I replied to you on the GitHub issues. My reply is reproduced below:

These indices were created by kb-python: kb-python takes in the GENOME fasta and extracts a TRANSCRIPTOME fasta from it (and then calls the kallisto index command on the TRANSCRIPTOME fasta that it just extracted).

If you don't use kb-python and simply stick with calling the kallisto index command, then you should be using the TRANSCRIPTOME fasta. The kallisto index command always uses the TRANSCRIPTOME fasta.

ADD REPLY • link 4 months ago by dsull ★ 6.9k

score 1 · Accepted Answer · 2024-07-05

Thanks dsull for the explanation, that makes sense. I managed to create the Drosophila index using the following command (copied from your GitHub repository):

kb ref --workflow=standard -i index.idx -g t2g.txt -f1 Drosophila_melanogaster.BDGP6.46.cdna.fa \
  --include-attribute gene_biotype:protein_coding \
  --include-attribute gene_biotype:lncRNA \
  --include-attribute gene_biotype:lincRNA \
  --include-attribute gene_biotype:antisense \
  --include-attribute gene_biotype:IG_LV_gene \
  --include-attribute gene_biotype:IG_V_gene \
  --include-attribute gene_biotype:IG_V_pseudogene \
  --include-attribute gene_biotype:IG_D_gene \
  --include-attribute gene_biotype:IG_J_gene \
  --include-attribute gene_biotype:IG_J_pseudogene \
  --include-attribute gene_biotype:IG_C_gene \
  --include-attribute gene_biotype:IG_C_pseudogene \
  --include-attribute gene_biotype:TR_V_gene \
  --include-attribute gene_biotype:TR_V_pseudogene \
  --include-attribute gene_biotype:TR_D_gene \
  --include-attribute gene_biotype:TR_J_gene \
  --include-attribute gene_biotype:TR_J_pseudogene \
  --include-attribute gene_biotype:TR_C_gene \
  Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa Drosophila_melanogaster.BDGP6.46.112.gtf
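
One way to sanity-check the result is to compare the number of transcripts in the cDNA FASTA that kb ref wrote (the -f1 file) with the number of rows in the transcript-to-gene map (the -g file). This is only a sketch: it assumes the output names from the command above, assumes t2g.txt holds one line per transcript, and count_fasta_records is a helper name made up for illustration.

```shell
# Hypothetical helper: count FASTA records by counting header lines (">").
count_fasta_records() {
  grep -c '^>' "$1"
}

CDNA="Drosophila_melanogaster.BDGP6.46.cdna.fa"   # -f1 output of kb ref
T2G="t2g.txt"                                     # -g output of kb ref

if [ -f "$CDNA" ] && [ -f "$T2G" ]; then
  n_tx=$(count_fasta_records "$CDNA")
  n_map=$(wc -l < "$T2G" | tr -d ' ')
  echo "transcripts in FASTA: $n_tx"
  echo "rows in t2g map:      $n_map"
else
  echo "expected kb ref outputs not found in the current directory"
fi
```

If the two numbers differ a lot, it is worth checking that the genome FASTA and the GTF come from the same Ensembl release.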