Question

Build a Kallisto transcriptome index

0

6.1 years ago

Farah ▴ 80

Hello,

I need to pseudo-align my paired reads to the transcriptome using Kallisto. I know that Kallisto does not use a reference genome sequence, and instead it performs pseudo-alignment to determine the compatibility of reads with targets (e.g. transcript sequences).

However, to build a Kallisto transcriptome index, how do I choose the reference transcriptomes for my targets, which are human and Cassava Brown Streak Virus?

I mean, when running the commands below to create the Kallisto index from the transcriptome, should I specify which transcriptome I want to use (e.g. human or Cassava Brown Streak Virus)? If so, how do I know which transcriptome is appropriate for my target organisms?

cd
kallisto index -i Potra01-mRNA.idx \
~/share/Day01/data/reference/fasta/Potra01-mRNA.fa.gz

Thank you so much for your advice and guidance. Best wishes

Kallisto RNA-Seq pseudo-alignment • 16k views

ADD COMMENT • link updated 6.1 years ago by Lior Pachter ▴ 720 • written 6.1 years ago by Farah ▴ 80

score 0 · Answer 1 · 2019-07-16

0

6.1 years ago

Lior Pachter ▴ 720

It sounds like your goal is to build an index from both the human and the Cassava Brown Streak Virus transcriptomes at the same time. You can do this by obtaining the transcriptome for each separately, and then building an index using both files: kallisto index -i name.idx human.fa.gz cassava_brown_streak_virus.fa.gz. You can then quantify reads against both simultaneously.
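As a rough sketch of the two steps described above (build one combined index, then quantify paired-end reads against it), something like the following could work. The function names and every file name are hypothetical placeholders, not part of kallisto; only `kallisto index -i` and `kallisto quant -i ... -o ...` are real commands.

```shell
#!/bin/sh
# Sketch only: all file names used here are hypothetical placeholders.

build_index() {
    # usage: build_index INDEX FASTA [FASTA ...]
    index=$1
    shift
    kallisto index -i "$index" "$@"
}

quant_paired() {
    # usage: quant_paired INDEX OUTDIR READS_1 READS_2
    # kallisto quant treats two FASTQ files as a read pair by default.
    kallisto quant -i "$1" -o "$2" "$3" "$4"
}

# Example calls (uncomment once the files exist):
# build_index human_plus_cbsv.idx human.fa.gz cassava_brown_streak_virus.fa.gz
# quant_paired human_plus_cbsv.idx quant_out sample_1.fastq.gz sample_2.fastq.gz
```

Note that a later reply in this thread reports problems with an index built from two gzipped files, so it may be worth checking the resulting index before running the quantification.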

ADD COMMENT • link 6.1 years ago by Lior Pachter ▴ 720

1

The actual FASTA files can be downloaded from public databases such as Ensembl, as described here and here. Look for "cdna" in the file name, since you want to limit yourself to those parts of the genome that correspond to transcribed loci.
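For illustration, here is a small helper that prints the download URL of an Ensembl cDNA file. It assumes the usual layout of the Ensembl FTP site (release-N/fasta/species/cdna/); the helper name and the release number 97 are assumptions for this example, so check the printed URL in a browser before downloading.

```shell
#!/bin/sh
# Sketch only: the release number and the helper name are assumptions.

ensembl_cdna_url() {
    # usage: ensembl_cdna_url RELEASE SPECIES ASSEMBLY
    release=$1
    species=$2      # e.g. Homo_sapiens
    assembly=$3     # e.g. GRCh38
    # The directory name is the species name in lower case.
    species_dir=$(printf '%s' "$species" | tr '[:upper:]' '[:lower:]')
    printf 'https://ftp.ensembl.org/pub/release-%s/fasta/%s/cdna/%s.%s.cdna.all.fa.gz\n' \
        "$release" "$species_dir" "$species" "$assembly"
}

# Print the URL first; download only after checking that it exists.
ensembl_cdna_url 97 Homo_sapiens GRCh38
# curl -O "$(ensembl_cdna_url 97 Homo_sapiens GRCh38)"
```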

ADD REPLY • link 6.1 years ago by Friederike 9.0k

0

Thanks a lot, Friederike, for your guidance. After downloading the transcriptome FASTA files, do I put the name of the downloaded FASTA file in place of the fa.gz file? And what about name.idx?

Many thanks.

ADD REPLY • link 6.1 years ago by Farah ▴ 80

1

I believe Lior was just trying to indicate that you can put whatever name you want the resulting index to have following -i.

I.e., if you want two indices, one for the human, one for the virus cDNA libraries, you will run the command twice:

kallisto index -i my_human_index.idx name_of_the_fasta_file_for_the_human_cDNA_collection.gz # generates the index to be used with the human samples

kallisto index -i my_virus_index.idx name_of_the_fasta_file_for_the_virus_cDNA_collection.gz # generates the index to be used with the virus samples

ADD REPLY • link 6.1 years ago by Friederike 9.0k

0

Many thanks, Friederike. I found FASTA files for human and also for plants, and I built indices for them. However, I could not find a transcriptome FASTA file for Cassava Brown Streak Virus or a closely related species (TAN70 virus). I would highly appreciate it if you could tell me where I can get it.

Many thanks.

ADD REPLY • link 6.1 years ago by Farah ▴ 80

1

Sorry, I've never had to download a viral cDNA index, so I'd have to resort to the usual tools (google etc.) just like you.

ADD REPLY • link 6.1 years ago by Friederike 9.0k

0

OK. Thank you very much Friederike.

ADD REPLY • link 6.1 years ago by Farah ▴ 80

0

Thank you very much, Lior. In fact, I want to build separate indices for human and for the Cassava Brown Streak Virus. I have two different RNA-seq datasets (one for human and another one for Cassava Brown Streak Virus). How can I obtain the transcriptomes for human and for the Cassava Brown Streak Virus separately?

Also, what exactly should I write for name.idx and for the fa.gz file in each case, human and Cassava Brown Streak Virus?

Many thanks for the help.

ADD REPLY • link 6.1 years ago by Farah ▴ 80

0

I tried this first, but it resulted in an index of around 2.3 Gb that failed at the kallisto quant step (with odd errors such as a huge number of equivalence classes, running out of memory, etc.):

kallisto index -i GRch38_GRCm38_cdna.idx Homo_sapiens.GRCh38.cdna.all.fa.gz Mus_musculus.GRCm38.cdna.all.fa.gz

But then I unzipped and concatenated the files and tried again. This gave an index of 4.4 Gb, and it worked at the kallisto quant step:

gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip Mus_musculus.GRCm38.cdna.all.fa.gz
cat Homo_sapiens.GRCh38.cdna.all.fa Mus_musculus.GRCm38.cdna.all.fa > GRch38_GRCm38_cdna.fa
kallisto index -i GRch38_GRCm38_cdna.idx GRch38_GRCm38_cdna.fa

Also, don't forget about 'kallisto inspect'. It was helpful to run it on the new index to check it, without having to do a full kallisto quant run:

kallisto inspect GRch38_GRCm38_cdna.idx
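A variant of the concatenation approach above that keeps the original .gz files untouched is to decompress to standard output instead of in place. This is only a sketch: the helper names are hypothetical, and the header count is just a quick sanity check on the merged file before indexing.

```shell
#!/bin/sh
# Sketch only: merge_fasta and count_records are hypothetical helper names.

merge_fasta() {
    # usage: merge_fasta OUT.fa IN1.fa.gz [IN2.fa.gz ...]
    out=$1
    shift
    # gunzip -c writes the decompressed data to stdout and leaves the inputs as they are.
    gunzip -c "$@" > "$out"
}

count_records() {
    # Count FASTA header lines (lines starting with '>').
    grep -c '^>' "$1"
}

# Example calls (uncomment once the files exist):
# merge_fasta GRch38_GRCm38_cdna.fa Homo_sapiens.GRCh38.cdna.all.fa.gz Mus_musculus.GRCm38.cdna.all.fa.gz
# count_records GRch38_GRCm38_cdna.fa
# kallisto index -i GRch38_GRCm38_cdna.idx GRch38_GRCm38_cdna.fa
# kallisto inspect GRch38_GRCm38_cdna.idx
```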

ADD REPLY • link 5.5 years ago by Dylan Richards • 0