Visualizing alignment of kallisto .bam files to hg19 (and custom plasmids)
2
0
Entering edit mode
6.2 years ago

Hello,

I'm trying to visualize kallisto read alignment to a human genome plus an additional sequence using IGV. I can produce .bam and .bai files using kallisto, but when I open them in IGV alongside the built-in "hg19" option, I don't see anything:

example IGV window - no alignment where I would expect to see something

Gathering Files

First, I downloaded my reference human genome from here, along with a gtf:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

wget 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/mrna.fa.gz'

Gtf downloaded from the UCSC genome browser using these instructions: hg38 annotation file (gtf) UCSC format

Then, I download my additional sequence, "plas_shp", which looks like this:

head plasmidFasta/plas_shp_up.fa
>plas_shp
GTCGACATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATA
GCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGC
CCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAG
GGACTTTCCATTGACGTCAATGGGTGGAGTATTTACGGTAAACTGCCCACTTGGCAGTAC
ATCAAGTGTATCATATGCCAAGTACGCCCCCTATTGACGTCAATGACGGTAAATGGCCCG
CCTGGCATTATGCCCAGTACATGACCTTATGGGACTTTCCTACTTGGCAGTACATCTACG
TATTAGTCATCGCTATTACCATGGTCGAGGTGAGCCCCACGTTCTGCTTCACTCTCCCCA
TCTCCCCCCCCTCCCCACCCCCAATTTTGTATTTATTTATTTTTTAATTATTTTGTGCAG
CGATGGGGGCGGGGGGGGGGGGGGGGCGCGCGCCAGGCGGGGCGGGGCGGGGCGAGGGGC

Producing Index

I begin by producing my kallisto index. The code looks like this:

#declare Variables                                                                                                                          
humFa="myPath/hg19_transcriptome/*.fa"  
plasFa="myPath/plasmidFasta/*.fa"

 # run index gen
  kallisto index -i /myPath/humPlas_kallisto_transcripts_plasmids_hg19.idx $humFa $plasFa --make-unique

Producing .bam file

I modified the gtf file I downloaded above to add my extra sequences, so it looked like this:

tail myGtf.gtf
chr19_gl000209_random   hg19_refGene    stop_codon  145118  145120  0.000000    +   .   gene_id "NM_012312"; transcript_id "NM_012312"; 
chr19_gl000209_random   hg19_refGene    exon    145079  145743  0.000000    +   .   gene_id "NM_012312"; transcript_id "NM_012312"; 
plas_shp    AddedGenes  exon    1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";

Then I aligned kallisto to the index I produced above using the above:

# run index gen
kallisto quant -i $kallistoIdx -o $outputFileLoc/$sample --genomebam --gtf $myGtf $Read1 $Read2 | samtools view -bS > $sample_align.bam

When I convert the output .bam file to .sam, it looks like this:

NS500127:27:HW5JKBGXY:1:11101:17213:1141    131 *   0   0   76M *   0   0   GGCGTTGCTGCGGGTGGCCTGCGTGTGCCGCTTATGGAGGGAGTGTGTGCGCAGAGTATTGCGGACCCATCGGAGC    AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/EEAEA//E/<A/EEEEA    ZW:f:0
NS500127:27:HW5JKBGXY:1:11101:24672:1142    67  *   0   0   76M *   0   0   TTTGATCTTCTTAGTTTTCTTCTTCTTATCCTTACCGCTGTCATCCTCCTCATCTGAACCCACATCTTCGATCTTG    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAEEEEEEEEEAE/EEEEEEEEEEEE//EEEEA    ZW:f:0

The problem

When I try to open the .bam file in IGV, I don't see any alignments. I've tried doing the above with hg19/GRCh37 and 38 files from different locations. I've tried including and excluding the chromSizes file. I'm not sure what else to try.

Is there anything immediately obviously problematic about my approach?

Is there a viewer that might be more forgiving than IGV when it comes to visualizing kallisto .bam files?

Has anyone successfully managed to visualize their kallisto .bam files in IGV before?

Thank you for your help!

IGV kallisto RNA-Seq • 4.2k views
ADD COMMENT
2
Entering edit mode
6.2 years ago
pmelsted ▴ 130

Everything up to the GTF file was correct. The only thing missing is that kallisto requires a gene entry for each transcript and the exons of each transcript. Also it's important the the field separators for the GTF are TAB characters, where as the key-value entries are separted by space and ;

plas_shp    AddedGenes  gene    1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp";
plas_shp    AddedGenes  transcript  1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp";
plas_shp    AddedGenes  exon    1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";

this line kallisto quant -i $kallistoIdx -o $outputFileLoc/$sample --genomebam --gtf $myGtf $Read1 $Read2 | samtools view -bS > $sample_align.bam is unnecessary since kallisto with --genomebam doesn't write to standard output, but produces a sorted indexed BAM file under outputFolder/pseudoalignments.bam.

After this you should be able to view this using IGV. You need to either index your plasmid fasta file using samtools faidx or you can add them as extra chromosomes to the Human reference (but give them a new chromosome name and modify the GTF and run samtools faidx again).

I tested this on a small example

@r1
TGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGT
+
AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/E
@r2
TGGGCATAATGCCAGGCGGGCCATTTACCGTCATTGACGTCAATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEA

using the head of the plasmid sequence you provided and got the following results kallisto quant -i ind -o out --genomebam --gtf plas.gtf --single -l 200 -s20 r1.fastq and looking at the output samtools view out/pseudoalignments.bam

r1  0   plas_shp    16  255 60M *   0   0   TGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGT    AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/E    ZW:f:1
r2  16  plas_shp    272 255 45M *   0   0   TATTGACGTCAATGACGGTAAATGGCCCGCCTGGCATTATGCCCA   AEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA   ZW:f:1
ADD COMMENT
0
Entering edit mode

Thank you for this extremely helpful answer! I'll give it a try!

ADD REPLY
0
Entering edit mode

Thank you again for your suggested tips. I tried editing my gtf, removing the samview portion of my kallisto quant line, and re-running the pipeline. This time, I did it both the original way (trying to build index from plasmid+ transcriptome) as well as a simplified vdrsion (trying to build index from transcriptome .fa only). Unfortunately, both versions failed to align properly (e.g. sam output very similar to what I posted above with no chromosome names) and I could not visualize them on IGV. I'll focus on figuring out what's wrong with just the basic transcriptome alignment.

Regarding your other suggestion using samtools faidx - could you help me understand how this would enter into the pipeline? I see that it makes .fa.fai files, but if you aren't feeding .fai files into kallisto index, just .fa files (see section "producing index" for example), why would it matter they exist?

Thank you again for your help!

ADD REPLY
0
Entering edit mode
6.2 years ago

Your bam has no indication whatsoever what chromosome each read goes to. So of course IGV can't put reads on the human chromosomes.

If you want to look at alignments in IGV, you need to align to the human genome with a real aligner, like STAR, not just count transcripts with kallisto.

ADD COMMENT
0
Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html

ADD REPLY
0
Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html

ADD REPLY
0
Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html

ADD REPLY

Login before adding your answer.

Traffic: 2732 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6