Hello,
I'm trying to visualize kallisto read alignment to a human genome plus an additional sequence using IGV. I can produce .bam and .bai files using kallisto, but when I open them in IGV alongside the built-in "hg19" option, I don't see anything:
Gathering Files
First, I downloaded my reference human genome from here, along with a gtf:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
wget 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/mrna.fa.gz'
Gtf downloaded from the UCSC genome browser using these instructions: hg38 annotation file (gtf) UCSC format
Then, I download my additional sequence, "plas_shp", which looks like this:
head plasmidFasta/plas_shp_up.fa
>plas_shp
GTCGACATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATA
GCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGC
CCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAG
GGACTTTCCATTGACGTCAATGGGTGGAGTATTTACGGTAAACTGCCCACTTGGCAGTAC
ATCAAGTGTATCATATGCCAAGTACGCCCCCTATTGACGTCAATGACGGTAAATGGCCCG
CCTGGCATTATGCCCAGTACATGACCTTATGGGACTTTCCTACTTGGCAGTACATCTACG
TATTAGTCATCGCTATTACCATGGTCGAGGTGAGCCCCACGTTCTGCTTCACTCTCCCCA
TCTCCCCCCCCTCCCCACCCCCAATTTTGTATTTATTTATTTTTTAATTATTTTGTGCAG
CGATGGGGGCGGGGGGGGGGGGGGGGCGCGCGCCAGGCGGGGCGGGGCGGGGCGAGGGGC
Producing Index
I begin by producing my kallisto index. The code looks like this:
#declare Variables
humFa="myPath/hg19_transcriptome/*.fa"
plasFa="myPath/plasmidFasta/*.fa"
# run index gen
kallisto index -i /myPath/humPlas_kallisto_transcripts_plasmids_hg19.idx $humFa $plasFa --make-unique
Producing .bam file
I modified the gtf file I downloaded above to add my extra sequences, so it looked like this:
tail myGtf.gtf
chr19_gl000209_random hg19_refGene stop_codon 145118 145120 0.000000 + . gene_id "NM_012312"; transcript_id "NM_012312";
chr19_gl000209_random hg19_refGene exon 145079 145743 0.000000 + . gene_id "NM_012312"; transcript_id "NM_012312";
plas_shp AddedGenes exon 1 12915 . + 0 gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";
Then I aligned kallisto to the index I produced above using the above:
# run index gen
kallisto quant -i $kallistoIdx -o $outputFileLoc/$sample --genomebam --gtf $myGtf $Read1 $Read2 | samtools view -bS > $sample_align.bam
When I convert the output .bam file to .sam, it looks like this:
NS500127:27:HW5JKBGXY:1:11101:17213:1141 131 * 0 0 76M * 0 0 GGCGTTGCTGCGGGTGGCCTGCGTGTGCCGCTTATGGAGGGAGTGTGTGCGCAGAGTATTGCGGACCCATCGGAGC AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/EEAEA//E/<A/EEEEA ZW:f:0
NS500127:27:HW5JKBGXY:1:11101:24672:1142 67 * 0 0 76M * 0 0 TTTGATCTTCTTAGTTTTCTTCTTCTTATCCTTACCGCTGTCATCCTCCTCATCTGAACCCACATCTTCGATCTTG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAEEEEEEEEEAE/EEEEEEEEEEEE//EEEEA ZW:f:0
The problem
When I try to open the .bam file in IGV, I don't see any alignments. I've tried doing the above with hg19/GRCh37 and 38 files from different locations. I've tried including and excluding the chromSizes file. I'm not sure what else to try.
Is there anything immediately obviously problematic about my approach?
Is there a viewer that might be more forgiving than IGV when it comes to visualizing kallisto .bam files?
Has anyone successfully managed to visualize their kallisto .bam files in IGV before?
Thank you for your help!
Thank you for this extremely helpful answer! I'll give it a try!
Thank you again for your suggested tips. I tried editing my gtf, removing the samview portion of my kallisto quant line, and re-running the pipeline. This time, I did it both the original way (trying to build index from plasmid+ transcriptome) as well as a simplified vdrsion (trying to build index from transcriptome .fa only). Unfortunately, both versions failed to align properly (e.g. sam output very similar to what I posted above with no chromosome names) and I could not visualize them on IGV. I'll focus on figuring out what's wrong with just the basic transcriptome alignment.
Regarding your other suggestion using samtools faidx - could you help me understand how this would enter into the pipeline? I see that it makes .fa.fai files, but if you aren't feeding .fai files into kallisto index, just .fa files (see section "producing index" for example), why would it matter they exist?
Thank you again for your help!