Gathering Files

Question

Visualizing alignment of kallisto .bam files to hg19 (and custom plasmids)

0

Entering edit mode

6.2 years ago

Kristin Muench ▴ 640

Hello,

I'm trying to visualize kallisto read alignment to a human genome plus an additional sequence using IGV. I can produce .bam and .bai files using kallisto, but when I open them in IGV alongside the built-in "hg19" option, I don't see anything:

example IGV window - no alignment where I would expect to see something

Gathering Files

First, I downloaded my reference human genome from here, along with a gtf:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

wget 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/mrna.fa.gz'

Gtf downloaded from the UCSC genome browser using these instructions: hg38 annotation file (gtf) UCSC format

Then, I download my additional sequence, "plas_shp", which looks like this:

head plasmidFasta/plas_shp_up.fa
>plas_shp
GTCGACATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATA
GCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGC
CCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAG
GGACTTTCCATTGACGTCAATGGGTGGAGTATTTACGGTAAACTGCCCACTTGGCAGTAC
ATCAAGTGTATCATATGCCAAGTACGCCCCCTATTGACGTCAATGACGGTAAATGGCCCG
CCTGGCATTATGCCCAGTACATGACCTTATGGGACTTTCCTACTTGGCAGTACATCTACG
TATTAGTCATCGCTATTACCATGGTCGAGGTGAGCCCCACGTTCTGCTTCACTCTCCCCA
TCTCCCCCCCCTCCCCACCCCCAATTTTGTATTTATTTATTTTTTAATTATTTTGTGCAG
CGATGGGGGCGGGGGGGGGGGGGGGGCGCGCGCCAGGCGGGGCGGGGCGGGGCGAGGGGC

Producing Index

I begin by producing my kallisto index. The code looks like this:

#declare Variables                                                                                                                          
humFa="myPath/hg19_transcriptome/*.fa"  
plasFa="myPath/plasmidFasta/*.fa"

 # run index gen
  kallisto index -i /myPath/humPlas_kallisto_transcripts_plasmids_hg19.idx $humFa $plasFa --make-unique

Producing .bam file

I modified the gtf file I downloaded above to add my extra sequences, so it looked like this:

tail myGtf.gtf
chr19_gl000209_random   hg19_refGene    stop_codon  145118  145120  0.000000    +   .   gene_id "NM_012312"; transcript_id "NM_012312"; 
chr19_gl000209_random   hg19_refGene    exon    145079  145743  0.000000    +   .   gene_id "NM_012312"; transcript_id "NM_012312"; 
plas_shp    AddedGenes  exon    1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";

Then I aligned kallisto to the index I produced above using the above:

# run index gen
kallisto quant -i $kallistoIdx -o $outputFileLoc/$sample --genomebam --gtf $myGtf $Read1 $Read2 | samtools view -bS > $sample_align.bam

When I convert the output .bam file to .sam, it looks like this:

NS500127:27:HW5JKBGXY:1:11101:17213:1141    131 *   0   0   76M *   0   0   GGCGTTGCTGCGGGTGGCCTGCGTGTGCCGCTTATGGAGGGAGTGTGTGCGCAGAGTATTGCGGACCCATCGGAGC    AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/EEAEA//E/<A/EEEEA    ZW:f:0
NS500127:27:HW5JKBGXY:1:11101:24672:1142    67  *   0   0   76M *   0   0   TTTGATCTTCTTAGTTTTCTTCTTCTTATCCTTACCGCTGTCATCCTCCTCATCTGAACCCACATCTTCGATCTTG    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAEEEEEEEEEAE/EEEEEEEEEEEE//EEEEA    ZW:f:0

The problem

When I try to open the .bam file in IGV, I don't see any alignments. I've tried doing the above with hg19/GRCh37 and 38 files from different locations. I've tried including and excluding the chromSizes file. I'm not sure what else to try.

Is there anything immediately obviously problematic about my approach?

Is there a viewer that might be more forgiving than IGV when it comes to visualizing kallisto .bam files?

Has anyone successfully managed to visualize their kallisto .bam files in IGV before?

Thank you for your help!

IGV kallisto RNA-Seq • 4.2k views

ADD COMMENT • link updated 6.2 years ago by pmelsted ▴ 130 • written 6.2 years ago by Kristin Muench ▴ 640

score 2 · Answer 1 · 2018-09-09

Everything up to the GTF file was correct. The only thing missing is that kallisto requires a gene entry for each transcript and the exons of each transcript. Also it's important the the field separators for the GTF are TAB characters, where as the key-value entries are separted by space and ;

plas_shp    AddedGenes  gene    1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp";
plas_shp    AddedGenes  transcript  1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp";
plas_shp    AddedGenes  exon    1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";

this line kallisto quant -i $kallistoIdx -o $outputFileLoc/$sample --genomebam --gtf $myGtf $Read1 $Read2 | samtools view -bS > $sample_align.bam is unnecessary since kallisto with --genomebam doesn't write to standard output, but produces a sorted indexed BAM file under outputFolder/pseudoalignments.bam.

After this you should be able to view this using IGV. You need to either index your plasmid fasta file using samtools faidx or you can add them as extra chromosomes to the Human reference (but give them a new chromosome name and modify the GTF and run samtools faidx again).

I tested this on a small example

@r1
TGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGT
+
AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/E
@r2
TGGGCATAATGCCAGGCGGGCCATTTACCGTCATTGACGTCAATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEA

using the head of the plasmid sequence you provided and got the following results kallisto quant -i ind -o out --genomebam --gtf plas.gtf --single -l 200 -s20 r1.fastq and looking at the output samtools view out/pseudoalignments.bam

r1  0   plas_shp    16  255 60M *   0   0   TGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGT    AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/E    ZW:f:1
r2  16  plas_shp    272 255 45M *   0   0   TATTGACGTCAATGACGGTAAATGGCCCGCCTGGCATTATGCCCA   AEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA   ZW:f:1

score 0 · Answer 2 · 2018-09-08

0

Entering edit mode

6.2 years ago

swbarnes2 14k

Your bam has no indication whatsoever what chromosome each read goes to. So of course IGV can't put reads on the human chromosomes.

If you want to look at alignments in IGV, you need to align to the human genome with a real aligner, like STAR, not just count transcripts with kallisto.

ADD COMMENT • link 6.2 years ago by swbarnes2 14k

0

Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html

kallisto can now be used to visualize pseudoalignments w/ IGV. It outputs BAM directly in genome coordinates & produces indexed+sorted BAM so it bypasses the need for Samtools. I believe it's the fastest and simplest workflow for reads -> browser https://t.co/fDtFhMCSdi @pmelsted
— Lior Pachter (@lpachter) January 30, 2018

ADD REPLY • link 6.2 years ago by Kristin Muench ▴ 640

0

Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html

kallisto can now be used to visualize pseudoalignments w/ IGV. It outputs BAM directly in genome coordinates & produces indexed+sorted BAM so it bypasses the need for Samtools. I believe it's the fastest and simplest workflow for reads -> browser https://t.co/fDtFhMCSdi @pmelsted
— Lior Pachter (@lpachter) January 30, 2018

ADD REPLY • link 6.2 years ago by Kristin Muench ▴ 640

0

Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html

kallisto can now be used to visualize pseudoalignments w/ IGV. It outputs BAM directly in genome coordinates & produces indexed+sorted BAM so it bypasses the need for Samtools. I believe it's the fastest and simplest workflow for reads -> browser https://t.co/fDtFhMCSdi @pmelsted
— Lior Pachter (@lpachter) January 30, 2018

ADD REPLY • link 6.2 years ago by Kristin Muench ▴ 640