Hello,
I am analysing some RNA-seq data with kallisto and I have created a series of sam files using --pseudobam
in kallisto.
Is there a way of converting these files in a format that can be easily viewed on IGV?
Thank you in advance! :)
Hello,
I am analysing some RNA-seq data with kallisto and I have created a series of sam files using --pseudobam
in kallisto.
Is there a way of converting these files in a format that can be easily viewed on IGV?
Thank you in advance! :)
Per @Devon's answer here, you can use samtools to sort your SAM and convert it to BAM.
samtools view -u alignment.sam | samtools sort -@ 4 - output_prefix
Edit: The question is referring to transcriptomic vs genomic mappings for IGV
There's a python script available here found in this answer from SeqAnswers.
Usage:
python3 kallisto_sam_convertor.py pseudoalignment.sam annotation.gtf | samtools view -bS - | samtools sort - -o <output.bam>
bam should be loadable to IGV.
--genomebam
Introduced in kallisto v0.44.0 is a --genomebam
flag, which has quant
generate a genome aligned BAM. You must provide an annotation GTF file via the --gtf
flag. It is also recommended to include a chrom.sizes file with the --chromsomes
flag.
It might be worth mentioning that I wasn't able to immediately use GENCODE's files for mm10, e.g., gene GTF and transcripts FASTA. The kallisto indexer interpreted the full line from the FASTA read headers as the transcript IDs, but since the headers include much more, the pseduobam procedure couldn't resolve them to a transcript ID in the GTF. It still generated the full BAM, but everything was aligned to "*". Reprocessing the FASTA to only include the transcript ID (everything up to first "|") and reindexing on that got everything working.
Another issue I've come across is a mismatch between the chromosome names in the GTF and those in the chrom.sizes file. Kallisto will throw a warning about not finding chromosomes for transcripts if this is the case. A simple test to see if this is the issue is to run without a --chromosomes
flag, in which case, kallisto defaults to simply using the chromosome names in the GTF.
If you are generating your own GTF, it might be worth having a look at some compatibility guidelines I posted over on the kallisto repository. Simple sorting by seqname + start is not sufficient, e.g. exons must be rank-ordered.
The gencode annotation was the issue for me as well.
The following command edits the gencode fasta read headers to only include the first transcript field:
sed -r 's/^(>[^\|]+)\|.*/\1/g' gencode.v28.transcripts.fa > gencode.v28.transcripts.onlyTxLineHeaders.fa
After reindexing on this fasta, the "mapping to genome" step worked, and the output files had chromosome names and positions included.
It is optional, but if you want an accurate header in the BAM you should use the --chromosomes
flag. I tried running without it, and this resulted in all chromosomes being assigned LN:536870911
in the header. A chrom.sizes file is usually readily available. E.g., simple searching "mm10 chrom.sizes" gives me the file in the first link.
Is this working properly? pseudobam and the script are not working for me.
Hi, the author is here. I wrote this script for one use only (I was not trying to make a very general resource for everybody, but maybe I should).
The problem is that you got differently formatted gtf file and I am using instead of some implemented parsers simply my own. These are lines parsing a gtf file into two dictionaries of transcript pointing to their position in genome (one dictionary for + strand and one for -):
pos_tr = {}
neg_tr = {}
with open(sys.argv[2],'r') as anotationfile:
for entry in anotationfile:
entry = entry.split('\t')
key = entry[8].split('"')[1]
if entry[6] == '+':
pos_tr[key] = entry[3]
else:
neg_tr[key] = entry[3]
The quick solution would be just to find how to read the name of transcript and replace entry[8].split('"')[1]
by by an expression what will get a transcript name for your version of gtf.
I also found that there are two bits, that will rename your reference to S5_genome
. lines 21 - 23 :
if samline[:3] == '@HD':
print(samline, end='')
print("@SQ SN:S5_genome LN:6778833")
and line 30
print('S5_genome', end="\t")
This was not intended as general utility, but a solution for a project during genomics course where we wanted to be able to inspect the job of kallisto (that's why the repo is called after the course : sequence a genome).
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
In addition..
Is that bam as accurate and useful than those obtained with splicing aware mappers like STAR o HISAT ?
Not really, no. As the pseudo alignment will only assign transcriptomic reads that it can definitively say belong to a given transcript, and the abundance estimation then uses an EM (expectation maximisation) function to basically fill in the gaps.