Entering edit mode
21 months ago
txema.heredia
▴
190
Hi,
I've been passed down some bam files from RNA-seq that I need to analyze. They were generated with pseudoalignments using Kallisto ( https://pachterlab.github.io/kallisto/about ) When I run htseq-count one specific bamfile crashes on a specific read:
Error occured when processing input (record #149548485 in file 103805-016-011.kallisto.pseudoalignments.bam):
Expected str, got NoneType [Exception type: TypeError, raised in _HTSeq.pyx:60]
After delving deeper I realized that the next read (or that specific one if it is 0-based) has some very weird formatting, at least to me:
[149548484] A00379:576:H7WK3DSX3:4:1101:5421:1016 77 * 0 0 * * 0 0 GNTCTTTTAAAAAGAGATTAAACCGAAGGTGATTAAAAGACCTTGAAATCCATGACGCAGGGAGAATTGCGTCATTTAAAGCCTAGTTAACGCATTTACTAAACGCAGACGAAAATGGAAAGATTAATTGGGAGTGGTAGGATGAAACAAT F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[149548485] A00379:576:H7WK3DSX3:4:1101:5421:1016 141 * 0 0 * * 0 0 CTTCCTACTTTTCAGGTTTAAATTTATCTTTTTTCTTCTAAAAGTATGTTTTTATCTTCTAATTTCCCTATCTTCTCTATTCTTTTCTTCGCCTTCCCGTACTTCTGTCTTCCAGTTTTACACTTCAAACTTCTATCTTCTCCAAATTGTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF,FFFFFFF,FFFFF:FFF:FF,FFF:FF,F,FFFFF,:FF
[149548486] A00379:576:H7WK3DSX3:4:1101:13702:1016 67 * 0 0 151M * 0 0 ANAAATCTAGGCTCCATCAACACTGAATTGCAAGATGTGCAGAGGATCATGGTGGCCAATATTGAAGAAGTGTTACAACGAGGAGAAGCACTCTCAGCATTGGATTCAAAGGCTAACAATTTGTCCAGTCTGTCCAAGAAATACCGCCAGG F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF ZW:f:0
[149548487] A00379:576:H7WK3DSX3:4:1101:13702:1016 131 * 0 0 151M * 0 0 TATTTTCAGGAAACTGAGCTCACAGAGATGTGTATTAGAATCCAAGTGGAACTTCTGCCTCTAAAGACCTTGCAAGAAAAGAGATGCCCTGAAAATGAAAGGTTGCACCTCATTTAATGAAGCTTAACCCTATGTAGAAAGTCTCTTTCGG F:FFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFF:FFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF,FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ZW:f:0
[149548488] A00379:576:H7WK3DSX3:4:1101:14913:1016 77 * 0 0 * * 0 0 ANTATAACAAACCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCAGTACTGATCATTCTATTTCCCCCTCTATTGATCCCCACCTCCAAATATCTCATCAACAACCGACTA F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,::,F::,FF:F:FFFFF,FF:,,::FF:F,F:,,:FFFFFF:F:F::F:FFFFFF,
[149548489] A00379:576:H7WK3DSX3:4:1101:14913:1016 141 * 0 0 * * 0 0 GTTTATAGATAGTTGGGTGGTTGGTGTAAATGAGTGAGGCAGGAGTCCGAGGAGGAGGTTAGTTGTGGCAATAAAAATGATTAAGGATACTAGTATAAGAGATCAGGTTCGTCCTTTAGTGTTGTGTATGGTTATCATTTGTTTTGAGGTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:F:FFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFFFFFFFF::FFFF:F:FFFFFFFFFFFF::FFF,FFFFFFFFFFFFFFFFFFFF:FFFFFFFF,:F
The read pair at lines 149548486-149548487 are flagged as being mapped, but they aren't mapped to any known sequence (RNAME is *).
How can I prevent such error?
I cannot help with that error in particular (@dsull is involved in kallisto, he might help here) but more generally: kallisto BAM files have coordinates in transcriptome space afaik (as reads were quantified against a transcript fasta file) but HTseq uses a GTF for guidance, and that is probably in genome coordinate space. Are you aware of that? Don't you have the original kallisto quant output? It's a bit unusual to quantify the BAM files. A better way could be to revert BAM to fastq and just run kallisto again. It's quite fast and not memory-hungry.
Kallisto should do a smarter and more sophisticated job of counting genes than htseq-count, so this whole process seems backwards.
the point I think was that kallisto may be good at counting, but not so clear that is good at generating the correct BAM file. The pseudoBAM files are created from a kmer classification process, it is not so easy to create these correctly, especially when the various splicing, read strand, and clipping need to be accounted for.
I know that I have discovered several bugs in pseudoBAMs in the past.
Like ATpoint I would not recommend processing pseudoBAM files, these should be used for visualization. but if one needs actual counts they should be using the counting functionality rather than counting the BAM with HTSEQ
Thanks, I wasn't aware of that.
However, when checking the bam file header, I see it references to chromosomes 1-22 + X/Y/MT. And when checking the very last read at chromosome 1 it is mapped to a position very close the the chromosome end:
I do have, indeed, the abundance files from kallisto. However, they are quantifying transcripts and I was asked to analyze (at least to start with) differentially expressed genes through DESeq2.
These kallisto results, however, were produced using a very old annotation file:
I also have, for each sample, another bam file generated through histat2. However, I never used histat2 and I cannot find neither the command used to generate it, nor the annotation file version used.
The mapping should be, at least a little, splicing-aware. Exploring and comparing the results of the kalisto and histat2 bam shows reads spanning multiple exons. However, IGV doesn't show a junctions track for the histat2 file and it has much more "noise" in introns than the kalisto-generated file.
Could you help me quantify genes* for DESeq2 with the data I currently have?
what if the output of the following commands
and
Thanks,
When re-checking everything I realized that other samples also have a similar problem. For some weird reason running htseq-count in all samples it complains always about sample 011, but when running it with another single sample (001), it also crashes in a different (earlier!) line:
Which matches as before the very first line where FLAG == 67 or 131, RNAME == and CIGAR != in that sample:
The command you asked :
And the bam file header (identical in both files) has only chromosomes 1-22 + X/Y/MT:
I don't see anything specially wrong with these files. Do you?
flags 67 and 131 mean
how can it be proper pair when both align on the forward strand ... right there a contradiction
as I mentioned before pseudoBams are rarely robust - I advise against treating them as real BAM files,
a pseudoBAM is more of a convenience feature of the tool rather than a primary result
Thank you
I wanted to use htseq-count precisely because I know it can properly handle ambiguous paired reads and things like that. I was also handled "bedtools multicov" count files, but I don't trust those results at all.
My problem is... I don't have the original fastq files. I could generate them from the kallisto bams but... can I trust them?
To be honest I don't quite understand what you are trying to do. Why would you need to use htseq count in the first place? The bam file obtained from kallisto is not the kind of BAM file that a genomic quantification would produce. We use
htseq count
when the alignment reference is a genome. We cannot use the same tool when the alignment reference is a transcriptome.(Other people also already mentioned this issue in a comment)
What coordinates were you using anyway?
When you run kallisto, the sequence names in your pseudobam will be the transcript names already. You get the counts right away from that file. There is no need to run htseq to count anything,
For example
samtools idxstats
can tell you how many alignments you have for each transcript.TLDR: You can use
htseq-count
for BAM file generated withhisat2
alignment but not with the one generated withkallisto
, so getting an error there is maybe doing you a favor ;-) and keeps you from doing something incorrectTo me it looks like the kallisto bams are generated in chromosome space, not in transcriptome space. Yes, the gene lengths are totally wrong because they didn't add the -c flag when running kallisto:
Let me explain the situation and the analysis:
A colleague from the department approached me asking to do an "urgent" analysis for a collaboration of differentially expressed genes "similar to what you have already done for us before" (RNA-seq, STAR + htseq-count + DESeq2 analysis at the gene level across 2 conditions).
According to them (wet lab people, no bioinformatics background) it should be very fast because their collaborator had already sent them count files. Thus, I should only do the DEG analysis and not align the fastqs and yadda yadda.
The analysis is very underpowered. There are 2 control samples, 2 case samples with mutation A, and 1 case sample with mutation B. My colleagues were also hoping to get some extra controls generated by the same people, but we still don't know anything about it.
There is no communication between me and the source of this data. This data was generated by the clinical genetics department of a hospital, so they know what they are doing, but I am not familiar with the software they used.
What I ended up getting is:
According to an email, the count files were generated by running :
but I have no "proof" of the exact command used to generate them. I don't know neither which hisat2 bam file was used (pre- or post-splicing) nor what annotation file version was used. I was provided with a bed file with 54,475 gene/transcript entries, though.
Previously, for a different project, I had conflicting experiences with bedtools multicov as gene quantification method. Some poorly aligned/ambiguous reads will map across exons from two genes and multicov will count them in, while htseq-count will identify them as problematic and kept them out of the analysis. Because of that, I want to avoid using these provided counts files.
The kallisto abundance files look fine, but they are transcript-based. There are 180,254 different transcripts.
With such a small sample size in the project design, can I reliably test individual transcripts with 151bp paired-end reads?
However, I was asked to run gene-level DEG analysis. Can I really use these transcript-level counts to infer DE Genes? I have no experience with kallisto. Can I add up the counts of all transcripts of a gene to create a count of the gene? What about reads that could belong to multiple transcripts? Are they counted multiple times? Is PCR amplification corrected somehow?
Or should I just use tximport to load the kallisto abundance files into DESeq2 and follow a tutorial?
On top of all this, I do suspect that this data was sequenced originally with UMIs (the kallisto command used R1 & R3 .fastq files, and the sample naming convention matches that of a sequencing company I've used for other projects with UMIs to get rid of PCR-amplification bias). If my best option is to convert the kallisto pseudobams into fastqs, I'd better start from scratch with the original files.
As you see, I've been given a halfway project done using tools that I am not familiar with and I was trying to reverse-engineer what has been done and how can I make it work with the tools I'm used to.
Use tximport to get the kallisto files into DESeq. Bedtools is not the best way to get gene counts.
kallisto should not be run a genome sequence. It was not designed for that; I would be surprised if it worked.
I would guess your files are mislabeled as kallisto output when in fact were generated via hisat. Your Kallisto json shows a transcriptome file there.
If you have kallisto output can easily combine transcript-level counts into gene-level counts using tximport or many other techniques; no need to recount
Thank you all, I learned a lot.