Question

Why am I getting such a big difference in alignment rates in kallisto?

6 months ago

bioinfo ▴ 150

Hello,

I have a project that I aligned in the past using kallisto version 0.46 and an index built from Ensembl version 98. The mapping rate was around 30%. I reran it today using kallisto version 0.50 and an index built from Ensembl version 111, and the mapping rate was around 1%. I then built a new index from Ensembl version 98 with the new kallisto version, and the mapping rate was again around 30%.

The references that I use contain cDNA and ncRNA. Why would a newer Ensembl version reduce the alignment rate so much? In the past, with this type of project, I would usually get a low alignment rate when aligning to cDNA alone, and it would increase when I added the ncRNA. Now it only increased by 0.1 when I added the ncRNA from Ensembl version 111.

Thank you

kallisto ensembl • 674 views

ADD COMMENT • link 6 months ago by bioinfo ▴ 150

It's suspicious, and makes me wonder whether you simply downloaded a wrong file or had some hiccup during the process. Can you please make a reproducible example in the sense that you share all download links for the files, and all relevant code?

ADD REPLY • link updated 6 months ago by Ram 44k • written 6 months ago by ATpoint 86k

Thank you for your help. I made the reference like this:

wget https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/ncrna/Mus_musculus.GRCm39.ncrna.fa.gz
wget https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/cdna/Mus_musculus.GRCm39.cdna.all.fa.gz
cat Mus_musculus.GRCm39.cdna.all.fa.gz Mus_musculus.GRCm39.ncrna.fa.gz > mousev111cdnancra.fa.gz
kallisto index -i Mouse_v111_cdna_ncRNa mousev111cdnancra.fa.gz

This was the output when I made the reference.

[build] loading fasta file mousev111cdnancra.fa.gz
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 867 target sequences
[build] warning: replaced 1 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
KmerStream::KmerStream(): Finished
CompactedDBG::build(): Estimated number of k-mers occurring at least once: 116992695
CompactedDBG::build(): Estimated number of minimizer occurring at least once: 28810656
CompactedDBG::filter(): Processed 239488219 k-mers in 145855 reads
CompactedDBG::filter(): Found 116928824 unique k-mers
CompactedDBG::filter(): Number of blocks in Bloom filter is 799756
CompactedDBG::construct(): Extract approximate unitigs (1/2)
CompactedDBG::construct(): Extract approximate unitigs (2/2)
CompactedDBG::construct(): Closed all input files

CompactedDBG::construct(): Splitting unitigs (1/2)

CompactedDBG::construct(): Splitting unitigs (2/2)
CompactedDBG::construct(): Before split: 810111 unitigs
CompactedDBG::construct(): After split (1/1): 810111 unitigs
CompactedDBG::construct(): Unitigs split: 1799
CompactedDBG::construct(): Unitigs deleted: 0

CompactedDBG::construct(): Joining unitigs
CompactedDBG::construct(): After join: 750346 unitigs
CompactedDBG::construct(): Joined 60129 unitigs

[build] building MPHF
[build] creating equivalence classes ...
[build] target de Bruijn graph has k-mer length 31 and minimizer length 23
[build] target de Bruijn graph has 750346 contigs and contains 117032715 k-mers

The combined reference contains 145,855 targets, while the reference containing just cDNA has 115,911. Approximately 18,545 ncRNA targets are shared with the previous Ensembl version; I checked the est_counts for several of them and they look pretty similar.
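For anyone who wants to repeat this check across all shared targets rather than a handful, here is a minimal bash sketch. It assumes the standard kallisto abundance.tsv layout (target_id, length, eff_length, est_counts, tpm) and strips Ensembl version suffixes such as ".2" so that IDs match across releases. The function names and any file paths you pass in are hypothetical, not part of kallisto.

```shell
#!/usr/bin/env bash
# Count FASTA records (header lines) in a gzipped FASTA file.
count_targets() {
    gzip -dc "$1" | grep -c '^>'
}

# Print "id<TAB>est_counts_old<TAB>est_counts_new" for every target that
# appears in both abundance.tsv files (old run first, new run second).
compare_est_counts() {
    awk -F'\t' '
        FNR == 1 { next }                      # skip the header row of each file
        { id = $1; sub(/\.[0-9]+$/, "", id) }  # drop the version suffix
        NR == FNR { old[id] = $4; next }       # first file: remember est_counts
        id in old { print id "\t" old[id] "\t" $4 }
    ' "$1" "$2"
}
```

Something like `compare_est_counts v98_run/abundance.tsv v111_run/abundance.tsv | sort -k2,2gr | head` would then show the shared targets with the highest counts in the old run next to their counts in the new one.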

ADD REPLY • link 6 months ago by bioinfo ▴ 150

What happens if you combine the Ensembl 98 ncRNA with the Ensembl 111 cDNA?

ADD REPLY • link 6 months ago by dsull ★ 7.0k

I just tried this and the mapping rate increases to 30% again, so it does seem to be something with the Ensembl 111 ncRNA. I have downloaded it several times and I keep getting the same issue. Do you think there may be a transcript in Ensembl 98 that a lot of the reads align to but that is missing from Ensembl 111? When I run head on the files, the formatting looks the same.
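One way to test that idea is to list the targets present in the old FASTA but absent from the new one, and rank them by est_counts from the old run. The bash sketch below assumes that kallisto's target_id is the FASTA header text up to the first whitespace, and it ignores version suffixes. The function names and the argument order are my own, and the paths are placeholders.

```shell
#!/usr/bin/env bash
# List unversioned target IDs from a gzipped FASTA, one per line, sorted.
fasta_ids() {
    gzip -dc "$1" |
    awk '/^>/ { id = substr($1, 2); sub(/\.[0-9]+$/, "", id); print id }' |
    sort -u
}

# Targets in OLD_FASTA that are absent from NEW_FASTA, with their est_counts
# from OLD_ABUNDANCE (a kallisto abundance.tsv), highest counts first.
missing_by_counts() {
    local old_fa=$1 new_fa=$2 old_abundance=$3
    comm -23 <(fasta_ids "$old_fa") <(fasta_ids "$new_fa") |
    awk -F'\t' '
        NR == FNR { missing[$1]; next }   # stdin: IDs only in the old FASTA
        FNR > 1 {                         # abundance.tsv, header row skipped
            id = $1; sub(/\.[0-9]+$/, "", id)
            if (id in missing) print id "\t" $4
        }
    ' - "$old_abundance" |
    sort -t $'\t' -k2,2gr
}
```

If a few missing targets account for most of the counts, they would be the first ones to look up in both releases.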

ADD REPLY • link 6 months ago by bioinfo ▴ 150

I think that the quality of the samples was just bad. I tested it with some other samples and it worked fine with those.

ADD REPLY • link 6 months ago by bioinfo ▴ 150

Putting kallisto in your question implies that something is wrong with the program, while most likely the fault is with the files.

I suggest you take a couple of genes that are known to map and compare them between reference files by literally copying the lines on top of each other. If there is any formatting inconsistency it should be easy to see. If there are no differences in coding parts, at least you would be narrowing it down to ncRNA annotations.
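A rough bash sketch of that comparison: pull one target's sequence out of each gzipped FASTA as a single line and check whether the two match. The ID is matched without its version suffix, and the function names and file paths are hypothetical.

```shell
#!/usr/bin/env bash
# Print the sequence of one target as a single line. The ID is compared
# against the first header token with any version suffix removed.
fasta_seq() {
    local fasta=$1 id=$2
    gzip -dc "$fasta" | awk -v id="$id" '
        /^>/ { name = substr($1, 2); sub(/\.[0-9]+$/, "", name)
               keep = (name == id); next }
        keep { printf "%s", $0 }          # join wrapped sequence lines
        END  { print "" }
    '
}

# Report whether a target has the same sequence in two references.
same_seq() {
    local a b
    a=$(fasta_seq "$1" "$3")
    b=$(fasta_seq "$2" "$3")
    if [ -z "$a" ] || [ -z "$b" ]; then
        echo "missing"
    elif [ "$a" = "$b" ]; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For a "different" result, writing each `fasta_seq` output to a file and running `cmp` on the pair shows where the sequences first diverge.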

ADD REPLY • link updated 6 months ago by Ram 44k • written 6 months ago by Mensur Dlakic ★ 28k