Hello,
I have a project that I previously aligned using kallisto version 0.46 and an index made from Ensembl version 98. The mapping ratio was around 30%. I reran it today using kallisto version 0.50 and an index made from Ensembl version 111, and the mapping ratio was around 1%. I then created a new reference from Ensembl version 98 with the new kallisto version, and the mapping ratio was again around 30%.
The references that I use contain cDNA and ncRNA. Why would a new version reduce the alignment rate so much? In the past, with this type of project, I would usually get a low alignment rate when aligning to cDNA alone, and it would increase when I added the ncRNA. Now it only increased by 0.1 when I added the ncRNA from Ensembl version 111.
Thank you
It's suspicious, and makes me wonder whether you simply downloaded a wrong file or had some hiccup during the process. Can you please make a reproducible example in the sense that you share all download links for the files, and all relevant code?
Thank you for your help. I made the reference like this:
This was the output when I made the reference.
The number of targets in the reference is 145,855, while the reference containing just cDNA has 115,911. Approximately 18,545 ncRNA targets are shared with the previous Ensembl version; I checked the est_counts for several of them and they look pretty similar.
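The overlap between two releases can be counted directly from the FASTA headers. The following is a minimal sketch, not the poster's actual method: it assumes Ensembl-style headers where the first token after `>` is the transcript ID with a `.version` suffix, and the function names `fasta_ids` and `compare_ids` are hypothetical.

```python
def fasta_ids(lines, strip_version=True):
    """Collect transcript IDs from FASTA header lines.

    Assumes the ID is the first whitespace-delimited token after '>',
    optionally followed by a '.version' suffix (e.g. 'ENST...1.2').
    """
    ids = set()
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue  # skip a bare '>' header with no ID
        tid = fields[0]
        if strip_version:
            tid = tid.split(".")[0]
        ids.add(tid)
    return ids


def compare_ids(old_lines, new_lines):
    """Return IDs shared by both references and IDs unique to each."""
    old, new = fasta_ids(old_lines), fasta_ids(new_lines)
    return {
        "shared": old & new,
        "only_old": old - new,
        "only_new": new - old,
    }
```

To use it, pass two open file handles (for gzipped downloads, `gzip.open(path, "rt")`) for the old and new ncRNA files. The `only_old` set lists the targets that disappeared between releases.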
What happens if you combine the Ensembl 98 ncRNA with the Ensembl 111 cDNA?
I just tried this and the mapping ratio increases to 30% again. It does seem to be something with the Ensembl 111 ncRNA then. I have downloaded it several times and I keep getting the same issue. Do you think there may be a transcript in Ensembl 98 that a lot of the data aligns to but that is missing from Ensembl 111? When I run head on the files, the formatting seems to be the same.
I think that the quality of the samples was just bad. I tested it with some other samples and it worked fine with those.
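One way to test the earlier hypothesis, that most reads pile up on a transcript present in Ensembl 98 but absent from Ensembl 111, is to rank the targets in the Ensembl 98 run's abundance.tsv by est_counts and keep those whose IDs do not occur in the newer reference. This is a rough sketch under assumptions: it relies on the standard kallisto abundance.tsv columns (target_id, est_counts), strips the version suffix before matching, and `top_missing_targets` is a hypothetical helper name.

```python
import csv


def top_missing_targets(abundance_lines, new_ids, n=10):
    """Rank targets with reads in the old run that are absent from the new reference.

    abundance_lines: iterable of lines from a kallisto abundance.tsv
    new_ids: set of version-stripped transcript IDs in the new reference
    """
    reader = csv.DictReader(abundance_lines, delimiter="\t")
    missing = []
    for row in reader:
        tid = row["target_id"].split(".")[0]  # drop '.version' suffix
        counts = float(row["est_counts"])
        if counts > 0 and tid not in new_ids:
            missing.append((tid, counts))
    # Highest estimated counts first
    missing.sort(key=lambda item: item[1], reverse=True)
    return missing[:n]
```

If a handful of missing targets account for most of the estimated counts, the drop from 30% to 1% would come from annotation changes between releases rather than from the kallisto version.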
Putting kallisto in your question implies that something is wrong with the program, while most likely the fault is with the files.
I suggest you take a couple of genes that are known to map and compare them between reference files by literally copying the lines on top of each other. If there is any formatting inconsistency it should be easy to see. If there are no differences in coding parts, at least you would be narrowing it down to ncRNA annotations.
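The side-by-side comparison suggested above can also be done programmatically. The sketch below assumes the first header token is the transcript ID and that version suffixes may differ between releases; it ignores line wrapping and letter case so that only real sequence changes are reported. The names `parse_fasta` and `changed_sequences` are hypothetical.

```python
def parse_fasta(lines):
    """Map version-stripped transcript IDs to their full sequences."""
    chunks = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0].split(".")[0] if fields else None
            if current is not None:
                chunks[current] = []
        elif current is not None:
            # Join wrapped sequence lines; compare case-insensitively
            chunks[current].append(line.upper())
    return {tid: "".join(parts) for tid, parts in chunks.items()}


def changed_sequences(old_lines, new_lines):
    """List IDs present in both references whose sequences differ."""
    old, new = parse_fasta(old_lines), parse_fasta(new_lines)
    shared = old.keys() & new.keys()
    return sorted(tid for tid in shared if old[tid] != new[tid])
```

An empty result for the cDNA files would indicate that the shared coding transcripts are identical, which narrows the search to the ncRNA annotations.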