Entering edit mode
17 months ago
anikcropscience
▴
260
Hello,
Do you know a good way to map contigs/snp-regions from one genome to another?
At the moment, I would like to check the positions of the SNPs
that I obtained from one reference genome on the new preliminary scaffolded reference genome. I tried so far a few tools I would normally use, but they fail with a genome size of 33GB
. BLAST
cannot handle long sequences (the chromosomes) when creating a BLAST-db
. minimap2
gives an error, and bwa-mem2
gives a memory error. Do you have some suggestions?
Thank you.
Thank you. Crossmap indeed looks promising.
I used CrossMap and it's pretty straightforward. There was a gotcha with chromosome names, when it was renaming them in the header, but not in the body of the VCF, so bcftools afterwards complains about undefined contigs and generally won't touch the file. I fixed it by running CrossMap again with
--chromid s
which worked fine for my file (it was human data lifted over between hg38 -> hg19).Also don't bother with compression, CrossMap just calls gzip - it's better to compress your output with bgzip, so you can index it for quick random access ;) Look at tabix if you are not familiar with this.
I picked CrossMap after seeing it was used by the 1000 genomes folks to compare their latest data (high-coverage resequencing of phase 3 data mapped to hg38) with their older variant calls (most used hg19). I thought if it was good enough for them, it should be good enough for me too :D
I am working with plant genomes and this tool uses a chain, also GATK as well. Is there any alternative tool that you know of?
Hmm, first idea was minimap2 but you tried that - what error you got there btw? There is another similar tool LASTZ you may wanna try. But I think even standard approaches to compare assemblies exist.
I'd probably compare assemblies using various stats and then proceed to denovo variant calling on the new reference, rather than trying liftover - you can always lose information during liftover, while denovo calling is gonna get you the maximum information present in your data IMHO.
The error is related to memory usage since it is a 33GB genome. Yes, I can see the benefit of calling new SNPs on the new genome but we created and genotyped the samples using SNPs from the previous genome. This is why I am trying to use Liftover.
In principle, you should get the same information no matter if you call denovo or lift over - the end result should be the same. All I'm saying is that, in practice, variant callers are waaay more tested and robust than any liftover tool I've seen.
Your SNPs probably won't have the same coordinates anyway, so you cannot compare them directly. Instead, I'd probably look at higher-level information, like heterozygosity of samples, or some PCA/MDS clustering - if these look similar enough, then your new data are probably fine too.
I fully agree with you and I would have done the same thing probably. But I am just following a protocol and trying to achieve something through a tool that can deal with such a huge genome.
I get it, but I always wondered how much information got messed up by the liftover. I did MDS on both versions, which looked the same, so I moved on for a time, but I kept wondering if some issue later creeps up. Later we abandoned that dataset in favour of a different one, for mostly unrelated reasons, although I always considered a plus that the other dataset didn't go through a liftover.
I just think I would trust the output of a variant caller more.
Another question - is there significant benefit in moving to the new assembly? Can you just stay with the old assembly if all your samples use it?
Or do you rather just want some sort of "dictionary" to translate between the two references? For example to move samples (and some results) to the new reference when it gets good enough?
We thought about such things in my previous team but I think it was never done even after I left. I can ask though.
Yes, there is a significant benefit because the new assembly is scaffolded to the chromosome level with PacBIO+Hi-C technology. And we will use this new genome in future analysis.
But if new assembly is substantially different than the prior assembly you should reanalyze the data.
Ok thank you.
I see, it's similar to the situation in our old team. We worked for years with transcriptome-based reference and exome-capture data, publishing several papers on the data while working on whole genome sequencing in the meantime.
My supervisor wanted to be able to translate between the two references so he doesn't need to throw out all that built-up knowledge. But in the end, it needs to be repeated on the genome assembly denovo, or else you always wonder if something went wrong/suboptimal during various conversions. I don't know how far they got after I left, but I hope they just repeated necessary stuff denovo and produce new results on the new genome instead of constantly switching between the two contexts.