Hi! I have received a GWAS dataset to analyze mitochondrial haplogroups that I pretend to do with HaploGrep2. As I understand HaploGrep2 uses rcrs as reference. My data have been mapped against human genome build 37, but the Illumina technical support told me that both, build 37 and build 38 of the human genome include rcrs. As I found some different opinions in the internet I was wondering if there exist any way to confirm which version of the mitochondrial genome was used to map my dataset. Has anybody have some advice on how to confirm that? I have already analyzed my data with HaploGrep2, and the results make sense, despite that they have a poor quality. Anyway, I would like to make shure that I am analyzing correctly. Any advice would be great!!!!
Thanks, Jeremy for your answer! Yes, my data is from microarray. I have already checked manually some positions in my plink file to see if the genotypes agree with those found in rCRS/NC_012920.1, and all them agree. Also, I recently saw in the chip manifest that several variants were obtained from rCRS, so I imagine that all them have been genotyped according to these positions. I have already run HaploGrep2 with my data converted to vcf and the haplogroups obtained matched the common haplogroups found by others in my population. The only problem is a high rate of individuals not classified to any haplogroup. I donĀ“t know if it is common to get samples not classified with HaploGrep2. Do you have any information about that? Thanks again! Iriel
sounds like something is wrong. you'd have to show us an unmatched sequence to know what is happening.
How can I show you one of my genotypes? It seems that I cannot post more than 5000 characters.
just post the non-header lines from the VCF. a mitochondrial genome shouldn't have more than a few variants.
POS REF ALT 5 72 A G 0/0 93 A G 0/0 125 A . 0/0 215 A G 0/0 217 A . 0/0 217 A . 0/0 228 G A 0/0 228 G A 0/0 236 A G 0/0 285 G . 0/0 295 G A 0/0 295 G A 0/0 418 G A 0/0 456 G A 1/1 477 A G ./. 489 A G ./. 499 G A ./. 606 A . 0/0 629 A . 0/0 710 A G 0/0 711 A . 0/0 722 G . 0/0 750 G A 0/0 921 A G 0/0 951 G A 0/0 961 A G 0/0 1005 A . 0/0 1018 G A 0/0 1041 A . 0/0 1048 G A 0/0 1243 A G 0/0 1382 A . 0/0 1393 G A 0/0 1406 A . 0/0 1413 A . 0/0 1438 G A 0/0 1442 G A 0/0 1664 G . 0/0 1694 A . 0/0 1700 A . 0/0 1703 G . 0/0 1706 G A 0/0 1709 G A 0/0 1717 A . 0/0 1719 G A 0/0 1721 G A 0/0 1736 A G 0/0 1811 A G 0/0 1842 A . 0/0 1888 G A 0/0 2056 G A 0/0 2218 G . 0/0 2259 G . 0/0 2283 G A 0/0 2332 G A 0/0 2352 A G 0/0 2358 A G 0/0 2416 A G 0/0 2483 A . 0/0 2706 G A 0/0 2758 G A 0/0 2768 A G 0/0 2772 G . 0/0 2831 G . 0/0 2880 A . 0/0 2885 A G 0/0 2885 A G 0/0 3010 G A 0/0 3010 G A 0/0 3027 A . 0/0 3116 G . 0/0 3197 A G 0/0 3200 T A 0/0 3206 G A 0/0 3210 G A 0/0 3316 G A 0/0 3338 A G 0/0 3348 A G 0/0 3384 A . 0/0 3394 A G 0/0 3396 A G 0/0 3398 A . 0/0 3434 A G 0/0 3497 G . 0/0 3547 A G 1/1 3571 G . 0/0 3644 A G 0/0 3645 A . 0/0 3666 G A 0/0 3720 A . 0/0 3736 G A 0/0 3746 G A 0/0 3826 A G 0/0 3866 A G 0/0 3921 G . ./. 4021 A G 0/0 4023 A . 0/0 4024 A . 0/0 4025 G A 0/0 4093 A . 0/0 4123 A G 0/0 4129 A . 0/0 4312 G A 0/0 4335 G . 0/0 4336 A G 0/0 4363 A . 0/0 4732 A G 0/0 4769 G A 0/0 4820 G A 1/1 4917 A G 0/0 4917 A G 0/0 4977 A G 1/1 5004 A . 0/0 5108 A . 0/0 5301 A . 0/0 5331 C A 0/0 5390 A . 0/0 5442 A G 0/0 5460 G A 0/0 5814 A G 0/0 5833 A . 0/0 6221 A G 0/0 6248 A G 0/0 6253 A G 0/0 6260 G A 0/0 6261 G A 0/0 6267 G A 0/0 6285 G A 0/0 6392 A . 0/0 6446 G A ./. 6480 G A 0/0 6663 A G 0/0 6680 A G 0/0 6734 G A 0/0 6982 A . 0/0 7142 A G 0/0 7175 A G 0/0 7274 G A 0/0 7476 G . 0/0 7684 A G 0/0 7697 G A 0/0 7853 G A 0/0 7859 G . 0/0 7870 A . 0/0 7972 A . 0/0 8020 G A 0/0 8414 G A 0/0 8417 G A 0/0 8460 A G 0/0 8478 G . 0/0 8502 A . 0/0 8684 G . 0/0 8701 A G 0/0 8857 G A 0/0 8859 G . 0/0 8896 G . 0/0 8946 A G 0/0 8964 G . 0/0 9064 G . 0/0 9072 A G 0/0 9093 A . 0/0 9098 A . 0/0 9140 G . 0/0 9377 A G 0/0 9477 G A 0/0 9540 A G 0/0 9587 A . 0/0 9667 A G 0/0 9716 A G 0/0 9785 G . 0/0 9804 G A 0/0 9855 A . 0/0 9899 A G 0/0 9903 A . 0/0 9950 A G 1/1 9966 G A 0/0 10007 A . 0/0 10031 A G 0/0 10034 A G 0/0 10034 A G 0/0 10086 A G 0/0 10118 A . 0/0 10143 G A 0/0 10238 A G 0/0 10310 G A 0/0 10321 A G ./. 10345 A G 0/0 10397 A . 0/0 10410 A G 0/0 10463 A G 0/0 10550 A G 0/0 10586 G A 0/0 10589 G A 0/0 10607 G . 0/0 10609 A . 0/0 10640 A G 0/0 10688 G A 0/0 10754 A . 0/0 11025 A . 0/0 11061 G . 0/0 11172 A . ./. 11177 G A 1/1 11204 A . 0/0 11251 A G 0/0 11253 A G 0/0 11288 G . 0/0 11377 G . 0/0 11467 A G 0/0 11536 G . 0/0 11560 A . 0/0 11696 G . 0/0 11899 A G 0/0 11914 G A 0/0 11959 A . 0/0 11963 G A 0/0 11969 G . 0/0 11984 A G 0/0 12121 A . 0/0 12236 G A 0/0 12248 A G 0/0 12285 A . 0/0 12308 A G 0/0 12454 G A 0/0 12542 G A 0/0 12630 G A 0/0 12642 A . 0/0 12669 G A 0/0 12705 A G 1/1 12771 G A 0/0 12820 G . 0/0 12822 A . 0/0 12850 A G 0/0 12937 A . 0/0 12940 G A 0/0 12950 A G 0/0 13105 A G 0/0 13105 A G 0/0 13135 G A 0/0 13183 A . 0/0 13215 A . 0/0 13263 A G 0/0 13276 A G 0/0 13650 G A 0/0 13708 G A 1/1 13759 G A 0/0 13780 A G 1/1 13789 A G 0/0 13880 C A 0/0 13886 A G 0/0 13924 G A 0/0 13933 A . 0/0 13934 G A 0/0 13942 A . 0/0 13958 G C 0/0 13965 A G 0/0 13966 A G 0/0 13981 G . 0/0 14000 A T 0/0 14025 A . 0/0 14053 A G 0/0 14059 A G 0/0 14148 A G 0/0 14178 A G 0/0 14182 A G 0/0 14258 G . 0/0 14318 A G 0/0 14323 G . 0/0 14338 G . 0/0 14417 A . 0/0 14502 A . 0/0 14512 A . 0/0 14544 G A 0/0 14569 G A 0/0 14687 A G 0/0 14793 A G 0/0 14798 A G 0/0 14861 G A 0/0 14872 G A 0/0 14890 A G 0/0 14893 A . 0/0 14927 A . 0/0 14978 A G 0/0 14979 A . 0/0 15038 A . 0/0 15043 G A 0/0 15071 A . 0/0 15074 A . 0/0 15077 G A 0/0 15119 G A 0/0 15148 G A 0/0 15172 G A 0/0 15204 A . 0/0 15218 A G 0/0 15236 A G 0/0 15257 G A 0/0 15263 G . 0/0 15301 G A 0/0 15314 G A 0/0 15323 G A 0/0 15326 G A 0/0 15381 G . 0/0 15402 G . 0/0 15431 G A 0/0 15452 C A 0/0 15479 A G 0/0 15487 T A 0/0 15497 G A 0/0 15514 A G 0/0 15662 A . 0/0 15670 A G 0/0 15672 A G 0/0 15693 A G 0/0 15758 A G 0/0 15784 A G 1/1 15812 G A 0/0 15824 A G 0/0 15833 G A 0/0 15849 G A 0/0 15860 A . 0/0 15889 A . 0/0 15904 G A 0/0 15904 G A 0/0 15905 A G 0/0 15907 A . 0/0 15924 A G 0/0 15924 A G 0/0 15927 G A 0/0 15928 G A 0/0 15928 G A 0/0
I had to eliminate a few positions due to the character limit, and the columns ID, INFO (PR) and FORMAT (GT). The chip variant IDs are: ID 2010-08-MT-841 2010-08-MT-981 200610-102 2010-08-MT-550 exm-rs41531144 MitoT217C exm-rs41323649 MitoG228A 200610-105 exm2263307 exm-rs41528348 MitoC295T exm2216184 MitoC458T MitoT479C exm-rs28625645 exm-rs3901846 exm2216185 exm2216186 exm2216191 2010-08-MT-830 200610-42 MitoG752A exm2216198 exm2216200 exm2216201 exm2216202 exm2216203 2010-08-MT-27 exm2216204 exm2216208 exm2216209 exm2216210 exm2216211 exm2216212 exm2216213 exm2216214 exm2216216 exm2216217 exm2216218 exm2216219 exm2216220 exm2216221 exm2216222 exm-rs3928305 2010-08-MT-526 MitoA1738G exm2216228 exm2216231 exm2216232 exm2216234 exm2216238 exm2216239 exm2216240 exm2216241 exm2216242 exm2216243 exm2216244 MitoC2485T exm2216249 exm2216250 exm2216251 exm2216252 exm2216254 200610-2 exm2216255 MitoT2887C exm2216256 MitoG3012A exm2216257 exm2216258 exm2216259 exm2216260 exm2216261 exm2216262 exm-rs2853516 exm2216265 MitoA3349G 200610-4 exm2216266 200610-111 exm2216267 exm2216268 exm2216269 exm2216271 exm2216272 200610-112 200610-113 MitoG3667A MitoA3721G exm2263308 exm2216273 200610-114 exm2216274 200610-45 exm2216277 200610-115 MitoA4025G 200610-47 exm2216280 exm2216281 exm2216282 exm2216284 200610-48 exm-rs41456348 exm2216286 exm2216294 MitoG4770A MitoG4821A exm2263338 MitoA4918G MitoT4978C MitoT5005C 200610-120 200610-7 exm2216303 MitoA5391G MitoT5443C exm-rs3021088 exm2216317 200610-9 MitoT6222C 200610-121 exm2216322 MitoG6261A exm2216323 exm2216324 200610-80 200610-123 200610-81 exm2216325 exm2263312 MitoT6681C 200610-82 exm2263313 2010-08-MT-831 MitoT7176C MitoC7275T exm2216332 2010-08-MT-860 2010-08-MT-861 2010-08-MT-874 200610-86 200610-128 200610-13 200610-87 exm2216340 exm2216341 exm2216342 exm2216343 200610-14 exm2263319 exm2263340 exm2263320 200610-55 exm2216361 200610-18 200610-56 200610-89 MitoA9073G MitoA9094G exm2216368 200610-57 MitoG9378A exm-rs2853825 MitoC9541T 200610-21 MitoA9668G MitoT9717C 200610-59 exm2263321 exm2216376 MitoT9900C exm2216377 MitoT9951C exm2216378 exm2216379 exm2216380 exm2263322 MitoT10035C exm2216384 200610-133 exm2216385 MitoT10239C MitoG10311A exm2216386 exm2216387 200610-22 exm2216389 exm-rs28358279 MitoA10551G MitoG10587A MitoG10590A 200610-60 exm2216391 200610-135 MitoG10689A 200610-23 exm2216394 200610-61 exm2216396 exm2216397 exm2216398 MitoA11252G exm2216399 200610-63 MitoG11378A MitoA11468G 200610-65 2010-08-MT-82 exm2216400 MitoT11900C MitoG11915A 200610-25 exm2216401 exm2216402 exm2216403 200610-136 exm2216406 exm2216407 200610-137 exm-rs2853498 exm2216417 exm2216418 MitoG12631A 2010-08-MT-143 MitoC12670T MitoT12706C 200610-94 exm2263325 200610-27 MitoG12851A exm2263326 exm2216420 exm2216421 exm2216422 MitoA13106G exm2216423 200610-30 200610-138 MitoA13264G exm2216424 MitoC13651T exm-rs28359178 2010-08-MT-204 exm2263327 exm2216429 exm2216431 exm2216432 exm2216433 exm2216434 exm2263328 exm2216436 exm2216437 MitoT13966C exm2263329 exm2216439 exm2216440 200610-139 exm2216441 exm2216442 exm2216443 exm2216444 2010-08-MT-232 exm2216446 exm2216447 200610-96 200610-71 2010-08-MT-251 exm2216448 200610-141 200610-97 200610-98 exm2263330 exm2263331 exm-rs28357681 exm2216456 2010-08-MT-288 200610-33 200610-34 exm2216457 exm2216458 exm2216459 exm2216460 MitoG15044A exm2216461 exm2216462 exm2216463 exm2263332 200610-99 200610-100 exm2216465 exm-rs2853506 exm2216467 exm2216468 exm2263333 MitoA15302G exm2216470 exm2216471 exm2216472 exm2216473 exm2216474 exm2216475 exm-rs3088309 exm2216478 200610-35 exm2216479 200610-143 exm2216481 MitoT15671C exm2216482 exm2263334 exm-rs41337244 MitoT15785C exm2216485 exm2216486 MitoC15834T exm2216487 exm2216489 exm2263335 exm2263341 MitoC15905T exm2216491 exm2216492 exm-rs2853510 MitoA15925G exm2216494 exm2263337 MitoG15929A MitoG15931A exm2216495 exm2263336 exm2216496 exm2216497 exm2216498 exm2216499 exm2216500 exm2216501 2010-08-MT-395 2010-08-MT-398 MitoG16130A MitoT16145C MitoG16146A MitoA16163G MitoA16164G 200610-146 MitoC16272T exm-rs41378955 MitoG16393A 2010-08-MT-511 200610-37
So the only variant lines I see are the 1/1's, correct? 0/0 means reference, ./. means no coverage Then the reference allele is not correct for some of them. In rCRS
This would make sense if they were all strand complements, but 12705 would be a 'G' but you called it an 'A'. I'm not sure what's going on. Can you provide the name of the Illumina chip?
Here are some more issues, pointing to a wrong reference as well:
Yes, 1/1 means a difference from reference as I understand. This file was converted with plink from a plink (.ped) file obtained from microarray, so some variants are genotyped according to different strands of the reference. But I run HaploGrep with the chip flag. Do you think it does not consider these differences in strands?
Haplogrep is not going to understand that some of your positions are on the complementary strand. I would do one of the following (3 choices): 1) try to convert your VCFs into something that has the correct reference and alternate at all positions relative to the + strand 2) generate a hsd file as described here: https://github.com/seppinho/haplogrep-cmd in which case you don't need the reference but your alternate allele must be correct 3) inject your variants into a rCRS fasta reference file
That's correct Jeremy, HaploGrep does not perform a remapping/liftover here - so there's a strand issue - as both heavy and light-strands seem to be mixed in the mtSNPs on this MicroArray. Interestingly the Microarray is designed detecting purines (A/G) and only for transversions pyrimidines (C/T).
HaploGrep also accepts HSD files without alternate allele, by simply providing the SNP position, however assuming a transition in this case, transversions always need to provide the alternate base.
While I agree that HaploGrep needs to be improved to detect such strand-issues, as a quick workaround you should try to make a remapping of the VCF file so that the reference base and alternative bases are concordant to the rCRS -http://phylotree.org/resources/rCRS_annotated.htm
I have switched the complementary strands and corrected the ancestral and alternative alleles of my vcf file to agree with rCRS. I think the problem was that my files were created with plink, which assigns ancestral and alternate alleles according to their frequency in the sample, and not to the real reference. I corrected that in plink and then I run HaploGrep again. All samples were classified this time, with quality scores from 0.500 (only one sample) to 1. I have read in papers that the quality score is relative to a complete sequenced mitogenome so the quality scores should be low for chip generated data. That applies also to the last version of HaploGrep or not?
Hi! This sounds good, great to hear that. When you run haplogrep with the chip flag, the scores get better - it ignores the sites which are not investigated, the number of expected sites for a haplogroup gets limited to the one found on the microarray. This is not comparable to a complete-sequencing, as you can get a number of best matches (cluster). However it gives you a good estimation on how well the data can be used for haplogroup classification here. 0.5 does not neccessarily mean bad - if the sample really is a H2a2a1 haplogroup, it will not have any SNPs, but if there are a lot of those, it is an hint that something went wrong with genotyping.
Thanks a lot for your help, Jeremy! I will try to create the hsd file and check my results.
Yes, the chip was Infinium CoreExome-24 v1.1