Hi,
I'm trying to do SNP imputation using IMPUTE2. My genotype data (bim/bed/fam) is on hg18, has been through the usual quality control, and is mostly of European ancestry. The reference panel is on hg19.
My workflow:
- Determine flipped and ambiguous SNPs relative to hg18, using snpflip (https://github.com/biocore-ntnu/snpflip).
- Use PLINK to create map/ped files, with the flipped SNPs flipped back and the ambiguous SNPs removed.
- Lift over to hg19 using liftOverPlink.py (https://github.com/sritchie73/liftOverPlink).
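As a sanity check on the cleaning step, something like the following could confirm that no strand-ambiguous (A/T or C/G) SNPs are left. This is only a sketch: the function name `count_ambiguous` and the file name are made up for illustration, and it assumes the cleaned data is also available in binary format so that a .bim file with allele columns exists.

```shell
# Count strand-ambiguous SNPs (A/T or C/G allele pairs) in a PLINK .bim file.
# .bim columns: chromosome, SNP ID, genetic distance, bp position, allele 1, allele 2.
# count_ambiguous is a hypothetical helper name, not part of PLINK or snpflip.
count_ambiguous() {
  awk '{
    pair = toupper($5) toupper($6)
    if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC") n++
  } END { print n + 0 }' "$1"
}

# Usage (cleaned.bim is a placeholder file name):
#   count_ambiguous cleaned.bim
```

A result of 0 would indicate that the ambiguous SNPs were removed as intended.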
I then use GTOOL to convert ped/map to gen/sample. Final SNP counts per chromosome in the .gen file:
SNPs   chr
63234  1
64938  2
53168  3
49266  4
49529  5
49570  6
41084  7
42120  8
35963  9
42686  10
39260  11
37593  12
30342  13
24798  14
22896  15
23652  16
18379  17
23539  18
10569  19
20286  20
11094  21
10292  22
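For comparison, the same kind of tally could be taken from the PLINK .map files before and after liftover, to see whether SNPs were dropped or moved to another chromosome along the way. A minimal sketch, assuming the standard four-column .map layout; `count_per_chr` is a made-up helper name and the file names are placeholders.

```shell
# Print "count chromosome" for each chromosome in a PLINK .map file.
# .map columns: chromosome, SNP ID, genetic distance, bp position.
# count_per_chr is a hypothetical helper name.
count_per_chr() {
  awk '{ n[$1]++ } END { for (c in n) print n[c], c }' "$1" | sort -k2,2n
}

# Usage (placeholder file names):
#   count_per_chr before_liftover.map
#   count_per_chr after_liftover.map
```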
I then run IMPUTE2 with the hg19 reference panel (https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html), a chunk size of 5000000 bp, -Ne 20000 and -filt_rules_l 'EUR==0'.
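To make the chunking concrete, here is a sketch of how 5000000 bp intervals could be generated for one chromosome. It assumes the chunks are passed to IMPUTE2 through its -int option; the function name `chunk_intervals`, the chromosome length, and all file names in the comments are placeholders, not the actual setup.

```shell
# Print "start end" for consecutive chunks covering positions 1..chr_len.
# chunk_intervals is a hypothetical helper; the default chunk size is 5000000 bp.
chunk_intervals() {
  chr_len=$1
  chunk=${2:-5000000}
  start=1
  while [ "$start" -le "$chr_len" ]; do
    end=$((start + chunk - 1))
    if [ "$end" -gt "$chr_len" ]; then
      end=$chr_len
    fi
    echo "$start $end"
    start=$((end + 1))
  done
}

# Each interval would then go into one IMPUTE2 run, for example
# (map, hap, legend, gen and output names are placeholders):
#   impute2 -m chrN.map -h chrN.hap.gz -l chrN.legend.gz -g chrN.gen \
#     -int "$start" "$end" -Ne 20000 -filt_rules_l 'EUR==0' -o chrN.chunk.out
```

For a chromosome length of 12000000, `chunk_intervals 12000000` prints the three intervals 1 to 5000000, 5000001 to 10000000 and 10000001 to 12000000.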
I'm parsing the top-right entry in the concordance table (see https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#concordance_tables), using:
grep -A1 "Concordance" "$summaryfile" | grep -v "Concordance" | tr -s ' ' | cut -d ' ' -f 8,9 | grep -v -- -- | tr -d '\n'; echo
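To collect these values across all chunks, the pipeline above can be wrapped in a small loop. This is a sketch only: the function names `top_right_entry` and `collect_concordance` are invented for illustration, and the summary file naming in the usage comment is an assumption.

```shell
# Extract the two fields (columns 8 and 9 after squeezing spaces) from the
# line following the concordance table header, using the pipeline above.
# Both function names are hypothetical helpers.
top_right_entry() {
  grep -A1 "Concordance" "$1" | grep -v "Concordance" | tr -s ' ' \
    | cut -d ' ' -f 8,9 | grep -v -- -- | tr -d '\n'
  echo
}

# Print one line per summary file: the file name followed by the extracted fields.
collect_concordance() {
  for f in "$@"; do
    printf '%s %s\n' "$f" "$(top_right_entry "$f")"
  done
}

# Usage (placeholder file pattern):
#   collect_concordance chr*_chunk*_summary
```

Chunks whose summary contains no concordance table produce a line with just the file name, which makes them easy to spot before plotting.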
When plotting these values by chromosome, it turns out that chr1 has good concordance as expected (~95%), whereas all others are pretty bad:
The same happens when not filtering for EUR.
I'm at a loss here. Any idea what could be causing this?