I'm trying to do some CNV calling from Illumina Omni2.5 array data using PennCNV.
My array data come from a few different 'versions' of the Omni2.5 array (some from 2011, some from 2013 and some from 2014), so the SNP names have changed between versions as dbSNP was updated. I've tried combining the 2011 samples with the later samples in GenomeStudio to get everything consistent, but that just drops the sample call rates from 99% to 88%. I'm unsure whether this drop is due to inherent differences in the data or because the newer array versions have more targets. I figured I could get by by renaming the SNPs to Chr:Position, which would give me a consistent key across the different array versions.
PennCNV requires a population frequency of B allele (pfb) file created from a few hundred population samples. We don't have a large group of population samples, so I thought I'd use the 1000 GP EUR Omni2.5 data. After several days of manhandling the data through GenomeStudio I got as far as making the pfb file and realised that the 1000 GP data are in hg18 coordinates. UCSC's liftOver was my first port of call, but it removes ~1000 SNP locations that have been deleted between genome versions, so the order of the output no longer matches the input and I can't match the supplementary columns back to the lifted-over coordinates (which also throws my idea of using Chr:Position instead of rsIDs out the window).
So my questions are:
- How do I get 1000 GP Omni2.5 data from hg18 to hg19 format?
and related to this:
- How do I handle multiple versions of Omni2.5 array data? Using a newer SNP manifest file in GenomeStudio does not do the trick.
I've started a conversation with Illumina Tech support and our Rep about this but currently their response stands at:
"We realise the problems associated with using chips from different versions within a project and generally recommend where possible to use a single chip version per project."
(which of course is great in a perfect world, but this is science)
If I get a solution from Illumina I'll post it below.
Thanks muchly for any thoughts / ideas!
~ Nick
Did you ever get an answer? I have the same problem now.
I never got a clear resolution to the problem and largely abandoned that analysis. Not sure if your problem pertains to PennCNV or to Illumina genotype data?
What I would do now is check whether the 1KGP Omni data can be reclustered in GenomeStudio (or with a Bioconductor-based alternative in R) using the most up-to-date .bpm and .egt files from Illumina for that specific array type (there are likely hg19 versions somewhere).
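If reclustering isn't an option, the liftOver matching problem from the original post may be avoidable: liftOver keeps the extra BED columns, so putting the SNP name in the fourth (name) column lets you join the lifted coordinates back to any supplementary columns by name rather than by row order. Below is a minimal Python sketch of that idea. The function names and the `Name`/`Chr`/`Position` keys are my own choices for illustration, and it assumes SNP names are unique and contain no whitespace.

```python
def snps_to_bed(snps, handle):
    """Write (name, chrom, pos) records as BED4 lines.

    pos is 1-based (as in array manifests); BED start is 0-based,
    so a single SNP becomes the interval [pos - 1, pos).
    """
    for name, chrom, pos in snps:
        chrom = str(chrom)
        if not chrom.startswith("chr"):
            chrom = "chr" + chrom  # UCSC chain files use 'chr' prefixes
        handle.write(f"{chrom}\t{pos - 1}\t{pos}\t{name}\n")


def read_lifted_bed(handle):
    """Read liftOver's output BED into {name: (chrom, 1-based position)}."""
    lifted = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, _start, end, name = line.rstrip("\n").split("\t")[:4]
        lifted[name] = (chrom, int(end))
    return lifted


def update_coordinates(rows, lifted):
    """Replace Chr/Position in each row by name; drop SNPs liftOver could not map.

    rows: list of dicts with 'Name', 'Chr', 'Position' plus any extra columns.
    Input order is preserved for the SNPs that survive.
    """
    kept = []
    for row in rows:
        if row["Name"] not in lifted:
            continue  # these SNPs end up in liftOver's unmapped file
        chrom, pos = lifted[row["Name"]]
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        kept.append({**row, "Chr": chrom, "Position": pos})
    return kept

# Between the two steps, run liftOver on the command line, e.g.:
#   liftOver snps_hg18.bed hg18ToHg19.over.chain.gz snps_hg19.bed unmapped.bed
```

Because the join is by name, it doesn't matter that some SNPs are dropped or that the output order changes. Chr:Position keys can then be rebuilt from the hg19 coordinates afterwards.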
As for PennCNV: since my samples (and probably yours) aren't on the same array as the publicly available 1KGP data, it's difficult to create the pfb file, and there would likely be batch effects anyway. If PennCNV provides a suitable pfb, then maybe use that?
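If you do end up building a pfb yourself, the calculation itself is simple: as I understand it, the PFB for a marker is just the mean BAF across the reference samples, written out as a tab-delimited file with Name, Chr, Position and PFB columns. Here is a rough Python sketch under that assumption. The function names and input structures are hypothetical, and PennCNV ships its own `compile_pfb.pl` script for the same job, which is probably the safer choice for real data.

```python
import math


def compute_pfb(baf_by_snp):
    """Mean BAF per SNP, ignoring missing values (None or NaN).

    baf_by_snp: {snp_name: [BAF of sample 1, BAF of sample 2, ...]}
    SNPs with no valid BAF value at all are left out.
    """
    pfb = {}
    for name, values in baf_by_snp.items():
        valid = [v for v in values if v is not None and not math.isnan(v)]
        if valid:
            pfb[name] = sum(valid) / len(valid)
    return pfb


def chrom_key(chrom):
    """Sort numeric chromosomes numerically, then X, Y, etc. alphabetically."""
    return (0, int(chrom)) if chrom.isdigit() else (1, chrom)


def format_pfb(positions, pfb):
    """Build the pfb text, restricted to markers with known positions.

    positions: {snp_name: (chrom, position)}, e.g. hg19 coordinates for
    the markers on your own array version.
    """
    names = [n for n in pfb if n in positions]
    names.sort(key=lambda n: (chrom_key(positions[n][0]), positions[n][1]))
    lines = ["Name\tChr\tPosition\tPFB"]
    for n in names:
        chrom, pos = positions[n]
        lines.append(f"{n}\t{chrom}\t{pos}\t{pfb[n]:.4f}")
    return "\n".join(lines) + "\n"
```

Restricting the output to the markers in `positions` is deliberate: it keeps only the SNPs shared between the reference data and your own array, though it does nothing about batch effects between the two.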