Hello All,
I need your help creating a custom reference panel for imputation. I have around 100 VCF files and would like to build my own reference panel for analysis. It would be great if you could explain how to create such a panel or point me to a tutorial. For example, we generally use reference panels such as 1000 Genomes Phase 3 (2,535 samples), 1000 Genomes Phase 1 (1,094 samples), HapMap2 (269 samples), and the Haplotype Reference Consortium (32,914 samples). Any thoughts would be highly appreciated.
Regards, Dav
Thanks a lot, Kevin. I am so happy to see a solution at last! I will try this and hopefully create my own reference panel. I will go through the IMPUTE2 tutorials. One question: will this reference panel be compatible with the Michigan Imputation Server?
Yes, the home page states that they support VCF format, so I would follow the 3-step process that I mention above and get your files into a single large VCF. If you need help with code for this, then let me know. I do this routinely.
Thanks a lot, Kevin. It would be great if you could help me with the script. I would be thankful for the help.
For merging them all, you can try to follow this. It assumes that your VCFs are in /home/MyDir/Raw/. You will also need a FASTA reference genome whose build matches the one used to call variants in your VCFs (it's possible to avoid this step though - you'll come across it). The steps are:
Get a list of all VCF samples
Loop through the files and bgzip/tabix each
Get a list of all gzVCF samples
Normalise all files and store as BCF
Get a list of all bcf samples
Merge all BCFs
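The steps listed above can be sketched roughly as follows. This is only a minimal sketch, not the exact commands from the original answer: it assumes that bgzip, tabix and bcftools are installed and on your PATH, that sample names are unique across the input files, and that reference.fasta is a hypothetical placeholder for your own reference genome. The function names and the .norm.bcf naming are made up for illustration. With dry_run=True it only prints the commands, so you can inspect them before running anything.

```python
import subprocess
from pathlib import Path


def bcf_path(vcf):
    """Name of the normalised BCF written for one input VCF (illustrative naming)."""
    return vcf.with_name(vcf.stem + ".norm.bcf")


def per_file_commands(vcf, ref_fasta):
    """Commands to compress, index, normalise and re-index a single VCF."""
    gz = Path(str(vcf) + ".gz")
    bcf = bcf_path(vcf)
    return [
        # bgzip replaces sample.vcf with sample.vcf.gz
        ["bgzip", "-f", str(vcf)],
        ["tabix", "-p", "vcf", str(gz)],
        # Split multi-allelic sites, check REF alleles against the FASTA
        # (warn only) and write compressed BCF
        ["bcftools", "norm", "-m-any", "--check-ref", "w",
         "-f", str(ref_fasta), "-Ob", "-o", str(bcf), str(gz)],
        ["bcftools", "index", str(bcf)],
    ]


def merge_command(bcfs, merged_vcf):
    """Command to merge all normalised BCFs into one multi-sample VCF."""
    return ["bcftools", "merge", "-Ov", "-o", str(merged_vcf)] + [str(b) for b in bcfs]


def build_panel(vcfs, ref_fasta, merged_vcf, dry_run=True):
    """Build (and optionally run) the full list of commands."""
    commands = []
    for vcf in vcfs:
        commands.extend(per_file_commands(vcf, ref_fasta))
    commands.append(merge_command([bcf_path(v) for v in vcfs], merged_vcf))
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Directory from the post; reference.fasta is a hypothetical path
    vcfs = sorted(Path("/home/MyDir/Raw").glob("*.vcf"))
    if vcfs:
        build_panel(vcfs, Path("reference.fasta"), "ProjectMerge.vcf", dry_run=True)
    else:
        print("No VCF files found")
```

Once the printed commands look right for your data, switch to dry_run=False to actually run them. Note that bgzip -f replaces each original .vcf with its .vcf.gz, so keep a backup if you need the uncompressed files.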
Thanks a lot, Kevin, I am grateful to you.
So, is the final file "ProjectMerge.vcf" the reference panel? Before I go into a long process, do you have a small snippet of a reference panel to share? I would like to look at the data structure.
Thanks,
Yes, it is just a large VCF containing all variants called in the healthy controls or, to be more specific, in the samples that you choose as your reference panel for imputation. It looks like any standard multi-sample VCF. Sorry, I cannot show a screenshot right now as I'm on the wrong laptop.
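To give a rough idea of the structure in the meantime, here is a toy example. The records, positions and sample names (SampleA, SampleB, SampleC) are entirely made up for illustration and do not come from any real panel. It only shows the general layout of a multi-sample VCF: header lines, then one row per variant with one genotype column per sample.

```python
# Toy multi-sample VCF; every value here is invented for illustration only
TOY_VCF = (
    "##fileformat=VCFv4.2\n"
    "##contig=<ID=1>\n"
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSampleA\tSampleB\tSampleC\n"
    "1\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t0|0\t1|1\n"
    "1\t2000\t.\tT\tC\t.\tPASS\t.\tGT\t0|0\t0|1\t0|1\n"
)


def summarise(vcf_text):
    """Return the sample names and a list of (chrom, pos, ref, alt, genotypes)."""
    samples = []
    records = []
    for line in vcf_text.splitlines():
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        # In VCF, "|" separates phased alleles and "/" unphased ones
        genotypes = dict(zip(samples, fields[9:]))
        records.append((chrom, int(pos), ref, alt, genotypes))
    return samples, records


samples, records = summarise(TOY_VCF)
print(len(samples), "samples,", len(records), "variants")
for chrom, pos, ref, alt, genotypes in records:
    print(chrom, pos, ref, alt, genotypes)
```

Your merged file would have the same shape, just with one column per sample in your panel and one row per variant site.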
Oh, alright. Any other time is good.
Hi Kevin and Kirannbishwa01, I am a newbie in bioinformatics and am trying to create a reference panel following the steps here. The above steps work well, except for the step involving the Perl scripts: "When you have your samples in a single VCF, convert them to GEN format for IMPUTE2 with this script (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#scripts)". Any code examples of how you proceeded would be helpful.
Hi, this thread is quite old, and the IMPUTE pages are even older. What I believe you should do is:
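One possible route, sketched here as an assumption and not as the advice originally given, is to skip the old Perl scripts and let bcftools do the conversion. bcftools convert has a --gensample option that writes gen/sample files and a --haplegendsample option that writes hap/legend/sample files, both in the formats used by IMPUTE2. The function name and the ProjectMerge output prefix below are hypothetical, and the sketch only builds the command so that you can check it before running it.

```python
import subprocess


def convert_command(merged_vcf, out_prefix, kind="gensample"):
    """Build a bcftools convert command for IMPUTE2-style output.

    kind: "gensample" for gen/sample files, or
          "haplegendsample" for hap/legend/sample files.
    """
    if kind not in ("gensample", "haplegendsample"):
        raise ValueError("kind must be 'gensample' or 'haplegendsample'")
    return ["bcftools", "convert", "--" + kind, out_prefix, merged_vcf]


cmd = convert_command("ProjectMerge.vcf", "ProjectMerge")
print(" ".join(cmd))

# Uncomment to actually run it (assumes bcftools is installed and on PATH):
# subprocess.run(cmd, check=True)
```

It is worth checking the bcftools convert documentation for how each mode handles multi-allelic or unphased sites before relying on the output.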
Thanks, Kevin, for the advice. I will try it out.
Hi Kevin!
I have 80 samples of GBS data. I have called variants with the GATK pipeline, and now I have to perform imputation.
Do I need to use these 80 samples to build the reference panel?
Thanks for your help!