Question

Custom Reference panel creation for data imputation from .vcf files

3

7.7 years ago

David_emir ▴ 500

Hello All,

I need your help creating a customized reference panel for imputation. I have around 100 VCF files and would like to build my own reference panel from them for analysis. It would be great if you could explain how to create such a panel or point me to tutorials on it. For comparison, we generally use the following reference panels: 1000 Genomes Phase 3 (2,535 samples), 1000 Genomes Phase 1 (1,094 samples), HapMap2 (269 samples), Haplotype Reference Consortium (32,914 samples), etc. Any thoughts on this would be highly appreciated.

Regards, Dav

imputation reference panel • 6.5k views

ADD COMMENT • link updated 15 months ago by analyst ▴ 70 • written 7.7 years ago by David_emir ▴ 500

score 2 · Answer 1 · 2017-12-19

2

7.7 years ago

Kevin Blighe 89k

IMPUTE2 will allow you to do this, but it may take you a full day to understand the various ways in which the program works. One of their tutorials actually goes through the use of a custom reference dataset (for imputation) alongside the 1000 Genomes dataset. Take a look here: Merging Reference Panels.

You should first merge all of your VCFs together. To merge them all together initially, I would recommend the following:

bgzip and tabix-index each file individually (bgzip and tabix, both part of HTSlib)
normalise by left-aligning indels and splitting multi-allelic sites (BCFtools norm)
merge all together (BCFtools merge)

Edit (October 23, 2018):

When you have your samples in a single VCF, convert them to GEN format for IMPUTE2 with this script (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#scripts).
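
As an alternative to the Perl script (a swapped-in approach, not the script linked above), bcftools itself has a `convert --gensample` option that writes IMPUTE2-style GEN/sample files. A minimal sketch, assuming the merged file sits at /home/MyDir/ProjectMerge.vcf.gz; both paths below are assumptions, and the command is only executed when bcftools and the input file are actually present:

```shell
# Sketch only: build (and, if possible, run) a bcftools convert command.
# Both paths are assumptions; adjust them to your own layout.
merged="/home/MyDir/ProjectMerge.vcf.gz"   # merged, bgzipped VCF (assumed location)
prefix="/home/MyDir/ProjectMerge"          # output prefix for the GEN/sample files

convert_cmd="bcftools convert --gensample ${prefix} ${merged}"

# Print the command so it can be inspected before anything is run
echo "${convert_cmd}"

# Only run when bcftools is installed and the input file exists
if command -v bcftools > /dev/null 2>&1 && [ -f "${merged}" ]
then
    ${convert_cmd}
fi
```

With the prefix form, bcftools should write the GEN and sample files next to each other under that prefix. Check the output against what your IMPUTE2 version expects before relying on it.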

ADD COMMENT • link 5.7 years ago by Kevin Blighe 89k

0

Thanks a lot, Kevin. I am so happy to see a solution at last! I will try this and hopefully create my own reference panel. I will go through the IMPUTE2 tutorials. One question: will this reference panel be compatible with the Michigan Imputation Server?

ADD REPLY • link 7.7 years ago by David_emir ▴ 500

0

Yes, the home page states that they support VCF format, so I would follow the 3-step process that I mention above and get your files into a single large VCF. If you need help with code for this, then let me know. I do this routinely.

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Thanks a lot, Kevin. It would be great if you could help me with the script. I would be thankful for the help.

ADD REPLY • link 7.7 years ago by David_emir ▴ 500

2

For merging them all, you can try to follow this. It assumes that your VCFs are in /home/MyDir/Raw/. You will also need a FASTA reference genome whose build matches the one used to call the variants in your VCFs (it's possible to avoid this step though - you'll come across it).

This:

bgzips and tabix-indexes each VCF
normalises each VCF to produce a BCF
merges all BCFs into a single VCF

Get a list of all VCF samples

cd /home/MyDir/ ;
find /home/MyDir/Raw/ -name "*.vcf" | sort > /home/MyDir/VCF.list ;

Loop through the files and bgzip/tabix each

while read VCFfile
do
    bgzip -f "$VCFfile" ;
    tabix -f -p vcf "$VCFfile".gz ;
done < /home/MyDir/VCF.list

Get a list of all gzVCF samples

find /home/MyDir/Raw/ -name "*.vcf.gz" | sort > /home/MyDir/gzVCF.list ;

Normalise all files and store as BCF

#1st pipe, splits multi-allelic calls into separate variant calls
#2nd pipe, left-aligns indels and issues warnings when the REF base in your VCF does not match the base in the supplied FASTA reference genome
while read gzVCF
do
    /Programs/bcftools-1.3.1/bcftools norm -m-any "${gzVCF}" | /Programs/bcftools-1.3.1/bcftools norm -Ob --check-ref w -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta > "${gzVCF}".bcf ;

    /Programs/bcftools-1.3.1/bcftools index "${gzVCF}".bcf ;
done < /home/MyDir/gzVCF.list

Get a list of all bcf samples

find /home/MyDir/Raw/ -name "*.vcf.gz.bcf" | sort > /home/MyDir/BCF.list ;

Merge all BCFs

#command="/Programs/bcftools-1.3.1/bcftools merge -f PASS -Ov -m none " ; #just keep PASS variants

command="/Programs/bcftools-1.3.1/bcftools merge -Ov -m none" ;

#append each BCF path to the command line
while read BCF
do
    command="${command} ${BCF}" ;
done < /home/MyDir/BCF.list

command="${command} -o /home/MyDir/ProjectMerge.vcf" ;

#print the assembled command, then run it (relies on word splitting, so paths must not contain spaces)
echo "${command}" ;
${command} ;

bgzip -f /home/MyDir/ProjectMerge.vcf ;
tabix -f -p vcf /home/MyDir/ProjectMerge.vcf.gz ;
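
One thing that can trip up the merge step: bcftools merge stops with an error when the same sample name appears in more than one input file, unless --force-samples is given. A minimal sketch of a pre-merge check on the bgzipped VCFs follows. The helper names `vcf_samples` and `duplicate_samples` are hypothetical, and the two toy files are made up purely for the demonstration; in practice you would point `duplicate_samples` at /home/MyDir/gzVCF.list.

```shell
# Sketch only: list sample names that occur in more than one bgzipped VCF.
# vcf_samples and duplicate_samples are hypothetical helper names.

# Print the sample names (columns 10 onwards of the #CHROM line) of one .vcf.gz
vcf_samples() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Read a list of .vcf.gz paths (one per line) and print duplicated sample names
duplicate_samples() {
    while read -r gzVCF
    do
        vcf_samples "${gzVCF}"
    done < "$1" | sort | uniq -d
}

# Toy demonstration: two made-up files that share the sample name "S2"
demo=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' | gzip -c > "${demo}/a.vcf.gz"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS2\tS3\n' | gzip -c > "${demo}/b.vcf.gz"
printf '%s\n' "${demo}/a.vcf.gz" "${demo}/b.vcf.gz" > "${demo}/gzVCF.list"

dups=$(duplicate_samples "${demo}/gzVCF.list")
echo "Duplicated sample names: ${dups:-none}"
```

This works on bgzipped files because bgzip output can be read by plain gzip. If the list comes back non-empty, rename the samples or decide deliberately whether --force-samples is what you want.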

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

1

Thanks a lot, Kevin. I am grateful to you.

ADD REPLY • link 7.7 years ago by David_emir ▴ 500

0

So, is the final file "ProjectMerge.vcf" the reference panel? Before I go into a long process, do you have a small snippet of a reference panel to share? I would like to look at the data structure.

Thanks,

ADD REPLY • link 7.4 years ago by kirannbishwa01 ★ 1.6k

0

Yes, it is just a large VCF containing all variants called in the healthy controls or, to be more specific, in the samples that you choose as your reference panel for imputation. It looks like any standard multi-sample VCF. Sorry, I cannot show a screenshot right now as I'm on the wrong laptop.
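
As a rough illustration of that structure, a multi-sample VCF can be mocked up and inspected as below. Everything in it is made up (three hypothetical samples, two invented positions); it is not a snippet of a real panel, just the general layout: `##` meta lines, one `#CHROM` header line, then one row per variant with one genotype column per sample.

```shell
# Write a tiny, made-up multi-sample VCF (three hypothetical samples, two variants)
toy=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=1>\n'
    printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSampleA\tSampleB\tSampleC\n'
    printf '1\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\t1/1\n'
    printf '1\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1\t./.\n'
} > "${toy}"

# Show the column header and the records, skipping the ## meta-information lines
grep -v '^##' "${toy}"

# Count sample columns (everything after the 9 fixed columns) and variant records
n_samples=$(awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "${toy}")
n_records=$(grep -vc '^#' "${toy}")
echo "samples: ${n_samples}, records: ${n_records}"
```

The genotypes here are unphased (`0/1`). After phasing they would be written with a pipe instead (`0|1`).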

ADD REPLY • link 7.4 years ago by Kevin Blighe 89k

0

Oh, alright. Any other time is good.

ADD REPLY • link 7.4 years ago by kirannbishwa01 ★ 1.6k

0

Hi Kevin and Kirannbishwa01, I am a newbie in bioinformatics and am trying to create a reference panel following the steps here. The above steps work well except for the step involving the Perl scripts: "When you have your samples in a single VCF, convert them to GEN format for IMPUTE2 with this script (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#scripts)". Any code examples of how you proceeded would be helpful.

ADD REPLY • link 5.0 years ago by jmukisa90 ▴ 30

0

Hi, this thread is quite old, and the IMPUTE pages are even older. What I believe you should do is:

divide your VCF into different chromosomes (1 VCF per chromosome)
import each VCF to PLINK format
perform pre-phasing and imputation (start here: C: Phasing with SHAPEIT)
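
The first of these steps can be sketched with nothing more than gzip and awk. With an indexed file, `bcftools view -r` would be the more usual route; the awk version below is just a dependency-free illustration. The helper name `split_by_chrom`, the `chrom_` output prefix, and the toy input are all assumptions made for the example.

```shell
# Sketch only: split a gzipped/bgzipped VCF into one plain VCF per chromosome.
# split_by_chrom is a hypothetical helper; $1 = input .vcf.gz, $2 = output directory.
split_by_chrom() {
    mkdir -p "$2"
    gzip -dc "$1" | awk -F'\t' -v outdir="$2" '
        /^#/ { header = header $0 "\n"; next }      # collect all header lines
        {
            out = outdir "/chrom_" $1 ".vcf"
            if (!(out in seen)) {                    # first record for this chromosome
                printf "%s", header > out            # start the file with the header
                seen[out] = 1
            }
            print > out                              # append the variant record
        }'
}

# Toy demonstration with a made-up VCF covering two chromosomes
demo=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSampleA\n'
    printf '1\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n'
    printf '1\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\n'
    printf '2\t1500\t.\tG\tA\t.\tPASS\t.\tGT\t0/0\n'
} | gzip -c > "${demo}/toy.vcf.gz"

split_by_chrom "${demo}/toy.vcf.gz" "${demo}/by_chrom"
ls "${demo}/by_chrom"

# Each per-chromosome file could then be imported into PLINK, for example:
#   plink --vcf chrom_1.vcf --make-bed --out chrom_1
```

For a reference with many small contigs, this opens one output file per contig, so filtering to the chromosomes of interest first may be sensible.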

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

1

Thanks, Kevin, for the advice. I will try it out.

ADD REPLY • link 5.0 years ago by jmukisa90 ▴ 30

0

Hi Kevin!

I have 80 samples of GBS data. I have called variants through the GATK pipeline. Now I have to perform imputation.

Do I need to use these 80 samples for building a reference panel?

Thanks for your help!

ADD REPLY • link 15 months ago by analyst ▴ 70