How to remove duplicate sites and inconsistent sites from a VCF?
6.0 years ago
Sharon ▴ 610

Hi Everyone

I am trying to use the Michigan Imputation Server. I run checkVCF first to avoid failures on the server.

python checkVCF.py -r checkVCF/hs37d5.fa -o test chr3.vcf

It reported some duplicated sites and inconsistent reference sites:

> checkVCF.py -- check validity of VCF file for meta-analysis
> version 1.4 (20140115)
> contact zhanxw@umich.edu or dajiang@umich.edu for problems.
> Python version is [ 2.7.5.final.0 ]
> Begin checking vcfFile [ chr3.vcf ]
> Duplicated site [ 3:14187449 ]
> Duplicated site [ 3:21307401 ]
> Duplicated site [ 3:38608045 ]
> Duplicated site [ 3:39146429 ]
> Duplicated site [ 3:41912651 ]
> [ 10000 ] lines processed
> Duplicated site [ 3:48618728 ]
> Duplicated site [ 3:79399575 ]
> Duplicated site [ 3:95176677 ]
> Duplicated site [ 3:96472739 ]
> Duplicated site [ 3:99067458 ]
> [ 20000 ] lines processed
> Duplicated site [ 3:113876275 ]
> Duplicated site [ 3:120522716 ]
> Duplicated site [ 3:121633904 ]
> Duplicated site [ 3:128622922 ]
> [ 30000 ] lines processed
> Duplicated site [ 3:171926373 ]
> Duplicated site [ 3:183371250 ]
> ---------------     REPORT     ---------------
> Total [ 37146 ] lines processed
> Examine [ 33 ] VCF header lines, [ 37113 ] variant sites, [ 378 ] samples
> [ 16 ] duplicated sites
> [ 0 ] NonSNP site are outputted to [ test.check.nonSnp ]
> [ 6995 ] Inconsistent reference sites are outputted to [ test.check.ref ]
> [ 0 ] Variant sites with invalid genotypes are outputted to [ test.check.geno ]
> [ 0 ] Alternative allele frequency > 0.5 sites are outputted to [ test.check.af ]
> [ 0 ] Monomorphic sites are outputted to [ test.check.mono ]
> ---------------     ACTION ITEM     ---------------
> * Remove duplicated sites and rerun checkVCF.py
> * Read test.check.ref, for autosomal sites, make sure the you are using the forward strand
> * Upload these files to the ftp server (so we can double check): test.check.log test.check.dup test.check.noSnp test.check.ref test.check.geno test.check.af test.check.mono

How can I remove these duplicated sites and inconsistent reference sites?

I tried the following, but it seems to exclude duplicate variants rather than duplicate sites:

plink --bfile snps_filtered --list-duplicate-vars ids-only suppress-first
plink --bfile snps_filtered --exclude plink.dupvar --make-bed --out snps.DuplicatesRemoved 
plink --bfile snps_filtered --recode vcf  --snps-only just-acgt  --out snps.final

A link to the relevant part of the PLINK documentation would be fine too.
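For illustration, here is a minimal sketch of what removing duplicated *sites* could look like outside PLINK. As far as I understand it, PLINK 1.9's `--list-duplicate-vars` groups variants that share both position and allele codes, so two records at the same position with different alleles would not be listed, whereas checkVCF appears to flag any repeated CHROM:POS. Note also that the third command as posted reads `snps_filtered` again rather than `snps.DuplicatesRemoved`. The function name, the keep-first policy, and the `rsA`/`rsB`/`rsC` IDs below are assumptions for the example, not anything checkVCF or PLINK prescribes.

```python
import io


def drop_duplicate_sites(lines):
    """Yield VCF lines, keeping only the first record seen at each CHROM:POS."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            # Header lines are passed through unchanged.
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        site = (chrom, pos)
        if site in seen:
            continue  # a later record at an already-seen site is dropped
        seen.add(site)
        yield line


# Tiny in-memory VCF with hypothetical IDs; two records share 3:14187449.
demo_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "3\t14187449\trsA\tA\tG\t.\t.\t.\n"
    "3\t14187449\trsB\tA\tT\t.\t.\t.\n"
    "3\t21307401\trsC\tC\tT\t.\t.\t.\n"
)
deduped = list(drop_duplicate_sites(io.StringIO(demo_vcf)))
print("".join(deduped), end="")
```

On a real file the same generator could be fed an open file handle for `chr3.vcf` and its output written to a new file. Keeping the first record is only one possible policy; dropping every record at a duplicated site may be the safer choice if it is unclear which copy is correct.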

Thanks

vcf plink michigan impute server • 2.4k views
6.0 years ago

I note that you are comparing to hs37d5.fa, but to which genome was the original sample aligned? It would likely be recorded in your VCF header.
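As a rough sketch of how to look for that information, the snippet below pulls the `##reference`, `##assembly` and `##contig` meta-information lines out of a VCF header. Whether any of them are present depends on the tool that wrote the file. The function name and the demo header (including the placeholder path) are made up for the example.

```python
import io


def reference_header_lines(lines):
    """Collect ##reference, ##assembly and ##contig header lines from a VCF."""
    wanted = ("##reference", "##assembly", "##contig")
    found = []
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        if line.startswith(wanted):
            found.append(line.rstrip("\n"))
    return found


# Illustrative header only; a real file would be opened from disk instead.
demo_header = (
    "##fileformat=VCFv4.1\n"
    "##reference=file:///path/to/reference.fa\n"
    "##contig=<ID=3,length=198022430,assembly=b37>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
header_info = reference_header_lines(io.StringIO(demo_header))
for entry in header_info:
    print(entry)
```

If the header names a different build from the FASTA passed to `checkVCF.py -r`, that alone could account for a large number of inconsistent reference sites.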

Good catch. I should use GRCh37; I will check whether that removes the duplicates. Thanks a lot, Kevin. Always helpful.
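One way to check whether a candidate build fits is to compare each REF allele with the reference sequence at that position, which also separates likely strand flips from genuine mismatches. The sketch below assumes the reference is already loaded as a dictionary of chromosome name to sequence string; the toy sequence, the function name and the three labels are all invented for the example, and a real run would read the sequences from the FASTA file.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_ref(chrom, pos, ref, genome):
    """Compare a VCF REF allele with the reference at a 1-based position."""
    seq = genome[chrom]
    observed = seq[pos - 1 : pos - 1 + len(ref)].upper()
    ref = ref.upper()
    if observed == ref:
        return "match"
    # Reverse complement of REF matching suggests the reverse strand was used.
    if observed == ref.translate(COMPLEMENT)[::-1]:
        return "strand_flip"
    return "mismatch"


# Toy stand-in for a reference genome (not real chromosome 3 sequence).
genome = {"3": "ACGTTGCA"}
results = [
    classify_ref("3", 1, "A", genome),  # reference base is A
    classify_ref("3", 2, "G", genome),  # reference base is C, complement of G
    classify_ref("3", 3, "A", genome),  # reference base is G
]
print(results)
```

Mostly `strand_flip` results would point to the forward-strand issue named in the checkVCF action items, while scattered `mismatch` results would point more towards a wrong build or wrong coordinates.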
