Question

Keeping only common variants in the merged VCF file

0

6.2 years ago

seta ★ 1.9k

Hi all,

After merging my VCF file, which contains specific variants, with the variants in the 1000 Genomes VCF, the ID column of the merged VCF file looks like this:

chr1:39440410:SG

rs6722104

rs60323161;chr1:39244787:SG

Only entries like rs60323161;chr1:39244787:SG are common variants, that is, present in both files. Could you please let me know how I can keep only the common variants in the merged VCF file?

I used bcftools view -T to keep just the common variants, but it didn't work well: variants like the ones below still exist in the file, and the chr1:39448418:SG line should have been removed.

rs3118014;chr1:39448418:SG

chr1:39448418:SG

I also tested grep -Fwvf and grep -vf to remove those variants, but neither of them worked well. Could you please share your solution?

Thanks

VCF merge bcftools • 3.2k views

ADD COMMENT • link updated 6.2 years ago by husensofteng ▴ 410 • written 6.2 years ago by seta ★ 1.9k

score 1 · Answer 1 · 2019-05-25

1

6.2 years ago

husensofteng ▴ 410

I am not sure if I understand the question correctly, but it sounds like a line-filtering issue to me. So:

awk '$1~"#" || ($3~"rs" && $3~"chr")' inputfile > outputfile

This keeps only lines that start with # (header lines) or that have both an rs ID and chr info in the third column (ID) of the file.
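If a looser substring match ever turns out to be too permissive, a slightly stricter variant of the same idea could split the ID column on `;` and require one token that looks like an rs ID plus one that starts with `chr`. This is only a sketch: the file names `merged_example.vcf` and `common_example.vcf` are hypothetical, the IDs are copied from the question, and every other field in the toy records is made up.

```shell
# Build a tiny tab-separated example file.
# IDs come from the question; positions, alleles and other fields are invented.
printf '##fileformat=VCFv4.2\n' > merged_example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> merged_example.vcf
printf 'chr1\t39440410\tchr1:39440410:SG\tA\tG\t.\t.\t.\n' >> merged_example.vcf
printf 'chr2\t100\trs6722104\tC\tT\t.\t.\t.\n' >> merged_example.vcf
printf 'chr1\t39244787\trs60323161;chr1:39244787:SG\tG\tA\t.\t.\t.\n' >> merged_example.vcf

awk -F'\t' '
  /^#/ { print; next }                       # always keep header lines
  {
    n = split($3, ids, ";")                  # ID column may hold several IDs
    has_rs = 0; has_chr = 0
    for (i = 1; i <= n; i++) {
      if (ids[i] ~ /^rs[0-9]+$/) has_rs = 1  # token that is an rs ID
      if (ids[i] ~ /^chr/)       has_chr = 1 # token that is a chr-style ID
    }
    if (has_rs && has_chr) print             # keep only records carrying both
  }
' merged_example.vcf > common_example.vcf
```

On the toy input, only the record with the combined ID should remain after the two header lines. The `-F'\t'` assumes a standard tab-delimited VCF.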

ADD COMMENT • link 6.2 years ago by husensofteng ▴ 410

0

Many thanks for your nice solution.

ADD REPLY • link 6.2 years ago by seta ★ 1.9k

score 0 · Answer 2 · 2019-05-25

0

6.2 years ago

harold.smith.tarheel ★ 5.0k

Two options:

1) use BEDtools 'intersect' for the two original VCFs.

2) use VCFtools 'vcf-annotate' to add the 1000 Genomes rs numbers, then 'grep' to keep the variants that were annotated as such.
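The idea behind option 1, intersecting the two original VCFs instead of filtering the merged one, can be roughed out in plain awk when BEDtools is not at hand. This is a stand-in for illustration, not BEDtools itself: it matches records on CHROM, POS, REF and ALT, and it assumes both files use the same chromosome naming and allele representation. The file names and all records below are hypothetical.

```shell
# Two toy inputs; every record here is made up for illustration.
printf '#CHROM\tPOS\tID\tREF\tALT\n' > mine_example.vcf
printf 'chr1\t100\tchr1:100:SG\tA\tG\n' >> mine_example.vcf
printf 'chr1\t200\tchr1:200:SG\tC\tT\n' >> mine_example.vcf

printf '#CHROM\tPOS\tID\tREF\tALT\n' > kg_example.vcf
printf 'chr1\t200\trs111\tC\tT\n' >> kg_example.vcf
printf 'chr1\t300\trs222\tG\tA\n' >> kg_example.vcf

awk -F'\t' '
  # First file (reference set): remember each variant key.
  FNR == NR {
    if ($0 !~ /^#/) seen[$1 FS $2 FS $4 FS $5] = 1
    next
  }
  # Second file (own variants): keep headers and records seen in the first file.
  /^#/ { print; next }
  {
    key = $1 FS $2 FS $4 FS $5
    if (key in seen) print
  }
' kg_example.vcf mine_example.vcf > shared_example.vcf
```

With BEDtools installed, something along the lines of `bedtools intersect -a mine.vcf -b 1000g.vcf -u -header` should give a comparable result, though that matches on overlapping coordinates and not on alleles.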

ADD COMMENT • link 6.2 years ago by harold.smith.tarheel ★ 5.0k