Removing homozygous reference genotypes from multi-sample vcf file
3
4
Entering edit mode
5.5 years ago
seta ★ 1.9k

Hi everybody,

I want to remove all sites that are homozygous reference (0/0) per each row in the vcf file and get the output with vcf format. So, I simply used zgrep -v "0[/|]0" < file1.gz.vcf > output.vcf. However, it sounds that many variants is removed. Just for make sure, please kindly let me know if I did right?

Thanks

genotype removing vcf • 7.8k views
ADD COMMENT
3
Entering edit mode
5.5 years ago

I strongly recommend using existing tools for filtering vcf files.

With bcftools it works like this:

bcftools view -i 'GT[*]="alt"' input.vcf

This will give you any site where at least one sample has an alternate allel.

ADD COMMENT
0
Entering edit mode

as far as I undersand OP don't want ANY HOM_REF sample in the row.

ADD REPLY
1
Entering edit mode

Hmm, yes might be.

In this case:

bcftools view -e 'GT[*]="RR"' input.vcf
ADD REPLY
0
Entering edit mode

Right, I just hetero and homo-var NOT homo-ref per row. I before used the bcftools view -e 'GT[*]="RR"' file.vcf, but sounds worked wrong as it just kept 530 variants from 2134618 variants. Our admin must install Pierre's tool on the cluster to test it.

ADD REPLY
1
Entering edit mode

For me this does not sound surprisingly.

If you have no sample within 1200 samples, which is hom REF on a given position, this means, that what is called ref, is very rare. Of course this is possible, but I would guess it is rare.

ADD REPLY
1
Entering edit mode

Our admin must install Pierre's tool on the cluster to test it

As long as you have java installed on the cluster you should be able to run Pierre's program from your own directory.

ADD REPLY
1
Entering edit mode
5.5 years ago

bcftools view --min-ac 1 input.vcf

using vcffilterjdk http://lindenb.github.io/jvarkit/VcfFilterJdk.html

java -jar src/jvarkit-git/dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().noneMatch(G->G.isHomRef() || G.isNoCall());'  input.vcf.gz
ADD COMMENT
0
Entering edit mode

Sorry, I forgot to mention the vcf file is multisample with about 1200 individuals. So, the above command didn't work. How I can get just hom-var and heterozygous genotypes per each row (variant)?

ADD REPLY
0
Entering edit mode

I updated my answer.

ADD REPLY
1
Entering edit mode
5.5 years ago
Dave Carlson ★ 2.0k

I like to use SnpSift for pulling out specific genotype combinations. In your case, I think you could do the following:

zcat file1.gz.vcf | java -jar SnpSift.jar filter "isHom( GEN[0] )  & isRef( GEN[0] ) " --inverse > output.vcf

This will pull out all lines in the vcf that are not homozygous reference. If you're trying to do this on a specific sample, just adjust "GEN[0]" to reference the sample you're looking for (e.g., "GEN[1]" for the second sample, "GEN[2]" for the third sample, and so on).

ADD COMMENT

Login before adding your answer.

Traffic: 1629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6