Question

Filtered a .vcf file based on id of a second file

0

Entering edit mode

5.4 years ago

Will ▴ 20

I have a .vcf file of 10 GB with different columns (pos, ref, alt, quality,ecc..) and I have a column for each patient !!! I have another file containing only the id of the patients (only one column) that I want to analyze. How can I filter the 1° file deleting the columns that don't appear in the 2° file ??? I try to use bcftools: bcftools view -S sample_file.txt file.vcf > filtered.vcf, but the result contains only 2 columns and the total number of patients can be 300.

thanks in advance

SNP snp bcftools vcf • 2.9k views

ADD COMMENT • link 5.4 years ago by Will ▴ 20

0

Entering edit mode

I have a column for each patient !!!

Yes, that's how VCF files work :-)

Your bcftools command looks OK, can you please paste the output of the following commands as a reply to my comment:

head sample_file.txt | cat -te #if this errors out, try cat -A instead of cat -te
bcftools query -l file.vcf
bcftools -v

ADD REPLY • link 5.4 years ago by Ram 44k

0

Entering edit mode

Ok, the sample file contains the id while the file contains all data: The command that generate the file.vcf give an error "subset called for sample that does not exist in header", so I used --force-samples command 1:

##fileformat=VCFv4.1$
##filedate=2019.6.21$
##source=Minimac3$
##contig=<ID=21>$
##FILTER=<ID=GENOTYPED,Description="Marker was genotyped AND imputed">$
##FILTER=<ID=GENOTYPED_ONLY,Description="Marker was genotyped but NOT imputed">$
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">$
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">$
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">$
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">$

command 2:

003_S_1057
003_S_0908
003_S_1122
136_S_0695
136_S_0873
130_S_0886
012_S_1133
003_S_1074
037_S_0501
027_S_0074
031_S_0351
051_S_1072
052_S_0951
016_S_1117
002_S_2010
128_S_2002
127_S_0622
014_S_0658
.. and so on, it's very very long

command 3:

bcftools 1.9-204-g6244ac9
Using htslib 1.9-247-g6eacc77
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

ADD REPLY • link updated 5.4 years ago by finswimmer 16k • written 5.4 years ago by Will ▴ 20

0

Entering edit mode

Command 1 given by RamRS should give you also a list of sample names. Instead you are showing the header of the vcf file. Have you take accidentally the wrong file?

ADD REPLY • link 5.4 years ago by finswimmer 16k

0

Entering edit mode

In the second file I have only the name of the patients (id) -->sample_file.txt; the command are lunched on the complete vcf with different columns about properties.

ADD REPLY • link 5.4 years ago by Will ▴ 20

0

Entering edit mode

Please refer to specific files using the filenames you've used in the command instead of "first file", "second file" etc. Like finswimmer points out, your if command-1 (head sample_file.txt | cat -te) outputs the text you've shown above, it's not a list of samples, it's a VCF file.

Please ensure that:

sample_file.txt contains one sample identifier per line
file.vcf is a VCF file
The sample IDs in the sample_file.txt file overlap the sample IDs in the VCF.

ADD REPLY • link 5.4 years ago by Ram 44k