Filtered a .vcf file based on id of a second file
0
0
Entering edit mode
5.4 years ago
Will ▴ 20

I have a .vcf file of 10 GB with different columns (pos, ref, alt, quality,ecc..) and I have a column for each patient !!! I have another file containing only the id of the patients (only one column) that I want to analyze. How can I filter the 1° file deleting the columns that don't appear in the 2° file ??? I try to use bcftools: bcftools view -S sample_file.txt file.vcf > filtered.vcf, but the result contains only 2 columns and the total number of patients can be 300.

thanks in advance

SNP snp bcftools vcf • 2.9k views
ADD COMMENT
0
Entering edit mode

I have a column for each patient !!!

Yes, that's how VCF files work :-)

Your bcftools command looks OK, can you please paste the output of the following commands as a reply to my comment:

  1. head sample_file.txt | cat -te #if this errors out, try cat -A instead of cat -te
  2. bcftools query -l file.vcf
  3. bcftools -v
ADD REPLY
0
Entering edit mode

Ok, the sample file contains the id while the file contains all data: The command that generate the file.vcf give an error "subset called for sample that does not exist in header", so I used --force-samples command 1:

##fileformat=VCFv4.1$
##filedate=2019.6.21$
##source=Minimac3$
##contig=<ID=21>$
##FILTER=<ID=GENOTYPED,Description="Marker was genotyped AND imputed">$
##FILTER=<ID=GENOTYPED_ONLY,Description="Marker was genotyped but NOT imputed">$
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">$
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">$
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">$
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">$

command 2:

003_S_1057
003_S_0908
003_S_1122
136_S_0695
136_S_0873
130_S_0886
012_S_1133
003_S_1074
037_S_0501
027_S_0074
031_S_0351
051_S_1072
052_S_0951
016_S_1117
002_S_2010
128_S_2002
127_S_0622
014_S_0658
.. and so on, it's very very long

command 3:

bcftools 1.9-204-g6244ac9
Using htslib 1.9-247-g6eacc77
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
ADD REPLY
0
Entering edit mode

Command 1 given by RamRS should give you also a list of sample names. Instead you are showing the header of the vcf file. Have you take accidentally the wrong file?

ADD REPLY
0
Entering edit mode

In the second file I have only the name of the patients (id) -->sample_file.txt; the command are lunched on the complete vcf with different columns about properties.

ADD REPLY
0
Entering edit mode

Please refer to specific files using the filenames you've used in the command instead of "first file", "second file" etc. Like finswimmer points out, your if command-1 (head sample_file.txt | cat -te) outputs the text you've shown above, it's not a list of samples, it's a VCF file.

Please ensure that:

  1. sample_file.txt contains one sample identifier per line
  2. file.vcf is a VCF file
  3. The sample IDs in the sample_file.txt file overlap the sample IDs in the VCF.
ADD REPLY

Login before adding your answer.

Traffic: 2435 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6