5.4 years ago
Will
I have a 10 GB .vcf file with several columns (pos, ref, alt, quality, etc.) plus one column for each patient.
I have another file containing only the IDs of the patients I want to analyze (a single column). How can I filter the first file so that the patient columns that don't appear in the second file are removed? I tried bcftools: bcftools view -S sample_file.txt file.vcf > filtered.vcf
but the result contains only 2 patient columns, while the total number of patients can be 300.
Thanks in advance.
Yes, that's how VCF files work :-)
Your bcftools command looks OK, can you please paste the output of the following commands as a reply to my comment:
head sample_file.txt | cat -te
# if this errors out, try cat -A instead of cat -te
bcftools query -l file.vcf
bcftools -v
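As a complement to the cat -te check above, here is a minimal sketch that counts hidden characters in a sample list using only standard tools. The demo file and the IDs P1, P2, P3 are made up for illustration; point the commands at your own sample_file.txt instead.

```shell
#!/bin/sh
set -eu

# Hypothetical demo list: P2 has a Windows line ending, P3 a trailing space.
demo=$(mktemp)
printf 'P1\nP2\r\nP3 \n' > "$demo"

# Count lines containing a carriage return (shown as ^M by cat -te).
cr_lines=$(grep -c "$(printf '\r')" "$demo" || true)

# Count lines ending in a space or tab.
ws_lines=$(grep -c '[[:blank:]]$' "$demo" || true)

echo "lines with carriage return: $cr_lines"
echo "lines with trailing blank:  $ws_lines"

# Write a cleaned copy: drop carriage returns and trailing blanks.
tr -d '\r' < "$demo" | sed 's/[[:blank:]]*$//' > "$demo.clean"
cat "$demo.clean"
```

If either count is non-zero, the IDs in the list will not match the names in the VCF header exactly, so using the cleaned copy with -S is worth trying.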
OK, the sample file contains the IDs, while the VCF file contains all the data. The command that generates the file.vcf gives the error "subset called for sample that does not exist in header", so I used --force-samples.
Command 1:
Command 2:
Command 3:
Command 1 given by RamRS should also give you a list of sample names. Instead, you are showing the header of the VCF file. Have you accidentally used the wrong file?
In the second file (sample_file.txt) I have only the names (IDs) of the patients; the commands were launched on the complete VCF, which has various columns of properties.
Please refer to specific files using the filenames you've used in the command instead of "first file", "second file" etc. Like finswimmer points out, if your command 1 (head sample_file.txt | cat -te) outputs the text you've shown above, it's not a list of samples, it's a VCF file. Please ensure that:
- sample_file.txt contains one sample identifier per line
- file.vcf is a VCF file
- the sample IDs in sample_file.txt overlap the sample IDs in the VCF.
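The overlap check in the last point can be sketched with standard tools alone. This is only an illustration: the tiny VCF, the IDs P1, P2, P3, P9 and the output names keep.txt and missing.txt are made up, and the header parsing assumes a plain, uncompressed VCF.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)

# Tiny stand-in for file.vcf with hypothetical sample IDs P1, P2, P3.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP1\tP2\tP3\n'
  printf '1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1\n'
} > "$workdir/file.vcf"

# Stand-in for sample_file.txt: P9 does not exist in the VCF.
printf 'P1\nP2\nP9\n' > "$workdir/sample_file.txt"

# Sample names are columns 10 onward of the #CHROM header line
# (the same names that bcftools query -l file.vcf lists).
grep -m1 '^#CHROM' "$workdir/file.vcf" | cut -f10- | tr '\t' '\n' \
  | sort -u > "$workdir/vcf_samples.txt"
sort -u "$workdir/sample_file.txt" > "$workdir/wanted.txt"

# comm needs sorted input: -23 keeps lines only in the first file,
# -12 keeps lines common to both.
comm -23 "$workdir/wanted.txt" "$workdir/vcf_samples.txt" > "$workdir/missing.txt"
comm -12 "$workdir/wanted.txt" "$workdir/vcf_samples.txt" > "$workdir/keep.txt"

echo "Requested IDs missing from the VCF:"
cat "$workdir/missing.txt"
echo "Requested IDs present in the VCF:"
cat "$workdir/keep.txt"
```

A list like keep.txt contains only names that exist in the header, so passing it to bcftools view -S should not trigger the "subset called for sample that does not exist in header" error, and missing.txt shows which IDs --force-samples was silently skipping.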