I have a multigenome (multi-sample) VCF file. Suppose the file has samples A to Z, and I want to extract only samples B to G into a smaller VCF file. How can I create such a subset VCF file?
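From the bcftools documentation (https://samtools.github.io/bcftools/bcftools.html#view), either
bcftools view -s samplelist input.vcf
or
bcftools view -S samplefile input.vcf
would do the job. Here samplelist is a comma-separated list of sample names (for example B,C,D,E,F,G), samplefile is a file with one sample name per line, and input.vcf stands for your own file. Docs are your friends ;)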
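As a minimal sketch of the -S route for the B to G case in the question: the file names samples.txt, input.vcf.gz and subset.vcf.gz are placeholders chosen for illustration, not names from this thread, and the bcftools step only runs if bcftools and the input file are actually present.

```shell
# Write the wanted sample names, one per line (placeholder names from the question).
printf '%s\n' B C D E F G > samples.txt

# Subset only if bcftools is installed and the (hypothetical) input file exists.
if command -v bcftools >/dev/null 2>&1 && [ -f input.vcf.gz ]; then
    # -S reads sample names from a file; -Oz writes bgzip-compressed VCF to the -o path.
    bcftools view -S samples.txt -Oz -o subset.vcf.gz input.vcf.gz
    # List the sample names in the result to confirm the subset.
    bcftools query -l subset.vcf.gz
fi
```

For a short list, the same thing can be done without a file by passing the names directly with -s B,C,D,E,F,G.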
I wrote this bash loop to process several files (one per chromosome, or any set of VCF files), using the vcf-subset tool from VCFtools to extract the subset. Here, sample.txt lists the samples to keep, one per line. With this method the parent VCF files do not need to be bgzipped or tabix-indexed, but it is a bit slower.
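for i in /path/dir/*.vcf; do
    # Strip the directory and the .vcf extension so the output is named output_<name>.vcf.gz
    name=$(basename "$i" .vcf)
    vcf-subset -c sample.txt "$i" | bgzip -c > /get/inthis/dir/output_"$name".vcf.gz
done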
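For intuition about what sample subsetting does to the file, here is a rough awk sketch that keeps the nine fixed VCF columns plus the columns of the listed samples. The function name subset_vcf and the demo file names are made up for illustration. It assumes an uncompressed, tab-delimited VCF and a non-empty sample list, and unlike the dedicated tools it does not update INFO fields such as AC or AN, so treat it as a way to inspect the idea rather than a replacement for them.

```shell
# subset_vcf SAMPLE_FILE VCF: print VCF with only the samples listed in SAMPLE_FILE.
subset_vcf() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { if ($1 != "") want[$1] = 1; next }  # first file: wanted sample names
        /^##/     { print; next }                       # meta lines pass through unchanged
        /^#CHROM/ {                                     # header: record which columns to keep
            n = 0
            for (i = 10; i <= NF; i++) if ($i in want) keep[++n] = i
        }
        {                                               # header and data lines: rebuild the row
            line = $1
            for (i = 2; i <= 9 && i <= NF; i++) line = line OFS $i
            for (j = 1; j <= n; j++) line = line OFS $(keep[j])
            print line
        }
    ' "$1" "$2"
}

# Tiny demo input with three samples A, B, C and one variant.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB\tC\n1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\n' > demo.vcf
printf 'B\nC\n' > demo_samples.txt

# Keep only samples B and C.
subset_vcf demo_samples.txt demo.vcf
```

Samples come out in the order they appear in the VCF header, not in the order of the sample list.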
GATK SelectVariants (https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php) with the option
--exclude_sample_file (file)
to drop the samples listed in the file, or
--sample_file (file)
to keep only the samples listed in the file.
@Jorge Amigo's answer in this thread is recent and may help: How To Split Multiple Samples In Vcf File Generated By Gatk?
@genomax2 Thanks, but that only explains how to extract one individual sample per file. Is there a way to give a list of the samples I want to extract (for example, samples B, C, D, E, F and G) and get a subset file with only these samples?
I don't know how to do it in VCF format, but you can convert to plink format (plink --double-id --vcf your.vcf --recode --make-bed --out your_plink), then select the individuals you want from the generated .fam file (family ID and individual ID, one individual per line) and extract them with (plink --bfile your_plink --keep list_of_individuals --recode --out your_output). Then you can convert back to VCF if you wish, for example by using --recode vcf in the last step :x
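A rough sketch of the plink route for samples B to G, assuming PLINK 1.9 and the placeholder file names used above (your.vcf, your_plink, list_of_individuals, your_output); sample.txt is a hypothetical list with one sample name per line. Since --keep expects a family ID and an individual ID on each line, and --double-id sets both to the VCF sample name, the keep file simply repeats each name twice. The plink steps only run if plink and the input file are present.

```shell
# One wanted sample name per line (placeholder names from the question).
printf '%s\n' B C D E F G > sample.txt

# Build the --keep file: family ID and individual ID, both equal to the sample name.
awk '{ print $1, $1 }' sample.txt > list_of_individuals

if command -v plink >/dev/null 2>&1 && [ -f your.vcf ]; then
    # VCF -> binary plink fileset (your_plink.bed/.bim/.fam).
    plink --double-id --vcf your.vcf --make-bed --out your_plink
    # Keep only the listed individuals and write the result back out as VCF.
    plink --bfile your_plink --keep list_of_individuals --recode vcf --out your_output
fi
```

The binary plink format stores genotype calls only, so other per-sample FORMAT fields in the original VCF do not survive this round trip.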