My vcf file has many samples with no genotypes, i.e. all genotypes are missing ('./.'). How to remove such samples?
Thanks.
My vcf file has many samples with no genotypes, i.e. all genotypes are missing ('./.'). How to remove such samples?
Thanks.
This will give you the samples with all missing genotypes (not tested).
plink2 --vcf in.vcf --missing -out missing
awk ' $3 == 0 ' missing.smiss | tail -n +2 > missing_samples.txt
bcftools view -S ^missing_samples.txt in.vcf > out.vcf
First separate header and data
grep -vP '^#' input.vcf > data.vcf
grep -P '^#' input.vcf > header.vcf
Use R to remove . Open R terminal. Move to the direcotry in which data.vcf is there and run below
df = read.table("data.vcf", sep="\t")
# till format column
df1 = df[,1:9]
# sample columns
df2 = df[,10:ncol(df)]
# remove samples with ./.
df2 = df2[,-which(apply(df2,2,function(x){sum(x=="./.")==length(x)})), drop = F]
# rejoin with data till format
df = cbind(df1,df2)
# export
write.table(df, file = "temp2.vcf", col.names = F, row.names = F, quote = F,sep="\t", eol = "\n")
Rejoin
cat header.vcf > final.vcf
cat temp2.vcf >> final.vcf
Above solution will work if
not tested, a one-liner, use bcftools stats to count the number of called genotypes per samples, use those samples to select the called samples with bcftools view.
bcftools view --samples `bcftools stats -s '-' in.vcf.gz | awk '$1=="PSC" && (int($4)+int($5)+int($6) > 0) {print $3}' | paste -s -d',' ` in.vcf.gz
If you are a Python user, you may want to check out the pyvcf
submodule I wrote:
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['chr1', 'chr2'],
... 'POS': [100, 101],
... 'ID': ['.', '.'],
... 'REF': ['G', 'T'],
... 'ALT': ['A', 'C'],
... 'QUAL': ['.', '.'],
... 'FILTER': ['.', '.'],
... 'INFO': ['.', '.'],
... 'FORMAT': ['GT', 'GT'],
... 'Steven': ['0/1', '1/1'],
... 'Rachel': ['./.', './.']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data) # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Steven Rachel
0 chr1 100 . G A . . . GT 0/1 ./.
1 chr2 101 . T C . . . GT 1/1 ./.
>>> f = lambda r: r[9:].apply(pyvcf.gt_miss)
>>> s = vf.df.apply(f, axis=1).all()
>>> l = s[s == False].index.to_list()
>>> filtered_vf = vf.subset(l)
>>> #filtered_vf.to_file('filtered.vcf')
>>> filtered_vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0 chr1 100 . G A . . . GT 0/1
1 chr2 101 . T C . . . GT 1/1
FYI, your question motivated me to write the pyvcf.VcfFrame.cfilter_empty
method for the v0.11.0 version.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hi, I just gave this a try, but missing_samples.txt has nothing in it.
I ran
Looking at missing.smiss:
Which looks like it's not showing any values missing. However when I take a look at my vcf, there are definitely samples with missing genotypes (./.):