How to remove samples with no genotypes from vcf
5
0
Entering edit mode
3.5 years ago
moxu ▴ 510

My vcf file has many samples with no genotypes, i.e. all genotypes are missing ('./.'). How to remove such samples?

Thanks.

VCF • 5.0k views
ADD COMMENT
1
Entering edit mode
3.5 years ago
4galaxy77 2.9k

This will give you the samples with all missing genotypes (not tested).

plink2 --vcf in.vcf --missing -out missing
awk ' $3 == 0 ' missing.smiss | tail -n +2 > missing_samples.txt
bcftools view -S ^missing_samples.txt in.vcf > out.vcf
ADD COMMENT
0
Entering edit mode

Hi, I just gave this a try, but missing_samples.txt has nothing in it.

I ran

plink2 --vcf Annot_Filtered.vcf.gz --missing -out missing 
awk ' $3 == 0 ' missing.smiss | tail -n +2 > missing_samples.txt
bcftools view -S ^missing_samples.txt in.vcf > out.vcf

Looking at missing.smiss:

head missining.smiss
#IID    MISSING_CT      OBS_CT  F_MISS
S001    301     16195   0.018586
S002    356     16195   0.0219821
S003    286     16195   0.0176598
S004    290     16195   0.0179068
S005    654     16195   0.0403828
S006    473     16195   0.0292065
S007    661     16195   0.0408151
S008    820     16195   0.0506329
S009    417     16195   0.0257487

Which looks like it's not showing any values missing. However when I take a look at my vcf, there are definitely samples with missing genotypes (./.): enter image description here

ADD REPLY
1
Entering edit mode
3.5 years ago
pbpanigrahi ▴ 430

First separate header and data

grep -vP  '^#' input.vcf > data.vcf
grep -P  '^#' input.vcf > header.vcf

Use R to remove . Open R terminal. Move to the direcotry in which data.vcf is there and run below

df = read.table("data.vcf", sep="\t")

# till format column
df1 = df[,1:9]
# sample columns
df2 = df[,10:ncol(df)]
# remove samples with ./.
df2 = df2[,-which(apply(df2,2,function(x){sum(x=="./.")==length(x)})), drop = F]
# rejoin with data till format
df = cbind(df1,df2)
# export
write.table(df, file = "temp2.vcf", col.names = F, row.names = F, quote = F,sep="\t", eol = "\n")

Rejoin

cat header.vcf > final.vcf
cat temp2.vcf >> final.vcf

Above solution will work if

  • Multiple sample present in vcf file
  • At least one sample have non ./. entry
ADD COMMENT
0
Entering edit mode
3.5 years ago

not tested, a one-liner, use bcftools stats to count the number of called genotypes per samples, use those samples to select the called samples with bcftools view.

bcftools view --samples `bcftools stats -s '-' in.vcf.gz | awk '$1=="PSC" && (int($4)+int($5)+int($6) > 0) {print $3}' | paste -s -d',' ` in.vcf.gz
ADD COMMENT
0
Entering edit mode

Hi Pierre, I just tested this myself and received the error:

bcftools/1.16/bin/bcftools: Argument list too long
ADD REPLY
0
Entering edit mode
3.5 years ago
sbstevenlee ▴ 480

If you are a Python user, you may want to check out the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'Steven': ['0/1', '1/1'],
...     'Rachel': ['./.', './.']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data) # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Rachel
0  chr1  100  .   G   A    .      .    .     GT    0/1    ./.
1  chr2  101  .   T   C    .      .    .     GT    1/1    ./.
>>> f = lambda r: r[9:].apply(pyvcf.gt_miss)
>>> s = vf.df.apply(f, axis=1).all()
>>> l = s[s == False].index.to_list()
>>> filtered_vf = vf.subset(l)
>>> #filtered_vf.to_file('filtered.vcf')
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1
ADD COMMENT
0
Entering edit mode
3.5 years ago
sbstevenlee ▴ 480

FYI, your question motivated me to write the pyvcf.VcfFrame.cfilter_empty method for the v0.11.0 version.

ADD COMMENT

Login before adding your answer.

Traffic: 1880 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6