Question

Number of unique variants in a VCF using the Linux terminal

3.6 years ago

HL ▴ 10

I have been searching, without success, for a way to count the unique variants in a very large VCF file from the Linux terminal. I would also like to write the unique variants to a separate file, without any duplicates.

variants unique VCF linux • 3.0k views

ADD COMMENT • link updated 3.5 years ago by Ram 44k • written 3.6 years ago by HL ▴ 10

What do you mean by 'unique variants'?

ADD REPLY • link 3.6 years ago by 4galaxy77 2.9k

I need to find variants that have the same CHROM, POS, REF and ALT, and then remove the duplicates so that each of these variants appears only once in the new file. I also need to know the number of duplicated variants.

For example, for this file I need to know that the total number of variants is 4 and the number of distinct variants is 3.

CHROM  POS   REF  ALT
1      1234  A    T
1      1234  A    T
1      3457  T    C
1      5468  G    C

Then I need to write the distinct variants to their own file, like this.

CHROM  POS   REF  ALT
1      1234  A    T
1      3457  T    C
1      5468  G    C
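
The counting described above can be sketched in a few lines of Python. This is only an illustration on the four-column example table, not a full VCF parser: the function name summarize_variants is made up for this sketch, and it assumes whitespace-separated rows with a "CHROM" header line.

```python
def summarize_variants(lines):
    """Return (total, distinct) for whitespace-separated CHROM POS REF ALT rows.

    Hypothetical helper for illustration; assumes the header row starts
    with "CHROM" and every data row has at least four fields.
    """
    seen = set()
    distinct = []
    total = 0
    for line in lines:
        fields = line.split()
        # Skip blank lines and the column header row.
        if not fields or fields[0] == "CHROM":
            continue
        total += 1
        key = tuple(fields[:4])
        if key not in seen:
            # First time this CHROM/POS/REF/ALT combination is seen.
            seen.add(key)
            distinct.append(key)
    return total, distinct


example = """CHROM POS REF ALT
1 1234 A T
1 1234 A T
1 3457 T C
1 5468 G C
""".splitlines()

total, distinct = summarize_variants(example)
print(total, len(distinct))  # 4 3
```

The distinct list keeps the rows in their original order, so it can be written straight to a new file.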

ADD REPLY • link updated 3.5 years ago by Ram 44k • written 3.6 years ago by HL ▴ 10

Maybe try bcftools:

bcftools query -f '%ID\n' myFile.vcf

ADD REPLY • link 3.6 years ago by zx8754 12k

Thanks, this prints all the IDs, but I need only the variants that occur two or more times with the same CHROM, POS, REF and ALT.

ADD REPLY • link 3.6 years ago by HL ▴ 10

HL : Please do not delete posts that have received comments/answers.

ADD REPLY • link 3.5 years ago by GenoMax 148k

score 0 · Answer 1 · 2021-06-20

Check out the pyvcf.VcfFrame.drop_duplicates method I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['1', '1', '1', '1'],
...     'POS': [1234, 1234, 3457, 5468],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'T', 'G'],
...     'ALT': ['T', 'T', 'C', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', './.', '0/1', '0/0'],
...     'B': ['./.', '0/1', '0/0', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
  CHROM   POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0     1  1234  .   A   T    .      .    .     GT  0/1  ./.
1     1  1234  .   A   T    .      .    .     GT  ./.  0/1
2     1  3457  .   T   C    .      .    .     GT  0/1  0/0
3     1  5468  .   G   C    .      .    .     GT  0/0  0/1
>>> filtered_vf = vf.drop_duplicates(['CHROM', 'POS', 'REF', 'ALT'])
>>> filtered_vf.df
  CHROM   POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0     1  1234  .   A   T    .      .    .     GT  0/1  ./.
1     1  3457  .   T   C    .      .    .     GT  0/1  0/0
2     1  5468  .   G   C    .      .    .     GT  0/0  0/1
>>> filtered_vf.to_file('out.vcf')

score 0 · Answer 2 · 2021-06-22

3.6 years ago

mbk0asis ▴ 700

Why don't you just paste(?) the columns together into single strings and use the 'uniq' command?

CHROM POS REF ALT    
1 1234 A T
1 1234 A T    
1 3457 T C    
1 5468 G C

Something like this?

sed 's/\t/_/g' FILE | sort | uniq -c

  2 1_1234_A_T
  1 1_3457_T_C
  1 1_5468_G_C
  1 CHROM_POS_REF_ALT

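The number in front of each line is how many times that variant occurs, so any count above 1 marks a duplicate. Note that the header line is counted too.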
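
For comparison, roughly the same tally can be done in Python with collections.Counter. This is a sketch under a few assumptions: the input is tab-separated with CHROM, POS, REF and ALT as the first four columns (as in the example table, not the full VCF column order), and the function name duplicate_counts is hypothetical. Unlike the sed pipeline, it skips the header row.

```python
from collections import Counter


def duplicate_counts(lines):
    """Count occurrences of each CHROM_POS_REF_ALT key in tab-separated rows.

    Hypothetical helper; assumes the first four columns are
    CHROM, POS, REF, ALT, as in the example table.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comment lines, short or blank lines, and the header row.
        if line.startswith("#") or len(fields) < 4 or fields[0] == "CHROM":
            continue
        counts["_".join(fields[:4])] += 1
    return counts


rows = [
    "CHROM\tPOS\tREF\tALT",
    "1\t1234\tA\tT",
    "1\t1234\tA\tT",
    "1\t3457\tT\tC",
    "1\t5468\tG\tC",
]

counts = duplicate_counts(rows)
# Keep only the variants that occur two or more times.
duplicated = {key: n for key, n in counts.items() if n > 1}
print(duplicated)  # {'1_1234_A_T': 2}
```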

ADD COMMENT • link 3.6 years ago by mbk0asis ▴ 700

I don't think the paste is required; uniq should work fine without it too.

ADD REPLY • link 3.5 years ago by Ram 44k

Thanks, this worked, and I also got the unique ones with this:

bcftools query -f '%CHROM %POS %REF %ALT\n' FILE > UNIQFILE

After this I just had to add some headers to UNIQFILE to make it a VCF file again. I also have to find a way to get all of the information for these variants (for example SAMPLE, QUAL...), but I think I can get it with bedtools intersect.

ADD REPLY • link 3.5 years ago by HL ▴ 10

That doesn't make any sense, HL. You're not performing any uniquification operations. There should at least be a uniq before you redirect the output to UNIQFILE.

You either need to apply uniq directly to the data lines in the VCF file or you need to apply it to a subset of the columns. What you're doing here will just recreate the file after taking a few detours that do a bunch of stuff serving no purpose.
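
One way to apply the deduplication to a subset of the columns while keeping the whole record is sketched below in Python. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, ...), passes header lines through untouched, and keeps only the first record for each key, so QUAL, INFO and sample values from later duplicates are dropped. The function name dedup_vcf_lines is hypothetical, and the set of seen keys is held in memory, which may matter for a very large file.

```python
def dedup_vcf_lines(lines):
    """Yield VCF lines, keeping the first record per CHROM/POS/REF/ALT.

    Hypothetical helper; assumes tab-separated records in standard VCF
    column order: CHROM, POS, ID, REF, ALT, ...
    """
    seen = set()
    for line in lines:
        # Header lines (## metadata and the #CHROM row) pass through unchanged.
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Blank or malformed line: skip it in this sketch.
            continue
        key = (fields[0], fields[1], fields[3], fields[4])
        if key not in seen:
            seen.add(key)
            yield line


vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1234\t.\tA\tT\t50\tPASS\t.\n",
    "1\t1234\t.\tA\tT\t40\tPASS\t.\n",
    "1\t3457\t.\tT\tC\t60\tPASS\t.\n",
    "1\t5468\t.\tG\tC\t70\tPASS\t.\n",
]

kept = list(dedup_vcf_lines(vcf))
print(len(kept))  # 5 (2 header lines + 3 distinct records)
```

Because the function works on any iterable of lines, it could be fed an open file handle and the result written with writelines, so the output stays a valid VCF with all columns intact.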

ADD REPLY • link 3.5 years ago by Ram 44k