Amount of unique variants in VCF with linux terminal
2
0
Entering edit mode
3.5 years ago
HL ▴ 10

I have been trying to search how could I get the amount of unique variants from a very big VCF file with linux terminal, without any results. Also I would like to get the unique variants to a different file without any duplicates.

variants unique VCF linux • 3.0k views
ADD COMMENT
1
Entering edit mode

What do you mean by 'unique variants'?

ADD REPLY
0
Entering edit mode

I would need to find variants that has same CHROM POS REF and ALT. And then remove the duplicates so that I would get these variants only once to the new file. Also I would need to know the amount of duplicated variants.

For example this file I need to know that the total amount of variants is 4 and amount of different variants is 3.

CHROM   POS  REF  ALT
1                1234    A        T
1                1234    A        T
1                3457    T        C
1                5468    G        C

And then I need to get the different variants to their own file like this.

CHROM   POS  REF  ALT
1                1234    A        T
1                3457    T        C
1                5468    G        C
ADD REPLY
0
Entering edit mode

Maybe try bcftools:

bcftools query -f '%ID\n' myFile.vcf
ADD REPLY
0
Entering edit mode

Thanks, this print all the ID:s but I would need only the ones that has two or more same CHROM POS REF and ALT

ADD REPLY
0
Entering edit mode

HL : Please do not delete posts that have received comments/answers.

ADD REPLY
0
Entering edit mode
3.5 years ago
sbstevenlee ▴ 480

Check out the pyvcf.VcfFrame.drop_duplicates method I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['1', '1', '1', '1'],
...     'POS': [1234, 1234, 3457, 5468],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'T', 'G'],
...     'ALT': ['T', 'T', 'C', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', './.', '0/1', '0/0'],
...     'B': ['./.', '0/1', '0/0', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
  CHROM   POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0     1  1234  .   A   T    .      .    .     GT  0/1  ./.
1     1  1234  .   A   T    .      .    .     GT  ./.  0/1
2     1  3457  .   T   C    .      .    .     GT  0/1  0/0
3     1  5468  .   G   C    .      .    .     GT  0/0  0/1
>>> filtered_vf = vf.drop_duplicates(['CHROM', 'POS', 'REF', 'ALT'])
>>> filtered_vf.df
  CHROM   POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0     1  1234  .   A   T    .      .    .     GT  0/1  ./.
1     1  3457  .   T   C    .      .    .     GT  0/1  0/0
2     1  5468  .   G   C    .      .    .     GT  0/0  0/1
>>> filtered_vf.to_file('out.vcf')
ADD COMMENT
0
Entering edit mode

The above operation seems to lose information - the GT for sample B at locus 1:1234 is lost after the drop operation. You may want to pick called GTs over uncalled GTs when processing duplicates (this won't be an easy operation)

ADD REPLY
1
Entering edit mode

Thanks for the reply, Ram. You're correct. By default (keep='first'), the method will only keep the first of the duplicate records. I thought this was sufficient for the OP's case because it sounded like he or she is more interested in getting the total number of unique records than getting the collapsed genotypes. If the desired output is a collapsed VCF, then the OP is advised to take a look at the pyvcf.VcfFrame.collapse method.

ADD REPLY
0
Entering edit mode
3.5 years ago
mbk0asis ▴ 700

Why don't you just paste(?) columns together as single strings and use 'uniq' command?

CHROM POS REF ALT    
1 1234 A T
1 1234 A T    
1 3457 T C    
1 5468 G C

Somethin like this?

sed 's/\t/_/g' FILE | sort | uniq -c

  2 1_1234_A_T
  1 1_3457_T_C
  1 1_5468_G_C
  1 CHROM_POS_REF_ALT

Numbers in front are the duplicate counts.

ADD COMMENT
0
Entering edit mode

I don't think the paste is required, uniq should work fine without it too.

ADD REPLY
0
Entering edit mode

Thanks this worked and also I got the unique ones with this too:

bcftools query -f '%CHROM %POS %REF %ALT\n' FILE > UNIQFILE

After this I just had to make some headers for this UNIQFILE to be a vcf file again. Also I have to find a way to get for these variants all of the information (for example SAMPLE QUAL...) but I think I can get it with bedtools intersect.

ADD REPLY
1
Entering edit mode

That doesn't make any sense, HL. You're not performing any uniquification operations. There should at least be a uniq before you redirect the output to UNIQFILE.

You either need to apply uniq directly to the data lines in the VCF file or you need to apply it to a subset of the columns. What you're doing here will just recreate the file after taking a few detours that do a bunch of stuff serving no purpose.

ADD REPLY

Login before adding your answer.

Traffic: 1849 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6