I have been trying to search how could I get the amount of unique variants from a very big VCF file with linux terminal, without any results. Also I would like to get the unique variants to a different file without any duplicates.
I have been trying to search how could I get the amount of unique variants from a very big VCF file with linux terminal, without any results. Also I would like to get the unique variants to a different file without any duplicates.
Check out the pyvcf.VcfFrame.drop_duplicates
method I wrote:
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['1', '1', '1', '1'],
... 'POS': [1234, 1234, 3457, 5468],
... 'ID': ['.', '.', '.', '.'],
... 'REF': ['A', 'A', 'T', 'G'],
... 'ALT': ['T', 'T', 'C', 'C'],
... 'QUAL': ['.', '.', '.', '.'],
... 'FILTER': ['.', '.', '.', '.'],
... 'INFO': ['.', '.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT', 'GT'],
... 'A': ['0/1', './.', '0/1', '0/0'],
... 'B': ['./.', '0/1', '0/0', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B
0 1 1234 . A T . . . GT 0/1 ./.
1 1 1234 . A T . . . GT ./. 0/1
2 1 3457 . T C . . . GT 0/1 0/0
3 1 5468 . G C . . . GT 0/0 0/1
>>> filtered_vf = vf.drop_duplicates(['CHROM', 'POS', 'REF', 'ALT'])
>>> filtered_vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B
0 1 1234 . A T . . . GT 0/1 ./.
1 1 3457 . T C . . . GT 0/1 0/0
2 1 5468 . G C . . . GT 0/0 0/1
>>> filtered_vf.to_file('out.vcf')
Thanks for the reply, Ram. You're correct. By default (keep='first'
), the method will only keep the first of the duplicate records. I thought this was sufficient for the OP's case because it sounded like he or she is more interested in getting the total number of unique records than getting the collapsed genotypes. If the desired output is a collapsed VCF, then the OP is advised to take a look at the pyvcf.VcfFrame.collapse
method.
Why don't you just paste(?) columns together as single strings and use 'uniq' command?
CHROM POS REF ALT
1 1234 A T
1 1234 A T
1 3457 T C
1 5468 G C
Somethin like this?
sed 's/\t/_/g' FILE | sort | uniq -c
2 1_1234_A_T
1 1_3457_T_C
1 1_5468_G_C
1 CHROM_POS_REF_ALT
Numbers in front are the duplicate counts.
Thanks this worked and also I got the unique ones with this too:
bcftools query -f '%CHROM %POS %REF %ALT\n' FILE > UNIQFILE
After this I just had to make some headers for this UNIQFILE to be a vcf file again. Also I have to find a way to get for these variants all of the information (for example SAMPLE QUAL...) but I think I can get it with bedtools intersect.
That doesn't make any sense, HL. You're not performing any uniquification operations. There should at least be a uniq
before you redirect the output to UNIQFILE
.
You either need to apply uniq
directly to the data lines in the VCF file or you need to apply it to a subset of the columns. What you're doing here will just recreate the file after taking a few detours that do a bunch of stuff serving no purpose.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
What do you mean by 'unique variants'?
I would need to find variants that has same CHROM POS REF and ALT. And then remove the duplicates so that I would get these variants only once to the new file. Also I would need to know the amount of duplicated variants.
For example this file I need to know that the total amount of variants is 4 and amount of different variants is 3.
And then I need to get the different variants to their own file like this.
Maybe try bcftools:
Thanks, this print all the ID:s but I would need only the ones that has two or more same CHROM POS REF and ALT
HL : Please do not delete posts that have received comments/answers.