CLI solution
Check out the vcf_merge command I wrote:
$ fuc vcf_merge -h
usage: fuc vcf_merge [-h] [--how TEXT] [--format TEXT] [--sort] [--collapse]
                     vcf_files [vcf_files ...]

This command will merge multiple VCF files (both zipped and unzipped). It
essentially wraps the 'pyvcf.merge' method from the fuc API.

By default, only the GT subfield of the FORMAT field will be included in the
merged VCF. Use '--format' to include additional FORMAT subfields such as AD
and DP.

usage examples:
  $ fuc vcf_merge 1.vcf 2.vcf 3.vcf > merged.vcf

positional arguments:
  vcf_files      VCF files

optional arguments:
  -h, --help     show this help message and exit
  --how TEXT     type of merge as defined in `pandas.DataFrame.merge`
                 (default: 'inner')
  --format TEXT  FORMAT subfields to be retained (e.g. 'GT:AD:DP') (default:
                 'GT')
  --sort         use this flag to turn off sorting of records (default: True)
  --collapse     use this flag to collapse duplicate records (default: False)
API solution
If you are familiar with Python and plan to run additional analyses on the merged VCF (e.g. filtering), you can also use the pyvcf.merge method I wrote:
Assume we have the following data:
>>> from fuc import pyvcf
>>> data1 = {
... 'CHROM': ['chr1', 'chr1'],
... 'POS': [100, 101],
... 'ID': ['.', '.'],
... 'REF': ['G', 'T'],
... 'ALT': ['A', 'C'],
... 'QUAL': ['.', '.'],
... 'FILTER': ['.', '.'],
... 'INFO': ['.', '.'],
... 'FORMAT': ['GT:DP', 'GT:DP'],
... 'Steven': ['0/0:32', '0/1:29'],
... 'Sara': ['0/1:24', '1/1:30'],
... }
>>> data2 = {
... 'CHROM': ['chr1', 'chr1', 'chr2'],
... 'POS': [100, 101, 200],
... 'ID': ['.', '.', '.'],
... 'REF': ['G', 'T', 'A'],
... 'ALT': ['A', 'C', 'T'],
... 'QUAL': ['.', '.', '.'],
... 'FILTER': ['.', '.', '.'],
... 'INFO': ['.', '.', '.'],
... 'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP'],
... 'Dona': ['./.:.', '0/0:24', '0/0:26'],
... 'Michel': ['0/1:24', '0/1:31', '0/1:26'],
... }
>>> vf1 = pyvcf.VcfFrame.from_dict([], data1)
>>> vf2 = pyvcf.VcfFrame.from_dict([], data2)
>>> vf1.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara
0 chr1 100 . G A . . . GT:DP 0/0:32 0/1:24
1 chr1 101 . T C . . . GT:DP 0/1:29 1/1:30
>>> vf2.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Dona Michel
0 chr1 100 . G A . . . GT:DP ./.:. 0/1:24
1 chr1 101 . T C . . . GT:DP 0/0:24 0/1:31
2 chr2 200 . A T . . . GT:DP 0/0:26 0/1:26
We can merge the two VcfFrames with how='inner' (default):
>>> pyvcf.merge([vf1, vf2]).df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara Dona Michel
0 chr1 100 . G A . . . GT 0/0 0/1 ./. 0/1
1 chr1 101 . T C . . . GT 0/1 1/1 0/0 0/1
We can also merge with how='outer':
>>> pyvcf.merge([vf1, vf2], how='outer').df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara Dona Michel
0 chr1 100 . G A . . . GT 0/0 0/1 ./. 0/1
1 chr1 101 . T C . . . GT 0/1 1/1 0/0 0/1
2 chr2 200 . A T . . . GT ./. ./. 0/0 0/1
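To make the difference between how='inner' and how='outer' concrete, here is a rough plain-Python model of the two merge types, using the GT values from the example above. It is only a sketch: merge_genotypes is a hypothetical helper written for this illustration, it is not how fuc implements pyvcf.merge, and it assumes a record is identified by (CHROM, POS, REF, ALT).

```python
# Illustrative only: a plain-Python model of inner vs. outer merging.
# merge_genotypes is a hypothetical helper, not part of the fuc API.

def merge_genotypes(left, right, how="inner", missing="./."):
    """Merge two non-empty {variant_key: [genotypes]} tables on their keys."""
    if how == "inner":
        # Keep only variants present in both tables.
        keys = [k for k in left if k in right]
    elif how == "outer":
        # Keep variants present in either table.
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unsupported how: {how!r}")
    n_left = len(next(iter(left.values())))
    n_right = len(next(iter(right.values())))
    merged = {}
    # Plain tuple sort; real VCF sorting would need natural chromosome order.
    for key in sorted(keys):
        merged[key] = (left.get(key, [missing] * n_left)
                       + right.get(key, [missing] * n_right))
    return merged

# Keys are (CHROM, POS, REF, ALT); values are the GT of each sample.
vf1_gt = {
    ("chr1", 100, "G", "A"): ["0/0", "0/1"],  # Steven, Sara
    ("chr1", 101, "T", "C"): ["0/1", "1/1"],
}
vf2_gt = {
    ("chr1", 100, "G", "A"): ["./.", "0/1"],  # Dona, Michel
    ("chr1", 101, "T", "C"): ["0/0", "0/1"],
    ("chr2", 200, "A", "T"): ["0/0", "0/1"],
}

inner = merge_genotypes(vf1_gt, vf2_gt, how="inner")
outer = merge_genotypes(vf1_gt, vf2_gt, how="outer")
```

With how='outer', the chr2 record that exists only in the second table is kept, and the samples from the first table are filled with the missing genotype './.', which matches the merged output shown above.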
Since both VcfFrames have the DP subfield, we can use format='GT:DP':
>>> pyvcf.merge([vf1, vf2], how='outer', format='GT:DP').df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Steven Sara Dona Michel
0 chr1 100 . G A . . . GT:DP 0/0:32 0/1:24 ./.:. 0/1:24
1 chr1 101 . T C . . . GT:DP 0/1:29 1/1:30 0/0:24 0/1:31
2 chr2 200 . A T . . . GT:DP ./.:. ./.:. 0/0:26 0/1:26
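If it helps to see what retaining FORMAT subfields means for a single sample value, here is a small sketch. select_subfields is a hypothetical helper written for this illustration and is not part of fuc; it also assumes that a requested subfield absent from the record should be written as '.'.

```python
# Illustrative only: keep a subset of FORMAT subfields for one sample value.
# select_subfields is a hypothetical helper, not part of the fuc API.

def select_subfields(format_field, sample_value, keep):
    """Return the sample value restricted to the subfields named in `keep`."""
    names = format_field.split(":")    # e.g. ['GT', 'DP']
    values = sample_value.split(":")   # e.g. ['0/1', '29']
    lookup = dict(zip(names, values))
    # Assumption: a requested subfield that is absent becomes '.'.
    return ":".join(lookup.get(name, ".") for name in keep.split(":"))

gt_only = select_subfields("GT:DP", "0/1:29", "GT")
gt_dp = select_subfields("GT:DP", "0/1:29", "GT:DP")
gt_ad = select_subfields("GT:DP", "0/1:29", "GT:AD")
```

Here gt_only drops the depth and keeps just the genotype, which mirrors the default 'GT' behaviour described above, while gt_dp keeps both subfields.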
Dear @Pierre, I am still having this issue while merging multiple VCF files. Could you please share any suggestions to fix it? Thank you!