Question

Help merging VCF files divided by chromosome


10.1 years ago

devenvyas ▴ 770

I am trying to use Python to merge a set of VCF files that cannot be handled by vcftools or any other similar software due to non-standard format.

The files are divided by chromosome, and I want them in one file (without the per-file headers getting caught in the middle).

OutFile = open('AltaiNeanderthal.vcf', 'w')
filelist = ['AltaiNea.hg19_1000g.1.mod_filtered.vcf', 'AltaiNea.hg19_1000g.2.mod_filtered.vcf', 'AltaiNea.hg19_1000g.3.mod_filtered.vcf', 'AltaiNea.hg19_1000g.4.mod_filtered.vcf', 'AltaiNea.hg19_1000g.5.mod_filtered.vcf', 'AltaiNea.hg19_1000g.6.mod_filtered.vcf', 'AltaiNea.hg19_1000g.7.mod_filtered.vcf', 'AltaiNea.hg19_1000g.8.mod_filtered.vcf', 'AltaiNea.hg19_1000g.9.mod_filtered.vcf', 'AltaiNea.hg19_1000g.10.mod_filtered.vcf', 'AltaiNea.hg19_1000g.11.mod_filtered.vcf', 'AltaiNea.hg19_1000g.12.mod_filtered.vcf', 'AltaiNea.hg19_1000g.13.mod_filtered.vcf', 'AltaiNea.hg19_1000g.14.mod_filtered.vcf', 'AltaiNea.hg19_1000g.15.mod_filtered.vcf', 'AltaiNea.hg19_1000g.16.mod_filtered.vcf', 'AltaiNea.hg19_1000g.17.mod_filtered.vcf', 'AltaiNea.hg19_1000g.18.mod_filtered.vcf', 'AltaiNea.hg19_1000g.19.mod_filtered.vcf', 'AltaiNea.hg19_1000g.20.mod_filtered.vcf', 'AltaiNea.hg19_1000g.21.mod_filtered.vcf', 'AltaiNea.hg19_1000g.22.mod_filtered.vcf']

for infile in filelist:
    openfile = open(infile, 'r')
    for Line in infile:
        Line=Line.strip('\n')
        if Line[0] != '#':
            OutFile.write(Line + '\n')
    openfile.close()
OutFile.close()

My output is basically every character of the file names, one per line:

A
l
t
a
i
N
e
a
.
h

Anyone know why it is doing this? Thanks!

(I know that my code, if it worked, would omit even the first header; I plan on adding that back manually.)



cat $(ls -1v  *.vcf) > Merged.vcf # (should concatenate individual vcf files in correct order if I remember correctly. I haven't tested it though)
grep -v "^#" Merged.vcf > Clean.vcf # (removes anything that starts with #)

Now take the header from any one of the VCF files and add it to the top of Clean.vcf.
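Since the question asked about Python, that last step could look roughly like the sketch below. The function name `prepend_header` is made up for illustration, and it assumes that all header lines start with `#` and sit together at the top of the source file, which may not hold for a non-standard file.

```python
def prepend_header(header_source, body_path, out_path):
    """Write the leading '#' lines of header_source, then all of body_path.

    Assumption: the header is one contiguous run of '#' lines at the top.
    out_path must be a different file from both inputs.
    """
    with open(out_path, "w") as out:
        with open(header_source) as src:
            for line in src:
                if not line.startswith("#"):
                    break  # first data record reached, header is done
                out.write(line)
        with open(body_path) as body:
            for line in body:
                out.write(line)
```

For example, `prepend_header("AltaiNea.hg19_1000g.1.mod_filtered.vcf", "Clean.vcf", "AltaiNeanderthal.vcf")` would write the chromosome 1 header followed by the contents of Clean.vcf.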

written 10.1 years ago by Ashutosh Pandey 12k • updated 2.6 years ago by Ram 45k


Yeah, that, or this should also work, I think (both are for the bash shell on Linux, by the way):

for f in /path/to/folder/*; do grep "^#" "$f" >> outfile.vcf; done
for f in /path/to/folder/*; do grep -v "^#" "$f" >> outfile.vcf; done

And if you want only one of the headers, just pick that specific file instead of iterating over the folder.

written 10.1 years ago by Steven Lakin ★ 1.8k • updated 2.6 years ago by Ram 45k


Just a quick question: do you have access to a linux system?

written 10.1 years ago by Steven Lakin ★ 1.8k

Ram · Accepted Answer · 2015-06-25

I'd still recommend using the shell as described in the comments above, but to answer your Python question: the inner loop in your code iterates over the file name rather than over the opened file:

for infile in filelist:
    openfile = open(infile, 'r')
    for Line in infile:

The inner for statement runs over infile, which is the file-name string, not over openfile, so every "line" is a single character of the name. It should be this:

for infile in filelist:
    openfile = open(infile, 'r')
    for Line in openfile:
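To see the difference in isolation, here is a small sketch. The file name and the two-line content are made up, and `io.StringIO` stands in for a real open file so that nothing has to exist on disk.

```python
import io

name = "AltaiNea.vcf"  # hypothetical file name, only a string

# Looping over a string yields one character at a time.
chars = [item for item in name]

# Looping over a file object yields one line at a time.
handle = io.StringIO("#header\n1\t100\tA\tG\n")  # stand-in for open(name)
lines = [item for item in handle]

print(chars[:5])
print(lines)
```

The first list holds single characters of the name, which matches the output in the question; the second holds whole lines, each ending in a newline.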

Try this instead:

On Windows:

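import glob

# sorted() is lexicographic (1, 10, 11, ..., 2), not numeric.
# On a re-run the pattern also matches outFile.vcf, so delete it first
# or narrow the pattern.
filelist = sorted(glob.glob(r"C:\my\folder\filepath\*"))
with open(r"C:\my\folder\filepath\outFile.vcf", "w") as outFile:
    for file in filelist:
        with open(file, "r") as f:
            for line in f:  # iterate over lines, not characters
                # Skip header lines; each line keeps its own newline.
                if not line.startswith("#"):
                    outFile.write(line)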

On Linux you can use normal strings for the file paths instead of raw strings, since they contain no backslashes.
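If the chromosome order matters and you want to keep a single header, something like the following sketch may help. The helper names `chromosome_key` and `merge_vcfs` are made up, and the sort key assumes file names like those in the question, with the chromosome number between two dots (`AltaiNea.hg19_1000g.10.mod_filtered.vcf`).

```python
import os
import re


def chromosome_key(path):
    """Sort key: first all-digit field between dots in the file name.

    Assumption: names look like 'AltaiNea.hg19_1000g.10.mod_filtered.vcf'.
    Names without such a field sort last, alphabetically.
    """
    name = os.path.basename(path)
    match = re.search(r"\.(\d+)\.", name)
    number = int(match.group(1)) if match else float("inf")
    return (number, name)


def merge_vcfs(paths, out_path):
    """Concatenate files in chromosome order, keeping '#' lines of the first only.

    out_path must not be one of the input paths.
    """
    ordered = sorted(paths, key=chromosome_key)
    with open(out_path, "w") as out:
        for index, path in enumerate(ordered):
            with open(path) as handle:
                for line in handle:
                    if index > 0 and line.startswith("#"):
                        continue  # drop headers of later files
                    out.write(line)
    return ordered
```

It could be called as `merge_vcfs(glob.glob("*.mod_filtered.vcf"), "AltaiNeanderthal.vcf")` after `import glob`. The output name does not match that pattern, so a re-run does not pick it up as input.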