Help merging VCF files divided by chromosome
9.4 years ago
devenvyas ▴ 760

I am trying to use Python to merge a set of VCF files that vcftools and similar software cannot handle because of their non-standard format.

The files are divided by chromosome, and I want them in one file (without the per-file headers ending up in the middle of it).

OutFile = open('AltaiNeanderthal.vcf', 'w')
filelist = ['AltaiNea.hg19_1000g.1.mod_filtered.vcf', 'AltaiNea.hg19_1000g.2.mod_filtered.vcf', 'AltaiNea.hg19_1000g.3.mod_filtered.vcf', 'AltaiNea.hg19_1000g.4.mod_filtered.vcf', 'AltaiNea.hg19_1000g.5.mod_filtered.vcf', 'AltaiNea.hg19_1000g.6.mod_filtered.vcf', 'AltaiNea.hg19_1000g.7.mod_filtered.vcf', 'AltaiNea.hg19_1000g.8.mod_filtered.vcf', 'AltaiNea.hg19_1000g.9.mod_filtered.vcf', 'AltaiNea.hg19_1000g.10.mod_filtered.vcf', 'AltaiNea.hg19_1000g.11.mod_filtered.vcf', 'AltaiNea.hg19_1000g.12.mod_filtered.vcf', 'AltaiNea.hg19_1000g.13.mod_filtered.vcf', 'AltaiNea.hg19_1000g.14.mod_filtered.vcf', 'AltaiNea.hg19_1000g.15.mod_filtered.vcf', 'AltaiNea.hg19_1000g.16.mod_filtered.vcf', 'AltaiNea.hg19_1000g.17.mod_filtered.vcf', 'AltaiNea.hg19_1000g.18.mod_filtered.vcf', 'AltaiNea.hg19_1000g.19.mod_filtered.vcf', 'AltaiNea.hg19_1000g.20.mod_filtered.vcf', 'AltaiNea.hg19_1000g.21.mod_filtered.vcf', 'AltaiNea.hg19_1000g.22.mod_filtered.vcf']

for infile in filelist:
    openfile = open(infile, 'r')
    for Line in infile:
        Line=Line.strip('\n')
        if Line[0] != '#':
            OutFile.write(Line + '\n')
    openfile.close()
OutFile.close()

My output is basically every character of the file names, one per line:

A
l
t
a
i
N
e
a
.
h

Anyone know why it is doing this? Thanks!

(I know that my code, if it worked, would omit even the first header. I plan on adding that back manually.)

cat $(ls -1v  *.vcf) > Merged.vcf # (should concatenate individual vcf files in correct order if I remember correctly. I haven't tested it though)
grep -v "^#" Merged.vcf > Clean.vcf # (removes anything that starts with #)

Now get the header from any vcf file and add it to the top of the Clean.vcf file
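
Since the question asked about Python, here is a minimal sketch of that last step. The function name prepend_header and the file names in the usage note are hypothetical, and it assumes all '#' lines sit at the top of the file the header is taken from, which may not hold for a non-standard VCF.

```python
import shutil

def prepend_header(header_source, body_path, out_path):
    """Write the leading '#' lines of header_source, then all of body_path."""
    with open(out_path, "w") as out:
        with open(header_source) as src:
            for line in src:
                # Assumption: header lines come first, so stop at the first record.
                if not line.startswith("#"):
                    break
                out.write(line)
        with open(body_path) as body:
            # Stream the body across without loading it all into memory.
            shutil.copyfileobj(body, out)
```

Something like prepend_header("AltaiNea.hg19_1000g.1.mod_filtered.vcf", "Clean.vcf", "Final.vcf") would then produce the merged file with a single header.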


Yeah, that, or this should also work, I think (both are for a bash shell on Linux, by the way):

for f in /path/to/folder/*; do cat "$f" | grep "^#" >> outfile.vcf; done
for f in /path/to/folder/*; do cat "$f" | grep -v "^#" >> outfile.vcf; done

And if you want only one of the headers, just pick that specific file instead of iterating over the folder.
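
One caveat when iterating over a folder: the shell expands * in lexicographic order, so chromosome 10 comes before chromosome 2. If the order matters, a rough Python sketch of sorting by chromosome number could look like this. The helper chromosome_key is hypothetical and assumes the chromosome is the first all-digit field between two dots, as in the file names from the question.

```python
import re

def chromosome_key(filename):
    """Sort key based on the first all-digit field between dots, e.g. '.10.' -> 10."""
    match = re.search(r"\.(\d+)\.", filename)
    # Names without a numeric field sort after the numbered ones.
    return (0, int(match.group(1))) if match else (1, filename)

names = [
    "AltaiNea.hg19_1000g.10.mod_filtered.vcf",
    "AltaiNea.hg19_1000g.2.mod_filtered.vcf",
    "AltaiNea.hg19_1000g.1.mod_filtered.vcf",
]
# Plain sorted(names) would give 1, 10, 2; the key gives 1, 2, 10.
ordered = sorted(names, key=chromosome_key)
```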


Just a quick question: do you have access to a Linux system?

9.4 years ago
Steven Lakin ★ 1.8k

I'd still recommend using the shell as described in the above comments, but to answer your Python question: your inner loop is iterating over the file name string, not over the file you opened:

for infile in filelist:
    openfile = open(infile, 'r')
    for Line in infile:

You called the inner for statement on infile, which is the file name (a string), so it yields one character at a time. It should run on openfile instead:

for infile in filelist:
    openfile = open(infile, 'r')
    for Line in openfile:
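
To see the difference in isolation, here is a small sketch. The file name and contents are made up, and io.StringIO only stands in for an opened file so that nothing is needed on disk.

```python
import io

filename = "AltaiNea.vcf"
# Iterating over a str yields single characters.
chars = [c for c in filename][:3]

# Iterating over a file object yields whole lines, newline included.
fake_file = io.StringIO("#header\n1\t100\tA\tG\n")
lines = [line for line in fake_file]
```

Here chars holds the first three letters of the name, while lines holds the two full lines of the stand-in file.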

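Try this instead.

On Windows:

import glob
# Note: glob returns files in arbitrary order; sort the list if chromosome order matters.
filelist = glob.glob(r"C:\my\folder\filepath\*")
with open(r"C:\my\folder\filepath\outFile.vcf", "w") as outFile:
    for file in filelist:
        with open(file, "r") as f:
            for line in f:  # iterate over lines, not characters
                if not line.startswith("#"):  # skip header lines
                    outFile.write(line)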

On Linux you can use normal strings for the file paths instead of raw strings.
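
If you would rather keep the header from the first file than add it back by hand, a possible variation is sketched below. The function name merge_vcfs is hypothetical, and the file list is rebuilt from the naming pattern in the question, assuming chromosomes 1 to 22 are all present.

```python
def merge_vcfs(paths, out_path):
    """Concatenate VCFs, keeping '#' header lines from the first file only."""
    with open(out_path, "w") as out:
        for index, path in enumerate(paths):
            with open(path) as handle:
                for line in handle:
                    # Headers are copied from the first file and skipped afterwards.
                    if line.startswith("#") and index > 0:
                        continue
                    out.write(line)

# File names follow the pattern in the question, chromosomes 1 to 22 in order.
filelist = [f"AltaiNea.hg19_1000g.{n}.mod_filtered.vcf" for n in range(1, 23)]
# merge_vcfs(filelist, "AltaiNeanderthal.vcf")
```

Using startswith rather than line[0] also avoids an IndexError if a stripped line turns out to be empty.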
