Handling headers adds a complication, but after converting VCF files to BED, one could use the bedextract tool to split the BED files by chromosome very efficiently, take the multiset union of the per-chromosome BED files with bedops --everything, and then convert each per-chromosome result back to VCF.
Using bash, here's one way to convert person_{1..n}.vcf to BED files:
$ for vcfFn in person_*.vcf; do \
      vcf2bed < ${vcfFn} > ${vcfFn%.*}.bed; \
  done
Then split each BED file:
$ for bedFn in person_*.bed; do \
      for chrName in `bedextract --list-chr ${bedFn}`; do \
          bedextract ${chrName} ${bedFn} > ${bedFn%.*}.${chrName}.bed; \
      done; \
  done
Note: To speed this up further, the "outer" (per-BED) and "inner" (per-chromosome) loop steps are each parallelizable, either with a computational grid or a software approach like GNU Parallel.
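As a minimal sketch of parallelizing the outer loop — shown here with xargs -P, which behaves similarly to GNU Parallel and is available on most systems; the bedextract calls are the same as in the serial version:

```shell
# Split each person's BED file by chromosome, processing up to
# four files at once. -P sets the number of parallel jobs.
# (A sketch; assumes an xargs that supports -P, as both GNU and
# BSD xargs do.)
ls person_*.bed | xargs -P 4 -I{} sh -c '
    bedFn={}
    for chrName in `bedextract --list-chr ${bedFn}`; do
        bedextract ${chrName} ${bedFn} > ${bedFn%.*}.${chrName}.bed
    done
'
```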
Pick one person's file to get a list of chromosomes, then use that list to build the union of the per-chromosome BED files:
$ for chrName in `bedextract --list-chr person_1.bed`; do \
bedops --everything *.${chrName}.bed > everyone_${chrName}.bed; \
done
Next, use awk to re-arrange columns and re-index the coordinates from 0-based, half-open BED back to 1-based, fully-closed VCF form. This is the inverse of vcf2bed; I'll leave that to the reader. Additionally, each per-chromosome result would need a header prepended, perhaps built from the unique INFO fields of one or all inputs; a hash table in Python or Perl would deal with that second case pretty easily.
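For the column-shuffling part, here's a minimal sketch. It assumes the nine-column ordering that vcf2bed writes by default (chrom, start, stop, ID, QUAL, REF, ALT, FILTER, INFO) and a hypothetical input filename; the header step is left out:

```shell
# Rebuild the headerless VCF body from a vcf2bed-derived BED file:
# shift the 0-based start coordinate up by one to recover the
# 1-based POS, and restore the CHROM POS ID REF ALT QUAL FILTER INFO
# column order. (A sketch; assumes vcf2bed's default column layout.)
awk -v OFS='\t' '{ print $1, $2 + 1, $4, $6, $7, $5, $8, $9 }' \
    everyone_chr1.bed > everyone_chr1.noheader.vcf
```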
The problem I see with what you are doing is that, unless you have a line in your VCF for every single position, a lot of positions will be missing. When you combine samples and one sample has a variant at a position where another sample is blank, you won't know whether the second sample really is homozygous reference, or whether it was simply too poor in quality to call.
Hi swbarnes2 -
that's a really, really good point. So, let me ask you this... provided that I only ever intend to use this particular set of files for phasing, do you think this would still be an issue?
I care about lots of parts of the analysis, but currently with this subset of files, my only goal is to phase.
The reason I had elected to do this is that this is how the example dataset for BEAGLE is formatted... I was following their lead:
http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/
If you wanted to phase a few hundred whole genomes, what exactly would you do? Would you not generate multi-sample VCFs by chromosome? I now have the files in both formats, so I can use either... my plan was to group my samples together by ethnicity, then phase them against the 1kG reference files provided through the link above.
But I fully admit I am not experienced in doing this and would love to have your feedback.
Regards,
vlaufer