How to merge 20K single-sample VCFs *without* using plink or plink2?

0

Entering edit mode

2.9 years ago

kynnjo ▴ 70

I have ~20K single-sample VCFs, all covering the same sites, that I want to merge into a single ~20K-sample VCF.

In the past, I have used plink/plink2 to do this sort of merge, but I'm having trouble keeping REF/ALT alleles and ploidy straight[1].

I am looking for a tool that (unlike plink 1.x) does not mess at all with the values of the REF and ALT columns (even for indel variants). I'm not sure what's going on with chrX ploidy (0/0 genotypes getting converted to 0, 1/1 to 1, etc.), but I'd prefer a tool that (unlike plink2) automatically keeps the genotypes in my input VCFs.

Thank you in advance!

[1] The original version of this post included the plausibly-incorrect claim that plink/plink2 was designed for use-cases that are significantly different from the ones I am interested in. See @chrchang523's replies below for details on why my use-case might or might not be covered. Regardless, for this question, I do not want to elicit responses that aim at fixing my use of plink/plink2. My goal is to find a tool that is a better fit to my work.

genotyping vcf • 3.4k views

ADD COMMENT • link updated 2.8 years ago by chrchang523 11k • written 2.9 years ago by kynnjo ▴ 70

2

Entering edit mode

use bcftools merge.

ADD REPLY • link 2.9 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I am trying to merge ~20K single-sample VCF files. Each of these files has around 1.7M sites.

Can bcftools merge handle a merge of this many files quickly (i.e. in under 24h)?

The way I do this sort of thing within a reasonable time (though, unfortunately, I use plink/plink2 for this) is to convert the ~20K VCF files to BED files, I then merge the ~20K BED files (which is much quicker than merging the ~20K VCF files directly), and finally I convert the merged BED to VCF.

(I use plink for the first two steps above, and plink2 for the final step. Unfortunately, the final VCF has problems, as I mentioned.)

ADD REPLY • link 2.9 years ago by kynnjo ▴ 70

2

Entering edit mode

Can bcftools merge handle a merge of this many files quickly (i.e. in under 24h)?

sure, by parallelizing per chromosome, first merge per group of sqrt(20,000)=142 files and the merge the 142 files to create the final vcf per chromosome, and at the end , use bcftools concat to concatenate each chromosome.

ADD REPLY • link 2.9 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I am intrigued by this idea, but I am unable to fill in the omitted details. Would you mind posting somewhat more explicit code?

ADD REPLY • link 2.8 years ago by kynnjo ▴ 70

0

Entering edit mode

see my answer below... How to merge 20K single-sample VCFs *without* using plink or plink2?

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 166k

2

Entering edit mode

Please clarify that you were making the conscious choice to continue using plink 1.x as part of your old workflow, and that that was the cause of your REF/ALT issues. One of the main reasons for plink 2.0’s existence is addressing that problem. (Due to the simplicity of the .bed format, the currently-missing plink2 merge functionality can be replaced with a short shell script in this case.)

But as for 0/0 -> 0 and the like, yes, if your VCFs have fine-grained ploidy information that you want to keep, that is indeed outside of plink2’s scope. bcftools is the standard solution here.

ADD REPLY • link 2.9 years ago by chrchang523 11k

0

Entering edit mode

If you knew how much I resisted and fought using two different versions of plink in the same pipeline, you may have guessed that I must not have been able to find a way around it.

Indeed, the parts of the pipeline that I have been doing with plink 1.x are ones for which I have found explicit references online that they cannot be done with plink2 at the moment. I don't have these references onhand at the moment, but I remember that it was only after finding these explicit references that I finally threw in the towel, and accepted the necessity of using both plink 1.x and plink2 in the same pipeline.

ADD REPLY • link 2.8 years ago by kynnjo ▴ 70

1

Entering edit mode

As mentioned in my previous comment, the merge step that you're currently using plink 1.x for, because it "cannot be done with plink2 at the moment", can be done with a short shell script in your case, because all your files have the same variants in the same order. The key is to export the sample-major .bed format, rather than the regular variant-major format. From there, a for-loop involving cat and tail -c +4 does almost everything you need.

Any other commands where plink 1.x must be used instead of plink2 should be usable in a way where your REF/ALT allele ordering is unaffected.

I will assume for now that you really do have fine-grained ploidy information in your VCF that can't be handled properly by plink2, rather than the far more common scenario of pseudoautosomal vs. non-pseudoautosomal regions of chrX which is handled properly by --split-par during VCF import. Again, if you do have fine-grained ploidy information, then plink2 is not the right tool for the job; you won't be able to use the cat / tail -c +4 merge script that would otherwise be >100x as efficient as bcftools merge.

But please edit the question to make it clear that plink2 does not flip REF/ALT alleles. It's fine if you don't want to continue using plink even if it does cover your use cases -- especially if it's due to poor documentation/support that makes it impractical for you to use the program properly! -- but a major purpose of this forum is to provide accurate information to others, so I will edit away the misleading parts of your question if you don't.

ADD REPLY • link 2.8 years ago by chrchang523 11k

0

Entering edit mode

Thank you for your comments. I find them extremely helpful and informative.

I'm happy to correct the errors in my original post. Is this rewording of what I wrote in that post accurate?

I am looking for a tool that (unlike plink 1.x) does not mess at all with the values of the REF and ALT columns (even for indel variants), and that (unlike plink2) does not mess with genotypes (e.g. convert 0/0 to 0, 1/1 to 1, etc.), irrespective of which chromosome these genotypes appear in.

ADD REPLY • link 2.8 years ago by kynnjo ▴ 70

0

Entering edit mode

The first part is fine.

As for the second part, it would be useful to clarify how your chrX use case differs from the norm, because plink2 --split-par does in fact handle the normal chrX use case properly and you haven't given any indication so far that you understand this; you've only said that it "mess[es] with the genotypes" without describing the pattern. (It seems likely that the actual problem is that plink2 does not print a warning/error when --vcf is used without --split-par on a chrX-containing file; this will be addressed in the next release.)

ADD REPLY • link 2.8 years ago by chrchang523 11k

0

Entering edit mode

I wish I understood plink 1.x and plink2 well enough to write what you'd like me to write, but unfortunately I don't. (In fact, I suspect that if I had this level of understanding, it is pretty unlikely that I would have had to post this question in the first place, because I would have been able to Google for its answer far more precisely and effectively.)

Instead, I've done the best I can do, namely to alert the readers of my post to these shortcomings, and point them to your replies in this thread.

ADD REPLY • link 2.8 years ago by kynnjo ▴ 70

1

Entering edit mode

Okay, I have edited your post to remove misleading claims about plink/plink2, while keeping the "goal is to find a tool that is a better fit to my work" bottom line.

ADD REPLY • link 2.8 years ago by chrchang523 11k

2

Entering edit mode

2.9 years ago

Pierre Lindenbaum 166k

ok, using NEXTFLOW

	/* author Pierre Lindenbaum PhD Insitut du Thorax. Nantes. France */

	params.vcfs = ""

	workflow {
	main:
	merged_ch = mergeBcfs(params,params.vcfs)
	}

	workflow mergeBcfs {
	take:
	meta
	vcfs
	main:
	contigs_ch = all_chromosomes(meta,vcfs)
	groups_ch = split_list(meta,vcfs)


	each_contig = contigs_ch.contigs.splitCsv(header: false,sep:'\t',strip:true).map{T->T[0]}
	each_group = groups_ch.groups.splitCsv(header: false,sep:'\t',strip:true).map{T->T[0]}

	group_contig_ch = merge_group_contig(meta,each_group.combine(each_contig))
	contig_ch = merge_contig(meta,group_contig_ch.contig_vcf.groupTuple())
	final_ch = merge_all(meta,contig_ch.vcf.collect())
	emit:
	vcf = final_ch.vcf
	index = final_ch.csi
	}

	process all_chromosomes {
	tag "${file(vcfs).name}"
	input:
	val(meta)
	val(vcfs)
	output:
	path("chroms.txt"),emit:contigs
	script:
	"""
	set -o pipefail
	cat ${vcfs} \| while read V; do bcftools index -s "${V}" \| cut -f 1 ; done \| sort -T . \| uniq > chroms.txt
	"""
	}


	process split_list {
	executor "local"
	tag "${file(vcfs).name}"
	input:
	val(meta)
	val(vcfs)
	output:
	path("groups.list"),emit:groups
	script:
	def min_file_split = meta.min_file_split?:"100"
	"""
	SQRT=`awk 'END{X=NR;if(X<${min_file_split}){print(X);} else {z=sqrt(X); print (z==int(z)?z:int(z)+1);}}' "${vcfs}"`

	split -a 9 --additional-suffix=.list --lines=${SQRT} "${vcfs}" chunck.

	find ${PWD} -type f -name "chunck.*.list" > groups.list
	"""
	}


	process merge_group_contig {
	tag "${contig} ${file(group).name}"
	cpus 1
	input:
	val(meta)
	tuple val(group),val(contig)
	output:
	tuple val(contig),path("merged.0.bcf"),emit:contig_vcf
	path("merged.0.bcf.csi"),emit:csi
	script:
	"""
	bcftools merge --threads ${task.cpus} --file-list "${group}" --regions "${contig}" \
	-O b -o merged.0.bcf

	bcftools index --threads ${task.cpus} merged.0.bcf
	"""
	}

	process merge_contig {
	tag "${contig} N=${L.size()}"
	cpus 1
	input:
	val(meta)
	tuple val(contig),val(L)
	output:
	path("merged.1.bcf"),emit:vcf
	path("merged.1.bcf.csi"),emit:csi
	script:
	"""
	cat <<- EOF > tmp0.list
	${L.join("\n")}
	EOF


	# assert files are sorted in the same sample order for each contig
	xargs -a tmp0.list -L 1 echo \| while read F ; do bcftools query -l "${F}" \|\
	awk -vF=${F} '(NR==1){printf("%s,%s\n",$1,F);}' ; done \|\
	sort -t ',' -T . -k1,1 \| cut -d, -f 2 \| uniq > tmp.list

	bcftools merge --threads ${task.cpus} --file-list "tmp.list" --regions "${contig}" \
	-O b -o merged.1.bcf

	bcftools index --threads ${task.cpus} merged.1.bcf

	rm tmp0.list tmp.list
	"""
	}

	process merge_all {
	tag "N=${L.size()}"
	cpus 1
	input:
	val(meta)
	val(L)
	output:
	path("merged.bcf"),emit:vcf
	path("merged.bcf.csi"),emit:csi
	script:
	"""
	cat <<- EOF > tmp.list
	${L.join("\n")}
	EOF

	bcftools concat --no-version \
	--allow-overlaps --remove-duplicates \
	-O b -o "merged.bcf" --file-list tmp.list

	bcftools index --threads ${task.cpus} merged.bcf

	rm tmp.list
	"""
	}

view raw biostar9526679.nf hosted with ❤ by GitHub

Usage

find path/to/dir -type f -name "S*.vcf.gz" > jeter.list
nextflow run --vcfs ${PWD}/jeter.list biostar9526718.nf

add -C config.cfg toconfigure your cluster config....

view raw Usage.md hosted with ❤ by GitHub

ADD COMMENT • link 2.9 years ago by Pierre Lindenbaum 166k

Login before adding your answer.