speeding up bcftools view

0

2.9 years ago

eb13 ▴ 20

Hi all - I have a very large multi-sample VCF file that I am trying to subset by a list of sample IDs. However, my current approach is very slow (>2 hr per chromosome), and I am wondering if there are any tricks to make it run faster on large files. Here is my current approach:

for file in /vcffiles/*.vcf.gz; do
    bcftools view -Oz -S sample_list.txt "$file" > /output/subset_"${file##*/}"
done

Thanks in advance for any suggestions!

vcf bcftools • 4.2k views

0

Maybe this link is useful: How to parallelize bcftools mpileup with GNU parallel?
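A minimal sketch of that idea applied to the `bcftools view` loop from the question. The function name `subset_commands` and its three arguments are hypothetical, and the sketch assumes GNU parallel is installed and that paths contain no spaces. The function only prints one command per input file; nothing runs until the output is piped to a job runner.

```shell
#!/bin/sh
# Print one bcftools command per input file; nothing is executed here.
# in_dir, out_dir and samples are placeholders chosen for this sketch.
subset_commands() {
    in_dir=$1; out_dir=$2; samples=$3
    for file in "$in_dir"/*.vcf.gz; do
        [ -e "$file" ] || continue   # skip the unexpanded glob when nothing matches
        printf 'bcftools view -Oz -S %s -o %s/subset_%s %s\n' \
            "$samples" "$out_dir" "${file##*/}" "$file"
    done
}

# Usage (GNU parallel runs each input line as a command; -j caps concurrent jobs):
#   subset_commands /vcffiles /output sample_list.txt | parallel -j 4
```

Printing the commands first makes it easy to inspect them before committing several hours of compute.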

updated 2.9 years ago by Ram 45k • written 2.9 years ago by mohammadhassanj ▴ 260

0

Thank you for your helpful responses!

2.9 years ago by eb13 ▴ 20

2

2.9 years ago

barslmn ★ 2.5k

Another solution with tsp and background processes.

# Set the maximum number of simultaneous jobs. Here we use the number of processors; you might want a different number.
tsp -S "$(nproc)"

# The rest is similar: we add tsp at the start of the command and & at the end.
# & launches all the tsp calls at once, but tsp queues the jobs and runs them in batches.
# The output is written with -o because a shell redirect would capture tsp's own output, not the job's.
for file in /vcffiles/*.vcf.gz; do
    tsp bcftools view -Oz -S sample_list.txt -o /output/subset_"${file##*/}" "$file" &
done
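
The output name in these loops is meant to be derived from the loop variable by stripping the directory part of the input path. A small sketch of that parameter expansion on its own, using a hypothetical file name:

```shell
file=/vcffiles/chr1.vcf.gz     # example input path (hypothetical file name)
name="subset_${file##*/}"      # ${file##*/} removes the longest prefix ending in "/"
echo "/output/$name"           # prints /output/subset_chr1.vcf.gz
```

If the expansion refers to a variable that the loop never sets, it expands to an empty string and every iteration writes to the same output file.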

1

2.9 years ago

Pierre Lindenbaum 166k

Let's do it using Nextflow. I haven't tested it, so there may be some small bugs, but you get the idea.

	/* author Pierre Lindenbaum */

	params.vcfs="NO_FILE"
	params.samples="NO_FILE"

	workflow {
	each_vcf = Channel.fromPath(params.vcfs).splitText().map{it.trim()}

	c2vcf = CHROMS_IN_VCF(each_vcf)

	xch = EXTRACT_SAMPLE(file(params.samples), c2vcf.output.splitCsv(header:false,sep:',') )

	CONCAT(xch.output.groupTuple())
	}

	process CHROMS_IN_VCF {
	executor "local"
	input:
	val(vcf)
	output:
	path("chroms.tsv"),emit:output
	script:
	"""
	set -o pipefail
	bcftools index -s "${vcf}" | awk -F '\\t' '{printf("%s,${vcf}\\n",\$1);}' > chroms.tsv
	"""
	}


	process EXTRACT_SAMPLE {
	tag "${contig} / ${vcf}"
	cpus 6
	input:
	path(samples)
	tuple val(contig),val(vcf)
	output:
	tuple val(vcf),path("contig.bcf"),emit:output
	script:
	"""
	set -o pipefail
	bcftools view --threads 5 -O u -S "${samples}" "${vcf}" "${contig}" | bcftools view --min-ac 1 -O b -o contig.bcf
	bcftools index --threads ${task.cpus} contig.bcf
	"""
	}


	process CONCAT {
	tag "${vcf} N=${L.size()}"
	cpus 6
	input:
	tuple val(vcf),val(L)
	output:
	tuple path("${file(vcf).getSimpleName()}.bcf"),emit:output
	script:
	"""
	cat << EOF > tmp.list
	${L.join("\n")}
	EOF
	bcftools concat --threads ${task.cpus} --file-list tmp.list -a -O b -o "${file(vcf).getSimpleName()}.bcf"
	bcftools index --threads ${task.cpus} "${file(vcf).getSimpleName()}.bcf"
	rm tmp.list
	"""
	}
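
For readers without Nextflow, the same three steps (list the contigs, extract the samples per contig, concatenate the parts) can be sketched in plain shell. This is an untested outline under several assumptions: the function names `contig_names` and `subset_by_contig` are hypothetical, the input VCF is bgzipped and indexed, and contig names are safe to use as file names. The loop here runs sequentially; the speed-up comes from handing each contig to a job runner instead.

```shell
#!/bin/sh
# Read `bcftools index -s` output (tab-separated: contig, length, record count)
# on stdin and print only the contig names.
contig_names() { cut -f 1; }

# Hypothetical wrapper: subset one VCF per contig, then concatenate the parts.
subset_by_contig() {
    vcf=$1; samples=$2; outdir=$3
    list="$outdir/parts.list"
    : > "$list"                                   # start with an empty list file
    bcftools index -s "$vcf" | contig_names | while read -r contig; do
        part="$outdir/$contig.bcf"
        # Extract the samples for this contig, then drop sites with no alternate allele left
        bcftools view --threads 4 -O u -S "$samples" "$vcf" "$contig" |
            bcftools view --min-ac 1 -O b -o "$part"
        bcftools index "$part"
        printf '%s\n' "$part" >> "$list"
    done
    bcftools concat --file-list "$list" -a -O b -o "$outdir/subset.bcf"
    bcftools index "$outdir/subset.bcf"
}

# Usage: subset_by_contig /vcffiles/cohort.vcf.gz sample_list.txt /output
```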
