Question

Convert VCF file to mpileup

3.1 years ago

Drew.Judell • 0

I am working on an iterative analysis that uses orthologous pipelines that require mpileup.txt files as input for a visualization step. This requires me to convert VCF files to mpileup.txt.

This experiment uses error-prone PCR and follows a somatic variant discovery workflow that was designed by a third party. I have written and implemented several orthologous pipelines with small differences from the original workflow in order to produce a truth set (variants identified consistently by all of the pipelines). The third party's workflow calls Samtools to produce mpileup.txt files as the final output, which necessitated the development of a custom visualization script. My orthologous pipelines use Mutect2 to produce VCF files that need to be compared to the mpileup.txt files.

If I could convert my VCF files to mpileup.txt files without loss of information, it would save me the significant time of writing a similar visualization script for VCF files and avoid the bias involved in converting the third party's output to VCF format. There are plenty of examples of using samtools and bcftools to produce an intermediate mpileup file that is then converted to a final, filtered VCF file, but I can't find much direction for converting a VCF file to an mpileup. Can anybody provide direction for this task?

bcftools vcftools samtools mpileup • 1.4k views

score 2 · Answer 1 · 2021-10-04

3.1 years ago

Istvan Albert 101k

It goes against the natural flow of analysis. If you need pileups, you should generate them from the BAM file.

Going backward poses various challenges since not all the original pileup information is retained in the VCF file.

You should post a few lines of your VCF file so that we can see what information is there. Even so I think you would need to write a custom program to parse the information and recreate the pileup from allele frequencies and depth of coverage. This may be quite challenging.

I should probably modify my post to include more information.

This experiment uses error-prone PCR and follows a somatic variant discovery workflow that was designed by a third party. I have written and implemented several orthologous pipelines with small differences from the original workflow in order to produce a truth set (variants identified consistently by all of the pipelines). The third party's workflow calls Samtools to produce mpileup.txt files as the final output, which necessitated the development of a custom visualization script. My orthologous pipelines use Mutect2 to produce VCF files that need to be compared to the mpileup.txt files. If I could convert my VCF files to mpileup.txt files without loss of information, it would save me the significant time of writing a similar visualization script for VCF files and avoid the bias involved in converting the third party's output to VCF format.

I cannot legally post the output, and I know this is the "opposite direction" from the one this type of analysis generally follows. Can you elaborate on why there is information loss from mpileup to VCF?

3.1 years ago by Drew.Judell • 0

The VCF only contains variants that made it through, whereas the pileup has the raw data.

Even more significantly, the variants called in the VCF may have gone through realignment or other adjustment processes and may no longer directly represent the original pileup.

That is a good thing: the original pileups are simply naive alignments of individual reads. A variant caller should look at all the naive reads and produce more accurate variants, possibly relocating some to match the big picture.

For really simple SNPs, some of the tags in the VCF file may contain enough information to create a "pretend" pileup, but it would be a quite ad hoc way of doing it. If the variants are really simple, then DP * AF would be the number of ALT bases you would need to generate, and DP * (1 - AF) would be the number of reference bases. But even so, a real pileup would require you to reconstruct the bases on each strand separately.
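To make that arithmetic concrete, here is a minimal Python sketch of the "pretend" pileup idea for a single simple SNP. The function names and the placeholder quality character are hypothetical, and it assumes you have already extracted DP and AF for the site from your VCF. It puts every base on the forward strand, which is exactly the strand information that cannot be recovered, so treat the output as an approximation and not a real pileup.

```python
from typing import Tuple


def pretend_pileup_counts(dp: int, af: float) -> Tuple[int, int]:
    """Split a depth DP into (ref_count, alt_count) using allele fraction AF."""
    if dp < 0 or not 0.0 <= af <= 1.0:
        raise ValueError("DP must be >= 0 and AF must be within [0, 1]")
    alt_count = round(dp * af)   # DP * AF, rounded to a whole number of reads
    ref_count = dp - alt_count   # DP * (1 - AF), kept consistent with the total
    return ref_count, alt_count


def pretend_pileup_line(chrom: str, pos: int, ref_base: str,
                        alt_base: str, dp: int, af: float) -> str:
    """Build one tab-separated, pileup-like line for a simple SNP.

    Assumptions (not recoverable from DP/AF alone):
    - every read is written as forward strand ('.' for REF, upper case for ALT)
    - base qualities are a constant placeholder character
    """
    ref_count, alt_count = pretend_pileup_counts(dp, af)
    bases = "." * ref_count + alt_base.upper() * alt_count
    quals = "I" * dp  # placeholder qualities, purely illustrative
    return "\t".join([chrom, str(pos), ref_base, str(dp), bases, quals])


if __name__ == "__main__":
    print(pretend_pileup_line("chr1", 10, "A", "G", 4, 0.5))
```

Indels, multi-allelic sites, and sites that never made it into the VCF are not covered by this sketch.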

For anything more complicated I don't think you could reconstruct the pileup.

3.1 years ago by Istvan Albert 101k

If your aim is to compare the VCF against mpileup.txt to see what differs, I'd argue the more sensible route is to write a noddy (simple) VCF generator for the mpileup and compare the resulting VCFs (e.g. with bcftools isec), since tools exist for that direction. Trying to generate mpileup data from the VCF is probably impossible.

However, as mentioned above, it's not the natural flow of things and I'm not aware of tools to do this. (Plus good luck searching for it, given that mpileup is both a text format and a command that was traditionally piped into bcftools to generate VCF.)
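As a rough illustration of the mpileup-to-VCF direction, the sketch below counts bases in the read-bases column of a single-sample pileup line and lists candidate SNPs. The function names and the `min_alt_fraction` threshold are hypothetical, indels are skipped entirely, and this is not a variant caller. It only shows that the pileup text carries enough information to go this way.

```python
import re
from collections import Counter


def count_pileup_bases(bases: str, ref_base: str) -> Counter:
    """Count base observations in a pileup read-bases string (indels skipped)."""
    counts = Counter()
    ref_base = ref_base.upper()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":                      # read start, followed by a MAPQ character
            i += 2
        elif c in "+-":                   # indel: sign, length, then that many bases
            m = re.match(r"\d+", bases[i + 1:])
            if m is None:                 # malformed; skip just the sign
                i += 1
            else:
                i += 1 + len(m.group()) + int(m.group())
        elif c in ".,":                   # match to reference on either strand
            counts[ref_base] += 1
            i += 1
        elif c.upper() in "ACGTN":        # mismatch on either strand
            counts[c.upper()] += 1
            i += 1
        else:                             # '$' read end, '*' deletion placeholder, etc.
            i += 1
    return counts


def pileup_line_to_calls(line: str, min_alt_fraction: float = 0.01):
    """Return (chrom, pos, ref, alt, alt_count, total) tuples for one pileup line.

    min_alt_fraction is an arbitrary illustrative threshold, not a recommendation.
    """
    chrom, pos, ref, _depth, bases = line.rstrip("\n").split("\t")[:5]
    counts = count_pileup_bases(bases, ref)
    total = sum(counts.values())
    calls = []
    for base, n in sorted(counts.items()):
        if base in (ref.upper(), "N") or total == 0:
            continue
        if n / total >= min_alt_fraction:
            calls.append((chrom, int(pos), ref.upper(), base, n, total))
    return calls


if __name__ == "__main__":
    example = "chr1\t5\tA\t8\t..,^I.G$g+2AT.-1C,\tIIIIIIII"
    print(pileup_line_to_calls(example))
```

Records like these could be written out as VCF lines and then compared with the Mutect2 output, for example with bcftools isec as suggested above.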

3.1 years ago by jkbonfield ★ 1.3k