3.4 years ago
shpak.max
I attempted to create a consensus FASTA file using bcftools, as follows:
bgzip -c All_SRR_SNP_Clean.vcf > All_SRR_SNP_Clean.vcf.gz
tabix All_SRR_SNP_Clean.vcf.gz
cat $ref| bcftools consensus $vcf_dir/All_SRR_SNP_Clean.vcf.gz > consensus.fasta
where $ref is the path to a Drosophila reference genome fa and the vcf was generated from an mpileup combining 4 different poolseq samples.
I get a parse error message:
[W::bcf_hdr_register_hrec] The type "FLoat" is not supported, assuming "String"
[W::bcf_hdr_parse] Could not parse header line: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR5647735.1.realign.bam SRR8439151.1.realign.bam SRR8439156.1.realign.bam
[E::bcf_hdr_parse] Could not parse the header, sample line not found
Failed to read from /home/mshpak/Lundflies/bams/unsorted/round1/VCF/Three_Files/All_SRR_SNP_Clean.vcf.gz: could not parse header
Several threads from 2-3 years ago referenced similar errors using bcftools, e.g.
Probable bug in bcftools while parsing headers
but they don't indicate a satisfactory resolution. As I have the most recent version of bcftools, it doesn't seem like the problem has been corrected, so is there a patch or work-around available?
Please post the header lines, and if there are too many header entries, host the file somewhere. Try to address issues like:
[E::bcf_hdr_parse] Could not parse the header, sample line not found
I also do not understand this path: Failed to read from /home/mshpak/Lundflies/bams/unsorted/round1 /VCF/Three_Files/All_SRR_SNP_Clean.vcf.gz: could not parse header
(There is a gap between directories. I am not sure whether this is a typo or whether the input to bcftools really looks like that.)
The break between directories in the path name was a formatting error in my post, not an error in the script.
The vcfs were generated using PoolSNP and are fairly standard in their format, e.g. commented lines followed by:
where SRR...realign.bams are 3 source bam files for the mpileup that I used.
As far as I can tell, the vcf is in the standard format used by bcftools convert (rather than GATK's vcf format)
Check whether this tutorial by @finswimmer on consensus generation with bcftools is helpful, and check your files. If you still face issues and are confident that you are doing everything right and the program is misbehaving, please reach out to the developers. The devs for this tool are responsive and user-friendly, IMO. With the data you have furnished here, it is not possible (for me) to understand what is going on.
VCF is a bit odd in that those "commented lines" aren't comments! They are the headers it is complaining about.
Just because they've been produced using a standard tool doesn't necessarily mean they are correct. :-) It could be an error from either PoolSNP or Bcftools, but without being able to see the data it's impossible to tell where the problem lies.
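One way to inspect those header lines without posting the whole file: the first warning in the log complains about a type "FLoat", and the VCF specification defines only Integer, Float, Flag, Character and String as types for INFO/FORMAT fields. Below is a minimal sketch, assuming a POSIX shell with awk, that tallies the Type= values found in the ## lines and flags anything outside that list. The name check_header_types is made up for illustration and is not part of bcftools.

```shell
# check_header_types: hypothetical helper, not a bcftools command.
# Reads uncompressed VCF text on stdin and prints one line per distinct
# Type= value seen in ##INFO / ##FORMAT header lines:
#   <type> <TAB> <count> <TAB> ok|UNSUPPORTED
check_header_types() {
    awk '
        !/^##/ { exit }                         # stop at the first non-meta line (END still runs)
        /^##(INFO|FORMAT)=</ {
            if (match($0, /Type=[^,>]*/)) {
                type = substr($0, RSTART + 5, RLENGTH - 5)   # text after "Type="
                count[type]++
            }
        }
        END {
            for (t in count) {
                status = (t ~ /^(Integer|Float|Flag|Character|String)$/) ? "ok" : "UNSUPPORTED"
                printf "%s\t%d\t%s\n", t, count[t], status
            }
        }
    '
}
```

Since bgzip output can be read by gzip, something like `gzip -dc All_SRR_SNP_Clean.vcf.gz | check_header_types` should work on the file from the question. Note that the "FLoat" message is only a warning in the log above, so cleaning it up would not necessarily make the fatal "sample line not found" error go away.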
For what it's worth "sample line not found" appears to be printed when it fails to find "#CHROM\tPOS", but it's a bit convoluted so it may also be a bail out from earlier parsing. (Also note the tab. Please double check it's a tab in your file and not spaces. That's not something we can tell in this medium.)
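The tab check can be done mechanically rather than by eye. A rough sketch, again assuming a POSIX shell with awk: it prints the #CHROM line with every tab rendered as a visible marker, and fails if the line does not begin with the eight fixed, tab-separated column names from the VCF specification. The name check_chrom_line is hypothetical.

```shell
# check_chrom_line: hypothetical helper, not a bcftools command.
# Reads uncompressed VCF text on stdin, shows the #CHROM line with tabs
# made visible, and returns non-zero if the fixed columns are not
# separated by single tab characters.
check_chrom_line() {
    awk '
        /^#CHROM/ {
            found = 1
            visible = $0
            gsub(/\t/, "<TAB>", visible)        # make tabs visible for inspection
            print visible
            expected = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
            if (index($0, expected) != 1) bad = 1
            exit                                # END still runs after exit
        }
        !/^#/ { exit }                          # reached the records without a #CHROM line
        END {
            if (!found) { print "no #CHROM line found"; exit 1 }
            if (bad)    { print "fixed columns are not tab-delimited"; exit 1 }
            print "fixed columns are tab-delimited"
        }
    '
}
```

Usage would be along the lines of `gzip -dc All_SRR_SNP_Clean.vcf.gz | check_chrom_line`. If the printed line shows spaces where `<TAB>` markers should be, that would match the "#CHROM\tPOS" lookup failing.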
I verified that the data fields are indeed delimited by \t rather than spaces, so something else may be wrong with the PoolSNP output format (I don't experience this issue when using bcftools on GATK-generated VCFs).
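If the delimiters are fine, another cheap sanity check on the output format is whether every record has the same number of tab-separated columns as the #CHROM line. This is only a sketch of one possible diagnostic, assuming a POSIX shell with awk, and is not a known cause of this particular error. The name check_field_counts is hypothetical.

```shell
# check_field_counts: hypothetical helper, not a bcftools command.
# Reads uncompressed VCF text on stdin, reports the number of columns in
# the #CHROM line, lists up to five records whose column count differs,
# and returns non-zero if any mismatch (or no #CHROM line) is found.
check_field_counts() {
    awk -F '\t' '
        /^##/     { next }                      # skip meta-information lines
        /^#CHROM/ { ncol = NF; print "header columns: " ncol; next }
        ncol && NF != ncol {
            bad++
            if (bad <= 5) print "line " NR ": " NF " fields"
        }
        END {
            if (!ncol) { print "no #CHROM line found"; exit 1 }
            print "records with a different column count: " (bad + 0)
            exit (bad > 0)
        }
    '
}
```

For example: `gzip -dc All_SRR_SNP_Clean.vcf.gz | check_field_counts`. This scans the whole file, so it may take a while on a genome-wide VCF.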