Entering edit mode
6.2 years ago
shubhra.bhattacharya
▴
140
Hello! I've been trying to generate consensus sequence from .bam file using bcftools and vcftools. The command for which is:
samtools-1.3/samtools mpileup -C50 -uf Eucalyptus_Gene_Sequences.fasta 133.bam.sorted.bam | bcftools-1.9/bcftools call -c | bcftools-1.9/vcfutils.pl vcf2fq -d 8 -D 100 > out.fastq
However it throws an error:
Could not parse the header line: "##contig=<ID=Eucgr.F02756.1_MYB52,BW52,length=1872>"
Could not parse the header line: "##contig=<ID=Eucgr.F03879.1_EMB14,EMB177,EMB33,SUS2,length=7745>"
Could not parse the header line: "##contig=<ID=Eucgr.F03935.1_EMB14,EMB177,EMB33,SUS2,length=7556>"
Could not parse the header line: "##contig=<ID=Eucgr.F04263.1_GUX3,PGSIP2,length=2250>"
Could not parse the header line: "##contig=<ID=Eucgr.G00889.1_REM39,VRN1,length=1879>"
Could not parse the header line: "##contig=<ID=Eucgr.G01977.1_GUT2,IRX10,length=2457>"
All of these contigs have a comma in their name and bcftools cannot parse these. What is a workaround this?
https://github.com/samtools/bcftools/issues/266 reports this but this has not been fixed in the latest version as well (1.9) .
--Thanks
How have you constructed your Eucalyptus_Gene_Sequences.fasta file? Does this come from an Egrandis reference FASTA file that we can download (what is the URL?), and what processing (if any) did you do after downloading to make the Eucalyptus_Gene_Sequences.fasta you're using?
Hello,
does it work if you quote the ID's in the header like this?
fin swimmer
Most likely yes, that will work with bcftools — but there is an extent to which that is a bcftools extension rather than something that is guaranteed in the format, which is vague about quoting strings. (So it may or may not help for other tools reading your VCF.) Moreover, we are moving to forbid commas in particular in reference contig names in SAM and VCF — see https://github.com/samtools/hts-specs/pull/333 et al, hence my question about where this file comes from.
Eucalyptus grandis reference files from ftp://ftp.plantgenie.org/Data/EucGenIE/Eucalyptus_grandis/v2.0/FASTA/ contain header lines like
which ought to lead to contig names like
Eucgr.F02756.1
as the (vague) convention is to stop at whitespace when picking reference sequence names out of FASTA headers. It rather looks like the OP has instead converted spaces to underscores and fallen victim to the comma inMYB52,BW52
in what should have been informally a metadata field rather than part of the contig name. (But I haven't yet found a eucalyptus reference file withMYB52
in it, hence the question about where this comes from.)So either the OP should be truncating the contig names at the space/underscore, or whoever generated the file should probably be converting the commas to underscores or some other character too. How much of a problem this is depends on whether it is the OP doing it locally, or the eucalyptus reference people doing it for reference files made publicly available. The SAM & VCF format folks would particularly like to know about the latter if it is the case, as that could affect this sensible small restriction in the formats.
Hello John Marshall ,
nice to see this. I absolutely agree with this idea. With vcf4.3 it already started that the
INFO
and 'FORMATtags must much the regex
^[A-Za-z ][0-9A-Za-z .]*$`. It would be consequent to do this for contig names as well.fin swimmer
This thread is nearly 3 years old, and I'm getting the same error, so if indeed it is a bug it still hasn't been corrected (to my knowledge, I installed the most recent version)
Since March 2020, commas are disallowed in VCF (v4.3) CONTIG IDs. Whether you are using v4.3 or an earlier version, tools such as bcftools are likely to take advantage of the restriction on commas — so for an easy life you should avoid putting commas in any of your VCF CONTIG IDs.
You don't show your headers, so we don't know if they are exhibiting the same problem as the OP's comma-containing IDs like
Eucgr.F02756.1_MYB52,BW52
. If this is the problem, then it is not a bug — it is a sensible limitation in the VCF format and tools, one that you need to deal with.