Hello everyone
Do anyone know any tool or script to edit the header and add a "chr" prefix to the chromosome name header directly in a .vcf.gz file? I don't want even to uncompress it, since I have the experience that a huge .vcf can't be compressed by bgzip, at least in my computer.
I am working with some genomic files in the .bam and .vcf format, I tried to retrieve some genic regions, and I already sorted that out, beginning with a large one human chromosome .bam file.
When working with that whole chromosome file, I realized that the header had the name of the chromosome without the prefix "chr", only the number, and that gave me a hard time when trying to run mpileup just in the middle of the workflow.
Now, following my working path, I got a big one chromosome .vcf.gz file, which I indexed with tabix in order to retrieve the desired region with ease, but I get the same problem as before, the name of the chromosome is lacking the "chr" prefix, which happens to be not compatible with the .fasta reference file it needs to run the command, just now beginning with a .vcf.gz file.
I thought about going back from .vcf.gz to bam, on which I already know the syntax to edit the headers, but that means doing around four file transformations. That will spend lots of time.
Thanks in advance for any orientation.
without uncompressing it ? no, you cannot.
otherwise, you can use
sed
I don't think the OP realises that reheadering a VCF will do nothing. Like SAM, the chromosome name in plain text string format in every row. It seems they're looking for a VCF equivalent of
samtools reheader -i
. That trick would only work for a BCF, and even then, it would break variants in breakend notation.Thanks for the observation, yes, as you said, headers had nothing to do with the problem, i had to re-name the chromosome column, which is the first one on the VCF file, i solved the problem handling the text in the file using awk.
alright, I guessed I couldn't do it without decompressing, but I was just describing the trouble. I am trying your script right now, as I read it, it should give an output in the .vcf.gz format, going through an intermediate decompressing... during which prefix chr should be added, right?
... I used that sed command before, in a bam file, just chatting a bit in order to get the grip on what is being done to the files.
greetings
There was another post on Biostars on this topic too: VCF files: Change Chromosome Notation
You could also avoid decompressing the vcf.gz by just using bcftools:
That will input vcf.gz and output vcf.gz
Still doesn't avoid decompressing. There is literally no way to read a gzip file without decompressing from the first byte.
EDIT: Unless I'm mistaken on the compression algorithm used by bgzip and it's not LZ77-based.
You're right but I meant explicitly decompressing with gunzip.
gunzip -c
is essentially the same asbcftools view
thoughIn this context, yeah. I don't know why OP wants to avoid decompression. Maybe avoid a decompressed intermediate file by streaming it?
Decompressing and recompressing with gzip is slow as it is typically limited to one core (although see
pigz
andpbgzip
). If the VCF were block compressed it would technically be possible to decompress just the block(s) containing the VCF header, write it out with the edited header, and then directly concatenate the subsequent blocks without any decompression/recompression of those blocks (this is the trick that samtools reheader uses), but AFAIK no tool currently implements this for VCF.Edit: bcftools looks to have a reheader command - I don't know if it uses the speed trick though
That is useful information! Thank you!
( posted as answer )