Hello all
I am trying to generate a .fasta file from a large .vcf file.
In order to run bcftools' consensus tool, I need to bgzip-compress the .vcf file and then index it with tabix. However, when attempting to compress the file largefile.vcf, I get the following error:
"[bgzip] Value too large for defined data type: largefile.vcf"
I tried to compress it using gzip, and that succeeded. However, when I tried to index the resulting largefile.vcf.gz from the gzip compression, I got an error because it is a GZIP file, not the BGZIP file that tabix requires.
Does anyone know why the bgzip tool finds the value too large while gzip does not? I need the file bgzip-compressed in order to continue my workflow.
Any help would be greatly appreciated.
Cheers.
Link copied from the GNU coreutils FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Value-too-large-for-defined-data-type
It is basically saying that the bgzip on your machine (whether a prebuilt binary or compiled from source) was not built to handle large files. Please read the link above for a fuller explanation of the issue.
Hello cpad0112,
I have been going through the GNU website's explanation for a while now, and I understand the problem. I have been reading about the lseek function, which is able to report the length of a file and change the file offset, and about the off_t data type; apparently I am supposed to rebuild my GNU utilities so that off_t is 64 bits.
I don't know how to do that. I have tried, and I have updated and upgraded the gcc libraries on my computer, but it hasn't worked.
Do you have any idea how to define my file offsets as 64 bits, or how to compile my GNU utilities so they support large files?
What is your OS architecture? 64-bit? Try to recompile bgzip from the htslib sources. I could not find a standalone source (.tar.gz) file for bgzip.
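A sketch of rebuilding from the htslib sources, which provide bgzip and tabix together (the version number and URL pattern are assumptions; check the htslib releases page for the current release):

```shell
# Download and build htslib, which ships bgzip and tabix.
# A build on a modern 64-bit system gets a 64-bit off_t,
# i.e. large-file support, by default.
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar -xjf htslib-1.9.tar.bz2
cd htslib-1.9
./configure
make
sudo make install   # installs bgzip, tabix and the library
```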
In the meantime, try to do this:
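The command itself appears to have been lost from this post; a sketch of the kind of pipe that sidesteps the problem, since reading from stdin means bgzip never has to lseek within the huge input file (file names are placeholders):

```shell
# Stream the VCF into bgzip on stdin instead of letting bgzip
# open and seek the 37 GB file itself, then write the
# block-gzipped output to a new .vcf.gz file.
cat largefile.vcf | bgzip -c > largefile.vcf.gz
```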
Then try to index it.
Hello there,
Thanks! That script appeared at first to run neatly; it took some hours to deliver a .vcf in a binary form, but without the .vcf.gz suffix, just .vcf, and it produced the following error when I tried to index it with tabix:

z-VirtualBox:$ tabix Originalfile.hg18.chr4.vcf
[E::get_intv] failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: "c"
[E::hts_idx_push] unsorted positions
tbx_index_build failed: Originalfile.hg18.chr4.vcf

In order to continue my workflow, I need a .gz file that tabix can index.
Rename the file to .vcf.gz and run tabix on it. Let us know if it still fails. By the way, did you not send the output to a file with a .vcf.gz extension?
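A sketch of the rename-and-index step, with the format passed explicitly via -p since the error message asked about it (the file name is taken from the error output above and the rename assumes the file really is bgzip-compressed):

```shell
# Give the bgzip output its conventional suffix, then index it,
# telling tabix explicitly that the content is VCF.
mv Originalfile.hg18.chr4.vcf Originalfile.hg18.chr4.vcf.gz
tabix -p vcf Originalfile.hg18.chr4.vcf.gz
```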
Hello, I did give it the .vcf.gz extension. I'll re-run the task to see how it goes and let you know the result; thank you for the interaction. The task took about 6 hours, so I'll post back in a while.
Cheers
If it is taking that long, please try something else; I do not want you to experiment :) Or try with a partial VCF. Check whether your consensus-creating software accepts multiple VCF files. In that case, you can break your VCF down per chromosome and then pass the pieces to the consensus software. By the way, I tried the command on dbSNP chr20 (63 MB) and it worked fine. If it is not much work, you can do the following as well:
You can write loops for steps 1, 2 and 3: break down per chromosome and then index the files. Make sure the newly created files are stored somewhere other than the reference files (fasta and vcf).
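The per-chromosome split-and-index loop could be sketched like this, assuming bcftools and tabix are on the PATH, the input is already bgzipped and indexed, and the output directory is separate from the reference files; all file and directory names are placeholders:

```shell
# Split the VCF per chromosome and index each piece.
# `tabix -l` lists the chromosome names present in the index,
# so the loop matches whatever naming the file actually uses.
mkdir -p split
for chr in $(tabix -l input.vcf.gz); do
    bcftools view -r "$chr" input.vcf.gz -O z -o "split/${chr}.vcf.gz"
    tabix -p vcf "split/${chr}.vcf.gz"
done
```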
That's a neat workflow suggestion; I'll try it on some complete genomes I need to work with. But I must say that the 37 GB VCF file that brought me the original problem in this post is already a single-chromosome VCF... I think that's huge for just one chromosome, but it really is. I'm very interested in seeing what variants it contains.
Thanks again.
Hello there,
Well, I re-tried the compression, and it all seemed to go well. I have my compressed .vcf.gz file, and I can view it correctly using less, so I think it's good to go. Still, I get the same error when trying to tabix index it.
I have already posted about that trouble separately, since it is a different problem from the one in this thread; the bgzip problem I had was solved with your advice.
Cheers !
Good luck :)