1000G Tabix download: EOF marker is absent
3.6 years ago
Guilherme ▴ 40

I want to download the latest release of the phased 1000 Genomes (high coverage) data, which is in the hg38 build, but only for a set of samples (203 samples, to be precise)...

I have used the command line:

tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased.vcf.gz chr1 | vcf-subset -c Sample_1kgp.txt | bgzip -c > CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased_out.vcf.gz

It starts to download, but eventually I get this error, which appears to be random (sometimes it happens right after the download starts, and sometimes after more than an hour):

[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
Broken VCF: empty columns (trailing TABs) starting at chr1:35966205.
Wrong number of fields; expected 3211, got 1926.

and this error at the end too:

at /usr/local/Cellar/vcftools/0.1.16/lib/perl5/site_perl/Vcf.pm line 172, <STDIN> line 968801.
    Vcf::throw(Vcf4_1=HASH(0x7fdcbe8b2c40), "Wrong number of fields; expected 3211, got 1926. The offendin"...) called at /usr/local/Cellar/vcftools/0.1.16/lib/perl5/site_perl/Vcf.pm line 507
    VcfReader::next_data_hash(Vcf4_1=HASH(0x7fdcbe8b2c40)) called at /usr/local/Cellar/vcftools/0.1.16/lib/perl5/site_perl/Vcf.pm line 3479
    Vcf4_1::next_data_hash(Vcf4_1=HASH(0x7fdcbe8b2c40)) called at /usr/local/Cellar/vcftools/0.1.16/libexec/bin/vcf-subset line 146
    main::vcf_subset(HASH(0x7fdcbd8243c0)) called at /usr/local/Cellar/vcftools/0.1.16/libexec/bin/vcf-subset line 12

Any suggestions on how to solve this?

Thanks

1000Genomes Tabix vcf-subset bgzip • 3.0k views

I've tried to do this before (using bcftools) and eventually gave up because I could find no way to keep the connection open.

3.6 years ago
Mensur Dlakic ★ 28k

It sounds like an incomplete download. I suggest you download the file and run the command on the local file name instead of the URL.
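As a rough sketch of that download-first approach, something like the following could work. The function name fetch_and_subset and its variable names are made up for illustration; it assumes wget, gzip and bcftools are on the PATH. wget -c resumes a partial download rather than restarting it, and gzip -t is used here only as a cheap integrity check (a BGZF file is a valid gzip stream, so a file cut off mid-block is reported, though a file missing only its final EOF block may not be).

```shell
# fetch_and_subset URL SAMPLES_FILE OUTPUT
# Hypothetical helper: download the whole file, check it, then subset locally.
fetch_and_subset() {
    fs_url=$1
    fs_samples=$2
    fs_out=$3
    # wget saves under the last path component of the URL.
    fs_local=$(basename "$fs_url")

    # -c resumes a partially downloaded file instead of starting over.
    wget -c "$fs_url" || return 1

    # Fail early if the compressed stream is cut off mid-block.
    gzip -t "$fs_local" || return 1

    # Keep only the samples listed in SAMPLES_FILE; write bgzipped VCF.
    bcftools view -Oz -S "$fs_samples" -o "$fs_out" "$fs_local"
}
```

It would be called with the remote URL, the sample list (Sample_1kgp.txt in the question) and the desired output name. The full chromosome file still has to fit on disk, but it can be deleted once the subset is written.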


I see. I wanted to avoid downloading the whole chromosome because of space constraints, which is why I used Tabix to download only the portion (in this case the populations) I wanted...

But as you said, I ended up downloading the full .vcf.gz files and using Tabix afterwards...

3.6 years ago

Your command streams all of the chr1 data through tabix and pipes it to vcf-subset and then to bgzip, which is not very efficient.

I would suggest doing it all at once with bcftools:

bcftools view -Oz -S Sample_1kgp.txt -o CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased_out.vcf.gz http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased.vcf.gz chr1

That said, I tried it myself with the first 2 samples (HG00096 and HG00097) and got a similar error message the first time I ran it (it finished successfully in ~30 minutes when I tried again):

[E::bgzf_read_block] Failed to read BGZF block data at offset 26325506 expected 3300 bytes; hread returned 2888
[E::vcf_parse_format] Couldn't read GT data: value not a number or '.' at chr1:2195994
Error: VCF parse error

I would rule out a RAM problem, because bcftools works fine with a minimal memory footprint, and I agree that the problem must lie in the data retrieval. Querying such large files may be demanding for the server, and the server may not have been able to respond properly. In fact, a simple sample name query such as bcftools query -l remotefile failed when I first ran it, but worked when I ran it again a few seconds later.
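Since rerunning the same command was enough in my tests, one option is to wrap it in a small retry loop. The sketch below is only an illustration: retry is a hypothetical helper (not part of bcftools or htslib) that reruns any command until it exits successfully or a maximum number of attempts is reached.

```shell
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: rerun COMMAND until it succeeds or MAX_TRIES is reached.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_attempt=1
    while :; do
        # Run the command; stop as soon as it exits with status 0.
        "$@" && return 0
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            echo "giving up after $retry_attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $retry_attempt failed; retrying in ${retry_delay}s" >&2
        sleep "$retry_delay"
        retry_attempt=$((retry_attempt + 1))
    done
}

# Example (not run here): up to 5 attempts, 30 seconds apart.
# retry 5 30 bcftools query -l remotefile
```

Each attempt starts from scratch, so with bcftools view -o the partial output of a failed attempt is simply overwritten by the next one. This relies on the command exiting with a non-zero status when the transfer is truncated.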
