Question

Use tabix with a list of regions

5

Entering edit mode

12.5 years ago

win ▴ 990

Hi all,

I need to use tabix for a project and the following is the syntax that is used, for e.g.

tabix -fhftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz 12:2345295-2345295 > genotypes.vcf

The question I have is if i have a lot of regions in the same chromosome, is there a way that I can supply a file with the regions as input, what would the syntax be? In other words can I just replace the chromosome number and coordinate with a text file name with each region on a new line?

Thanks
Ashwin

tabix • 20k views

ADD COMMENT • link updated 4.5 years ago by mg ▴ 290 • written 12.5 years ago by win ▴ 990

score 9 · Answer 1 · 2012-07-12

9

Entering edit mode

12.5 years ago

JC 13k

Put your regions in a text file (one per line) and stream to tabix:

xargs -a regions.txt -I {} tabix -fh  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz {} >> genotypes.vcf

ADD COMMENT • link 12.5 years ago by JC 13k

0

Entering edit mode

i just realized another issue, the vcf files i have are annotated with rs#, whereas the above approach will search using chromosomal coordinates, i wonder is there was a way to search by variant name using the above approach.

ADD REPLY • link 12.5 years ago by win ▴ 990

0

Entering edit mode

you can convert the rs# to coordinates in dbSNP

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

hi, i just tried to run in this command in a command window in ubuntu 12 and i get a message

[kftpconnectfile] 350 Restarting at 0. Send STORE or RETRIEVE to initiate transfer [main] fail to open the data file.

the command i typed was

xargs -a Chr12.txt -I {} tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz {} >> genotypes.vcf

any ideas?

ADD REPLY • link 12.5 years ago by win ▴ 990

0

Entering edit mode

yes it does, remove that damn < > (I removed them from the response)

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

hi jc, all of your input has certainly helped, but i had another question. When i run the aboe command i get a header line for each of the variant that was found. is it not possible to have just one header line accross the entire VCF file?

ADD REPLY • link 12.5 years ago by win ▴ 990

0

Entering edit mode

you can remove the header removing the -h flag, but that removes all the headers, to obtain just one header, first generate the output file with a null base:

# generating the header
tabix -fh  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz chr1:1 > genotypes.vcf
# then you can call all your positions
xargs -a regions.txt -I {} tabix -f  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz {} >> genotypes.vcf

ADD REPLY • link 12.4 years ago by JC 13k

Ram · Answer 2 · 2014-11-10

6

Entering edit mode

10.1 years ago

Leonor Palmeira 3.9k

Tabix has a -B option which allows you to input a .bed file of regions of interest:

tabix -fhB ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz regions.bed > genotypes.vcf

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 10.1 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

I'm trying to get tabix to read in a .bed file as you say, so far without success. What should the format of the .bed lines be? I have tried multiple formats, including this specification at UCSC as well as the format suggested in an answer to similar Biostars question. Also, what about the comment by Rm on that answer, claiming that the -B option has been dropped from tabix? Is that out of date?

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.1 years ago by Greg P ▴ 70

0

Entering edit mode

ask a new question for your data please + show us a few lines of your bed.

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I apologize, my only problem is that I was not careful in reading what people write. I thought that the .bed filename was part of the -B option. Of course, that's not how it works. The -B option just tells tabix to look for a bed file instead of 'region1'.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.1 years ago by Greg P ▴ 70

0

Entering edit mode

@Greg P: I think Rm is talking about the outdated manual available here, I haven't seen that the -B option has been dropped (wouldn't make much sense to me either).

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.1 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

-B has been dropped, and since version 1.2, tabix has -R option instead. See my answer for more info.

ADD REPLY • link 4.5 years ago by mg ▴ 290

Ram · Answer 3 · 2012-07-12

4

Entering edit mode

12.5 years ago

cl.parallel ▴ 140

Yet another way would be a BASH script as:

while read line;do
    tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz $line >> genotypes.vcf
done < coord_file

but I like the simplicity of the use of xargs in JC's answer otherwise.

ADD COMMENT • link 12.5 years ago by cl.parallel ▴ 140

0

Entering edit mode

thanks! I see you're in Seattle, where? I'm at ISB

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

UW Microbiology

ADD REPLY • link 12.5 years ago by cl.parallel ▴ 140

0

Entering edit mode

well no so far, maybe we can chat sometime

ADD REPLY • link 12.4 years ago by JC 13k

0

Entering edit mode

Hi,

Thanks for this. I suppose this will generate as many files as the coord_file has right? Please let me know if I'm wrong. I'm trying to use tabix to extract information from vcf files for the coordinates I want, based on a text file with all my regions. I need to have as many files as my region file has. e.g if I have 100 regions, I will have 100 files as well. So in the example above, I will have 100 genotypes.$line.vcf.

Thanks

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by SGMS ▴ 130

Ram · Answer 4 · 2012-07-12

1

Entering edit mode

12.5 years ago

Pierre Lindenbaum 164k

I would rather curl the vcf and pipe the output with bedtools intersect

something like (I cannot check my command I'm away from my lab):

curl -s "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz"  | 
gunzip -c |
intersectBed -a stdin  -u -b list.bed

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 12.5 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I think it's better to use tabix. Curl forces you to download the whole file locally; while tabix can take advantage of the tabix indexes (which are already in the 1000genomes ftp), and download only the regions of interest.

ADD REPLY • link 12.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

I don't think there is any magic on the server side. Tabix download the data and he knows how many bits he should 'skip' and when he should 'stop' scanning. But if all your regions are at the bottom (e.g chrY), it will have to fetch everything each time.

ADD REPLY • link 12.5 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

The magic is in the .tbi files. These contains indexes for positions in the vcf files, so tabix doesn't have to download the whole file and scan it. The 1000genomes website already contains some .tbi files, which are detected automatically when you call the tool. Look at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

ADD REPLY • link 12.5 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Giovanni, I agree. But once tabix has downloaded the index, it must download the data itself from the server until it get the right offset.

ADD REPLY • link 12.5 years ago by Pierre Lindenbaum 164k

score 1 · Answer 5 · 2012-07-13

This is a python script that I wrote as a wrapper to call tabix on a list of gene coordinates:

Tabix 1000genomes wrapper

However, the xargs solution is more elegant. The only advantage of my code is that it doesn't download files if they are already present in the directory, so you can interrupt the download and continue it later, or add genes to the list after downloading them all.

score 1 · Answer 6 · 2020-06-18

Since version 1.2, tabix uses -R option to accept file with regions of interest. From tabix's commandline help message:

-R, --regions FILE restrict to regions listed in the file

And tabix's doc has more info on how this file should look like:

-R, --regions FILE.      Restrict to regions listed in the FILE. The FILE can be BED file 
      (requires .bed, .bed.gz, .bed.bgz file name extension) or a TAB-delimited file with 
      CHROM, POS, and, optionally, POS_TO columns, where positions are 1-based and inclusive. 
      When this option is in use, the input file may not be sorted.

Note that, unlike typical bed file (where START is zero-based), both POS and POS_TO are one-based.