New to bioinformatics here.
I would like to pull a VCF from TCGA and have a curl command to do so. I don't want the whole VCF file, but rather a specified region. Is there a way to pipe output from curl to tabix without having curl download the whole VCF file locally?
My current command is as follows:
module load google-cloud-sdk; curl --header "X-Auth-Token:$token" "https://api.gdc.cancer.gov/data/${file_id}" | tabix -h chr1:XXXXX-XXXXX > /destination/${file_id}.sliced.vcf
What might complicate this is HTTPS support (which tabix may now partly have) and support for authentication tokens or other custom HTTP headers (which it might not). I would be interested to know the current status of that support; I am only noting that these might be issues.
It might be possible to use Node.js to set up an HTTP-to-HTTPS proxy service. This is basically a local HTTP server that points to the remote HTTPS service running on api.gdc.cancer.gov. Requests to the local HTTP service are unauthenticated, but the proxy adds the custom authentication token header when it forwards them.
In other words, one would then run tabix against the VCF file "hosted" on that local, unauthenticated HTTP proxy, e.g., http://localhost/${file_id}. On the proxy side, http://localhost/${file_id} gets swapped out for https://api.gdc.cancer.gov/data/${file_id}.
Any request to the proxy in turn becomes an authenticated request for data from the original HTTPS service; as far as tabix is concerned, it is just talking to an HTTP server. This includes the request for the index file, which tabix then uses to request byte ranges from the original bgzip file, so the proxy would need to be configured to pass along any byte-range headers that tabix puts into its requests.
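To make the proxy idea concrete, here is a minimal sketch using only Node.js built-in modules. It is untested against the GDC service and rests on several assumptions: the environment variable names GDC_TOKEN and PROXY_PORT, the default port 8080, and the function names are all made up for illustration. It also maps every path the same way, so a request for an index file is forwarded just like a request for the data file; whether the upstream service answers such a request at that path is something you would have to check.

```javascript
// Sketch of a local HTTP -> HTTPS proxy that adds an auth header.
// Assumptions (not from the original post): GDC_TOKEN, PROXY_PORT, port 8080.
const http = require('http');
const https = require('https');

const UPSTREAM_HOST = 'api.gdc.cancer.gov'; // remote HTTPS service from the question
const UPSTREAM_PREFIX = '/data';            // /<file_id> is rewritten to /data/<file_id>

// Build the options for the upstream HTTPS request from an incoming request.
function upstreamOptions(req, token) {
  const headers = { 'X-Auth-Token': token };
  // Pass the byte-range header through so partial reads reach the upstream.
  if (req.headers.range) headers.Range = req.headers.range;
  return {
    host: UPSTREAM_HOST,
    path: UPSTREAM_PREFIX + req.url,
    method: req.method,
    headers,
  };
}

// Create (but do not start) the unauthenticated local HTTP server.
function createProxy(token) {
  return http.createServer((req, res) => {
    const upstream = https.request(upstreamOptions(req, token), (up) => {
      // Relay the upstream status (e.g. 200 or 206) and headers, then stream the body.
      res.writeHead(up.statusCode, up.headers);
      up.pipe(res);
    });
    upstream.on('error', (err) => {
      if (res.headersSent) return res.destroy(err);
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('proxy error: ' + err.message + '\n');
    });
    // Forward any request body; for GET this simply ends the upstream request.
    req.pipe(upstream);
  });
}

// Start listening only when a token is supplied via the (hypothetical) GDC_TOKEN variable.
if (process.env.GDC_TOKEN) {
  const port = Number(process.env.PROXY_PORT) || 8080;
  createProxy(process.env.GDC_TOKEN).listen(port, '127.0.0.1', () => {
    console.log('proxy listening on http://127.0.0.1:' + port);
  });
}
```

If this were saved as proxy.js (a hypothetical file name) and started with GDC_TOKEN set to your token, you would then point tabix at the local URL instead of piping from curl, something like: tabix -h http://localhost:8080/${file_id} chr1:XXXXX-XXXXX. Binding to 127.0.0.1 matters here, since anyone who can reach the proxy gets authenticated access without knowing the token.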