I am trying to find a way to extract an arbitrary region of human genome from the 1000 genomes project's VCF files without having to download the genome or individual chromosome files up front, and ideally all from within R.
The closest functionality to what I have in mind is provided by the Ensembl DataSlicer (http://mart.ensembl.org/Homo_sapiens/Tools/DataSlicer?db=core), which is an online form. I would like to reproduce this functionality from within R, e.g. to programatically submit a bunch of regions,
and receive 3 VCF files with full 1kgp data (all populations, all SNPs) for these regions. Is this possible? As far as I can tell, I cannot do this without downloading the chromosome-level VCF files first (or the whole genome). I would love to be wrong here. Any ideas would be appreciated!
Thanks @Pierre this is great to know, although it doesn't solve my issue of this being R-self-contained. rsamtools doesn't seem to carry functions equivalent to bcftoolsview, but if it comes to that, I can sall it from shell. But I'd like to avoid that.
Yes, but that implies the users have samtools installed in their system. My users cannot do that, they don't use shell and don't have admin access to their computers...
We have the 3202 samples from the 1kg DRAGEN called high coverage gVCFs on TileDB. If you sign up there we can get you some credits and assistance with your queries.
import tiledb
import tiledbvcf
cloud_uri = "tiledb://TileDB-Inc/gvcf-1kg-dragen-v376"
# This is the config I need to pass
cloud_config = tiledb.Config()
cloud_cfg = tiledbvcf.ReadConfig(tiledb_config=cloud_config)
# Open the VCF dataset in read mode passing the config
ds = tiledbvcf.Dataset(uri=cloud_uri, mode="r", cfg=cloud_cfg)
# read and return Pandas dataframe
df = ds.read(
regions = ["chr1:6543-776543","chr4:987654-100014798","chr22:3497698-5439932"],
attrs = ["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df
TileDB-VCF is python-based but you can easily write a user-defined function that can return an Arrow table which is readable in R. We do this all the time to develop RShiny variant browsers and knitr reports. You don't need to generate VCF as an intermediate.
Thanks @Pierre this is great to know, although it doesn't solve my issue of this being R-self-contained. rsamtools doesn't seem to carry functions equivalent to bcftoolsview, but if it comes to that, I can sall it from shell. But I'd like to avoid that.
run a
system()
command from R ?Yes, but that implies the users have samtools installed in their system. My users cannot do that, they don't use shell and don't have admin access to their computers...
You're asking a commandline streaming filter utility to act as a multi-tenant database. That won't go smoothly.