How to query 1000 genomes project VCF files for specific regions without downloading whole chromosomes first?
3
1
Entering edit mode
10 months ago
kilojarek ▴ 10

Hi,

I am trying to find a way to extract an arbitrary region of human genome from the 1000 genomes project's VCF files without having to download the genome or individual chromosome files up front, and ideally all from within R.

The closest functionality to what I have in mind is provided by the Ensembl DataSlicer (http://mart.ensembl.org/Homo_sapiens/Tools/DataSlicer?db=core), which is an online form. I would like to reproduce this functionality from within R, e.g. to programatically submit a bunch of regions,

1:6543-776543
4:987654-100014798
22:3497698-5439932

and receive 3 VCF files with full 1kgp data (all populations, all SNPs) for these regions. Is this possible? As far as I can tell, I cannot do this without downloading the chromosome-level VCF files first (or the whole genome). I would love to be wrong here. Any ideas would be appreciated!

Best
kJ

r ensembl dataslicer vcf 1000genomes • 2.4k views
ADD COMMENT
3
Entering edit mode
10 months ago

bcftools works with remote indexed VCF

bcftools view  "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.recalibrated_variants.vcf.gz"  "chr1:10474-10480"
ADD COMMENT
0
Entering edit mode

Thanks @Pierre this is great to know, although it doesn't solve my issue of this being R-self-contained. rsamtools doesn't seem to carry functions equivalent to bcftoolsview, but if it comes to that, I can sall it from shell. But I'd like to avoid that.

ADD REPLY
1
Entering edit mode

run a system() command from R ?

ADD REPLY
0
Entering edit mode

Yes, but that implies the users have samtools installed in their system. My users cannot do that, they don't use shell and don't have admin access to their computers...

ADD REPLY
0
Entering edit mode

You're asking a commandline streaming filter utility to act as a multi-tenant database. That won't go smoothly.

ADD REPLY
2
Entering edit mode
10 months ago
cmdcolin ★ 3.9k

cross referenced from reddit post

possible "pure R"/Rsamtools based approach

library(Rsamtools)
tbx1 <- TabixFile("https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
tbx2 <- TabixFile("https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr4.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
tbx3 <- TabixFile("https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")

param1 <- GRanges(
    seqnames = "1",
    ranges = IRanges(6543, end = 776543),
)

param2 <- GRanges(
    seqnames = "4",
    ranges = IRanges(987654, end = 100014798),
)

param3 <- GRanges(
    seqnames = "22",
    ranges = IRanges(3497698, end = 5439932),
)
res1 <- scanTabix(tbx1, param=param1)
res2 <- scanTabix(tbx2, param=param2)
res3 <- scanTabix(tbx3, param=param3)
ADD COMMENT
1
Entering edit mode
10 months ago

We have the 3202 samples from the 1kg DRAGEN called high coverage gVCFs on TileDB. If you sign up there we can get you some credits and assistance with your queries.

https://tiledb.com/blog/tiledb-101-population-genomics

import tiledb
import tiledbvcf
cloud_uri = "tiledb://TileDB-Inc/gvcf-1kg-dragen-v376"

# This is the config I need to pass
cloud_config = tiledb.Config()
cloud_cfg = tiledbvcf.ReadConfig(tiledb_config=cloud_config)

# Open the VCF dataset in read mode passing the config
ds = tiledbvcf.Dataset(uri=cloud_uri, mode="r", cfg=cloud_cfg)

# read and return Pandas dataframe
df = ds.read(
    regions = ["chr1:6543-776543","chr4:987654-100014798","chr22:3497698-5439932"],
    attrs   = ["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

TileDB-VCF is python-based but you can easily write a user-defined function that can return an Arrow table which is readable in R. We do this all the time to develop RShiny variant browsers and knitr reports. You don't need to generate VCF as an intermediate.

Feel free to reach out to me directly.

ADD COMMENT

Login before adding your answer.

Traffic: 2333 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6