Hi community, I am having trouble subsetting VCFs files based on positions (BED). I've read the following posts:
- Extract Sub-Set Of Regions From Vcf File
Retrieve Subset Positions Vcf File
which have been helpful but I am still not able to correctly obtain the regions I am looking for. I have the feeling I am making a mistake somewhere but I cannot see where. I am detailing the steps I follow here:
I first download a sample VCF file. As suggested in previous questions I've asked here, I am using the genome in a bottle sample genomes (NA12878).
wget ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/SupplementaryFiles/HG001_GRCh38_1_22_v4.2.1_all.vcf.gz
I unzip and inspect this file and I can see it is chr1 based.
Next, I have my file with rsid variants(100k-200k variants). As I only have their name, I get their positions in the hg38 genome using g:Profiler SNPense. With this tool, I am able to retrieve the positions in the latest hg38 Ensembl genome build. Since these coordinates are Ensembl-like, I add a chr before the chromosome number and an end position. I end up with a file looking like this:
chr1 103658487 103658487 rs1553200208
chr1 103658557 103658557 rs1553200209
chr1 103660358 103660358 rs756900103
chr1 103660373 103660373 rs1053446789
chr1 103660374 103660374 rs780631463
I was not sure if I had to use 0-based or 1-based bed files, so I basically made both. So, I have another file looking like this:
chr1 103658487 103658488 rs1553200208
chr1 103658557 103658558 rs1553200209
chr1 103660358 103660359 rs756900103
chr1 103660373 103660374 rs1053446789
chr1 103660374 103660375 rs780631463
Next, I used the following tools: bcftools
, vcftools
, bedtools
and tabix
to try to subset the original genomes and obtain my candidate variants. All the tools just retrieved only a few hundred regions... which I am especially doubtful about. Here is my code for each of them:
#Tabix: I first make a tbi file and then use the tool to subset my list of variants
tabix -p vcf HG001_GRCh38_1_22_v4.2.1_all.vcf.gz
tabix -R my_bed_regions.txt HG001_GRCh38_1_22_v4.2.1_all.vcf.gz > tabix_output.vcf
#Bcftools
bcftools filter -R my_bed_regions.txt HG001_GRCh38_1_22_v4.2.1_all.vcf.gz > bcftools_output.vcf
#bedtools
bedtools intersect -a HG001_GRCh38_1_22_v4.2.1_all.vcf.gz -b my_bed_regions.txt -header > bedtools_output.vcf
I tried each of the tools with 1-based and 0-based bed files. None retrieves many regions, just a few (between 100-300).
It's my first time working with these types of files but I do not think that a thousand variants are not found in a genome containing 3 × 10^6 variants. For instance, these 5 variants I am adding here as an example are not found. So I think I have a problem somewhere but I cannot figure it out. Thank you in advance for your insights.
But OP said this:
Yes, I tried with both kinds of files. The result is the same. Only a few handful regions are retrieved from the original VCF. I don't know what the problem could be. I even tried with a different input VCF (the Ashkenazi son and a Human Personal Genome) and using samples that were either hg38 or hg19 genomes.
Perhaps that is all there is in terms of overlap? Do you expect more?
Honestly, I would expect more even just by "chance". I find it strange that none of the three genomes retrieve many regions.
Am I expecting too much for a whole exome? I mean, I would expect to find these 100.000 variants to be there... are they not sequenced? How do people find variants of interest in a genome?
well if I peek a random rs# in your list rs780631463 https://gnomad.broadinstitute.org/variant/1-104202996-G-A?dataset=gnomad_r2_1 : is just too rare to be present in gnomad / genome.
please define "variants of interest".
Oh, thank you for that observation. I haven't thought of doing that. The variants of interest are just a bunch of rsid I received from someone else and was told to find them in a sample whole-exome. I think the selection was made just by retrieving overlapping rsid to a subset of genes of interest. Like, they had candidate genes, I think they retrieved all coding dbSNPs and these are their candidate "variants of interest".
I am learning along the way as well because I've never worked with these type of data before, but now it makes sense that variant callers would only retrieve variant-sites and not same-as-reference-genome sites, right?
Perhaps these sites at dbSNPs are not that variable? That's why nothing is retrieved from the variant-callers? Am I understanding this correctly?
yes
Just wanted to make the time to thank you all for helping me debug/rethink this. I was so focused on the scripting I didn't thought about the most basic stuff :)
Do not delete posts that have received feedback. Interact with the people that have invested effort into helping you, and work towards providing closure to your post.