Question

Errors when subsetting VCFs by BEDs

0

Entering edit mode

2.8 years ago

louisber2765 • 0

Hi community, I am having trouble subsetting VCFs files based on positions (BED). I've read the following posts:

Extract Sub-Set Of Regions From Vcf File
Retrieve Subset Positions Vcf File

which have been helpful but I am still not able to correctly obtain the regions I am looking for. I have the feeling I am making a mistake somewhere but I cannot see where. I am detailing the steps I follow here:

I first download a sample VCF file. As suggested in previous questions I've asked here, I am using the genome in a bottle sample genomes (NA12878).

wget ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/SupplementaryFiles/HG001_GRCh38_1_22_v4.2.1_all.vcf.gz

I unzip and inspect this file and I can see it is chr1 based.

Next, I have my file with rsid variants(100k-200k variants). As I only have their name, I get their positions in the hg38 genome using g:Profiler SNPense. With this tool, I am able to retrieve the positions in the latest hg38 Ensembl genome build. Since these coordinates are Ensembl-like, I add a chr before the chromosome number and an end position. I end up with a file looking like this:

chr1    103658487   103658487   rs1553200208
chr1    103658557   103658557   rs1553200209
chr1    103660358   103660358   rs756900103
chr1    103660373   103660373   rs1053446789
chr1    103660374   103660374   rs780631463

I was not sure if I had to use 0-based or 1-based bed files, so I basically made both. So, I have another file looking like this:

chr1    103658487   103658488   rs1553200208
chr1    103658557   103658558   rs1553200209
chr1    103660358   103660359   rs756900103
chr1    103660373   103660374   rs1053446789
chr1    103660374   103660375   rs780631463

Next, I used the following tools: bcftools, vcftools, bedtools and tabix to try to subset the original genomes and obtain my candidate variants. All the tools just retrieved only a few hundred regions... which I am especially doubtful about. Here is my code for each of them:

#Tabix: I first make a tbi file and then use the tool to subset my list of variants
tabix -p vcf HG001_GRCh38_1_22_v4.2.1_all.vcf.gz
tabix -R my_bed_regions.txt HG001_GRCh38_1_22_v4.2.1_all.vcf.gz > tabix_output.vcf

#Bcftools
bcftools filter -R my_bed_regions.txt HG001_GRCh38_1_22_v4.2.1_all.vcf.gz > bcftools_output.vcf

#bedtools
bedtools intersect -a HG001_GRCh38_1_22_v4.2.1_all.vcf.gz  -b my_bed_regions.txt -header > bedtools_output.vcf

I tried each of the tools with 1-based and 0-based bed files. None retrieves many regions, just a few (between 100-300).

It's my first time working with these types of files but I do not think that a thousand variants are not found in a genome containing 3 × 10^6 variants. For instance, these 5 variants I am adding here as an example are not found. So I think I have a problem somewhere but I cannot figure it out. Thank you in advance for your insights.

vcf vcftools bcftools bedtools • 1.5k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 2.8 years ago by louisber2765 • 0

score 0 · Answer 1 · 2022-02-04

0

Entering edit mode

2.8 years ago

Pierre Lindenbaum 164k

I was not sure if I had to use 0-based or 1-based bed files

a bed is always 0-based. but looking at

chr1    103660374   103660374   rs780631463

it look likes the file you're using is 1-based.

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg38 -e 'select chrom,chromStart from snp151 where name="rs780631463"'
+-------+------------+
| chrom | chromStart |
+-------+------------+
| chr1  |  103660373 |
+-------+------------+

so the right bed should be

chr1    103660373   103660374   rs780631463

ADD COMMENT • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

But OP said this:

I tried each of the tools with 1-based and 0-based bed files. None retrieves many regions, just a few (between 100-300).

ADD REPLY • link 2.8 years ago by GenoMax 147k

0

Entering edit mode

Yes, I tried with both kinds of files. The result is the same. Only a few handful regions are retrieved from the original VCF. I don't know what the problem could be. I even tried with a different input VCF (the Ashkenazi son and a Human Personal Genome) and using samples that were either hg38 or hg19 genomes.

ADD REPLY • link 2.8 years ago by louisber2765 • 0

0

Entering edit mode

Perhaps that is all there is in terms of overlap? Do you expect more?

ADD REPLY • link 2.8 years ago by GenoMax 147k

0

Entering edit mode

Honestly, I would expect more even just by "chance". I find it strange that none of the three genomes retrieve many regions.

Am I expecting too much for a whole exome? I mean, I would expect to find these 100.000 variants to be there... are they not sequenced? How do people find variants of interest in a genome?

ADD REPLY • link 2.8 years ago by louisber2765 • 0

0

Entering edit mode

well if I peek a random rs# in your list rs780631463 https://gnomad.broadinstitute.org/variant/1-104202996-G-A?dataset=gnomad_r2_1 : is just too rare to be present in gnomad / genome.

How do people find variants of interest in a genome?

please define "variants of interest".

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Oh, thank you for that observation. I haven't thought of doing that. The variants of interest are just a bunch of rsid I received from someone else and was told to find them in a sample whole-exome. I think the selection was made just by retrieving overlapping rsid to a subset of genes of interest. Like, they had candidate genes, I think they retrieved all coding dbSNPs and these are their candidate "variants of interest".

I am learning along the way as well because I've never worked with these type of data before, but now it makes sense that variant callers would only retrieve variant-sites and not same-as-reference-genome sites, right?

Perhaps these sites at dbSNPs are not that variable? That's why nothing is retrieved from the variant-callers? Am I understanding this correctly?

ADD REPLY • link 2.8 years ago by louisber2765 • 0

0

Entering edit mode

but now it makes sense that variant callers would only retrieve variant-sites and not same-as-reference-genome sites, right?

yes

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Just wanted to make the time to thank you all for helping me debug/rethink this. I was so focused on the scripting I didn't thought about the most basic stuff :)

ADD REPLY • link 2.8 years ago by louisber2765 • 0

0

Entering edit mode

Do not delete posts that have received feedback. Interact with the people that have invested effort into helping you, and work towards providing closure to your post.

ADD REPLY • link 2.8 years ago by Ram 44k