How to do SNP selection in large VCF/BCF/GenotypeTables in R like you can using BCFTools,CYVCF2 (fast python VCF/BCF parser) or Excel?
Support for at least the following type of filter criteria is required:
- Filtering on Region of Interest
- Filtering on Samples of interest
- Filtering on Variant Quality (QUAL)
- Filtering on Genotype call rate (INFO/AN / (number of samples*2))
- Number of alternative alleles (INFO/AC)
- Filtering on alternative allele frequency (INFO/AF)
- Samples required to HOM_REF,HET,HOM_ALT
- Samples to have contrasting genotypes
Preferred supported filtering criteria:
- Filtering on sample specific or minimal/maximum/average genotype depth
- Filtering on sample specific or minimal/maximum/average genotype quality
- Distance to nearest neighbor variants upstream and downstream
Depending on filter support for genotype depth and genotype quality these VCF/BCF/R specific format files can be quite big.
So compact (small data size) R specific format and multi-threaded filtering might also be useful/required when loading everything to memory on a big machine.
Or streaming parsing of the VCF/BCF file (like BCFtools and CYVCF2 do) .
What would be the best way and library to do SNP selection using the above filter criteria in R?
Is there maybe something like CYVCF2 (python wrapper around HTSLIB VCF parser) for R?
Or maybe a BCFTools like something in R?
Or some other good R library for doing this?
The targeted downstream users are R(studio) users who would like to have this functionality in the environment that they are used to and already use for related tasks.
The VCF can optionally be (pre) converted to TAB delimited format with bcftools query. Or to any specific R format.
A high level / coarse filter on region of interest and samples of interest filter might already have been done. It's often (but not always) more about iterative / dynamic fine grained filtering final selection of SNPs.
Why do you need to use R? I would never recommend anyone to do VCF filtering in R, or Excel.
Targeted downstream users are R(studio) users who would like to have this functionality in the environment that they are used to and already use for related tasks.
Okay, I don't know of any tools because it's not something about which I have thought. Maybe others have ideas.
From my perspective, you could use system calls from within R, i.e., system calls that call BCFtools, etc., and then wrap these as functions for your users. This assumes that they use R Server or R on a linux cluster.
I'm sure that Pierre has some ideas here, when he next checks in
Also remember that a VCF is essentiall just a tab-separated file, apart from the header. Parsing it should be easy in R, but I just worry about speed.
the VCF can optionally be converted to TAB delimited format with bcftools query. Or to any specific R format.
Should that not suffice, in that case? You may look up the MAF variant format, in that case, just to stick to convention. The MAF format was developed in light of the TCGA deluge of data, as far as I know.
I would always add a unique identifier for each variant, too, consisting of
sample:chr:pos:ref:alt
(i.e. used as rownames in a R data-frame)Just load a TAB file as a data frame is certainly an option. I was hoping there would be a R library that would support:
Sometimes the input VCF already has been filtered for regions and samples of interest. So it's now always very big (i.e. could be opened in Excel if you really wanted to) , but sometimes it still is very big (more suited for streaming processing with bcftools/cyvcf2).
Well, I worked with a colleague in California who always opened my VCFs in Excel, but therefore I had to limit the variants to < ~1,000,000 due to limits in Excel. Although frowned upon, it's not always possible to keep everything within the coding environment, nor is it practical.
see the previous question on biostars and the examples for my tool vcffilterjdk : http://lindenb.github.io/jvarkit/VcfFilterJdk.html
I was looking for something R specific Pierre. VcfFilterJdk is as R specific as writing an CYVCF2 python script and then executing that via R system2 call to the CYVCF2 script.