Hello
Simple question ( I hope!)
I'm using this VariantAnnotation package from bioconductor to work with SNPs.
myparam = ScanVcfParam( # this function is a fast positional filter
samples="3441_pseudo_gatk",
which= GRanges(chr[14], # This is a chromossome name
ranges=IRanges( # these are custom filters
start= 10,
end=10000)))
VCF <- readVcf(TabixFile("my_potato.vcf.gz"), "Potato", param=myparam)
VCF is a "Formal class CollapsedVCF". It's very complex and there are extensive instructions, but I cannot find the answer to the most simple thing of them all:
How do I parse the raw data of VCF in R?
I know that if I write it out with:
writeVcf(obj = VCF, filename = "tmp.vcf")
I get exactly what I want in a file in my working directory. However, I would like parsing all lines in R as a vector to filter, as I already wrote a filter that gives index positions.
All of this because I don't like the filterVcf() function. It doesn't take advantage of the fact that VCF is already filtered by positions and is thus smaller than the original "my_potato.vcf.gz".
EDIT : Actually, I got a detail wrong. FilterVcf() can indeed use the positional filter myparam with param=myparam.
Thank you for the attention!
Ricardo
What do you mean by "parsing all lines in R as a vector to filter, as I already wrote a filter that gives index positions."... You can pretty easily convert all the VCF information to data.frames but getting it back into VCF format is not as easy.
What exactly do you want to extract? and what is your filter?
I have a filtering function that outputs desired indexes (based on, for example, the value of info(VCF)$MQ ) but I can't subscript VCF with indexes because it's not a dataframe or a vector. That is why I thought of parsing the raw lines as elements of a vector. I can just concatenate them to the header to create a new vcf file.
I would simply like to filter out certain lines of the vcf. For example, to filter out lines 1,5,7:
A raw line should look like this:
Check out bcftools, especially
bcftools view
.Much easier to use and faster than R. Just remember to be on top of alt-alleles, or to avoid confusion, split multi-allelic entries to biallelic entries using
bcftools norm
.The format that the package parses a VCF to is insanely confusing! Why even use R to work with huge VCF files?
I know, right?? I initially did a python script and it was much easier, besides having better performance. But I need to deliver an R filter to a client that does not know (or even have installed) python...
The client wants the filter in R specifically? You can always
system("bcftools view")
:-DThank you for the suggestion. I already had the answer I was looking for. The client wants an R script to run in windows, hehe.