Parse raw vcf in R from CollapsedVCF (VariantAnnotation library)
1
1
Entering edit mode
5.8 years ago

Hello

Simple question ( I hope!)

I'm using this VariantAnnotation package from bioconductor to work with SNPs.

myparam = ScanVcfParam(               # this function is a fast positional filter
    samples="3441_pseudo_gatk", 
    which= GRanges(chr[14],          # This is a chromossome name
                   ranges=IRanges(   # these are custom filters
                       start= 10,
                       end=10000)))

    VCF <- readVcf(TabixFile("my_potato.vcf.gz"), "Potato", param=myparam)

VCF is a "Formal class CollapsedVCF". It's very complex and there are extensive instructions, but I cannot find the answer to the most simple thing of them all:

How do I parse the raw data of VCF in R?

I know that if I write it out with:

writeVcf(obj = VCF, filename =  "tmp.vcf")

I get exactly what I want in a file in my working directory. However, I would like parsing all lines in R as a vector to filter, as I already wrote a filter that gives index positions.

All of this because I don't like the filterVcf() function. It doesn't take advantage of the fact that VCF is already filtered by positions and is thus smaller than the original "my_potato.vcf.gz".

EDIT : Actually, I got a detail wrong. FilterVcf() can indeed use the positional filter myparam with param=myparam.

Thank you for the attention!
Ricardo

VariantAnnotation bioconductor vcf • 2.8k views
ADD COMMENT
0
Entering edit mode

What do you mean by "parsing all lines in R as a vector to filter, as I already wrote a filter that gives index positions."... You can pretty easily convert all the VCF information to data.frames but getting it back into VCF format is not as easy.

What exactly do you want to extract? and what is your filter?

ADD REPLY
0
Entering edit mode

I have a filtering function that outputs desired indexes (based on, for example, the value of info(VCF)$MQ ) but I can't subscript VCF with indexes because it's not a dataframe or a vector. That is why I thought of parsing the raw lines as elements of a vector. I can just concatenate them to the header to create a new vcf file.

I would simply like to filter out certain lines of the vcf. For example, to filter out lines 1,5,7:

index <- c(1,5,7)

vector[-index]  # this would be a vector on which each element is a raw line of the vcf file `

A raw line should look like this:

ChrUn|ref0000006|ref0000012|ref0000001|ref0000010   24754   .   T   A94.77  10X_ALLELE_FRACTION_FILTER  AC=1;AF=0.5;AN=2;BaseQRankSum=2.155;ClippingRankSum=-1.643;DP=276;FS=2.9;MLEAC=1;MLEAF=0.5;MQ=46.59;MQRankSum=-0.851;QD=0.34;ReadPosRankSum=0.077;SOR=0.782;MUMAP_REF=33.0841;MUMAP_ALT=37.2581;AO=23;RO=259;MMD=2.51716;RESCUED=4;NOT_RESCUED=455;HAPLOCALLED=0    GT:AD:DP:GQ:PL:BX:PS    0/1:234,22:256:99:123,0,10080:GAAGTTCAGCTGCCGT-1_60;CAAGGCCGTCAAGAGC-1_74;ACGTCAAAGCACTCAT-1_74;AACGTTGAGGGTGTGT-1_65;GCTGCAGCAATGGACG-1_74;ACTACGAGTTGCCCGA-1_70;AAAGCGGCATGTGGGA-1_74;ACGTATGAGCCAGTAG-1_60;CCAAGATTCACGGCCA-1_74;GCGCAACCAATCGCAT-1_74_70;GGAACGATCTCCAGCT-1_60;ATAGCGTAGGTGATAT-1_74;CTCATTATCTTCGACC-1_74;CTCTGTGAGATTCACC-1_74;ATTCTACGTGTAACCT-1_70;TCAGCTCCACAGTGAG-1_74;CTTCATAAGC
ADD REPLY
1
Entering edit mode

Check out bcftools, especially bcftools view.

Much easier to use and faster than R. Just remember to be on top of alt-alleles, or to avoid confusion, split multi-allelic entries to biallelic entries using bcftools norm.

ADD REPLY
0
Entering edit mode

The format that the package parses a VCF to is insanely confusing! Why even use R to work with huge VCF files?

ADD REPLY
0
Entering edit mode

I know, right?? I initially did a python script and it was much easier, besides having better performance. But I need to deliver an R filter to a client that does not know (or even have installed) python...

ADD REPLY
0
Entering edit mode

The client wants the filter in R specifically? You can always system("bcftools view") :-D

ADD REPLY
0
Entering edit mode

Thank you for the suggestion. I already had the answer I was looking for. The client wants an R script to run in windows, hehe.

ADD REPLY
3
Entering edit mode
5.8 years ago
d-cameron ★ 2.9k

How do I parse the raw data of VCF in R?

You already know the answer to that: it's the readVcf() function. You use the VariantAnnotation library to fully parse the file, or you use a text file parser if you need the raw input lines for some reason (you very rarely do).

See here for an example of where I needed to do exactly that to work around a bug in the VariantAnnotation library.

I have a filtering function that outputs desired indexes (based on, for example, the value of info(VCF)$MQ ) but I can't subscript VCF with indexes because it's not a dataframe or a vector.

I would simply like to filter out certain lines of the vcf. For example, to filter out lines 1,5,7:

Sub-setting a VCF using VariantAnnotation works exactly like you describe. Just subset the VCF using the notation you would use for a vector: filteredVcf = vcf[-c(1, 5, 7)] will removes the first, firth, and seventh variant from the VCF.

You can also subset using a logical vector. For example: highqualvcf = vcf[VariantAnnotation::fixed(vcf)$QUAL > 20]

ADD COMMENT
0
Entering edit mode

Perfect answer, thank you!

I didn't realize VCF[-c(1, 5, 7)] would subset the vcf like if it was a vector. It only prints the metainformation but when doing info(filteredVcf) you can can see that the rows are gone.

Once again thanks and have a nice day/evening!

ADD REPLY

Login before adding your answer.

Traffic: 2497 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6