Question

Parsing GATK Variant Tables with R

0

Entering edit mode

7.7 years ago

rb14sp ▴ 40

Hi, New to R and have variant tables from GATK VariantstoTables walker that I need to parse, giving me a headache. The format is as follows: CHR\tPOS\tID\tREF\tALT\tAC\tANN\tSAMPLE1_GQ\tSAMPLE1_DP\tSAMPLE2_GQ\tSAMPLE2_DP ... and so on for all my samples.

All I'd like to do is for each row determine all the samples with GQ >= 20 and DP >= 10 and write a new tab file with each as follows:

CHR\tPOS\tID\tREF\tALT\tAC\tANN\tFIRST_SAMPLE_PASSING_FILTER\tSECOND_SAMPLE_PASSING_FILTER\t .. and so on.

This should be simple but my R is really bad at the moment and I'm in a rush so if someone can lend me a hand I'd appeciate it. thanks - Robert

R genome • 2.5k views

ADD COMMENT • link updated 7.7 years ago by Santosh Anand 5.8k • written 7.7 years ago by rb14sp ▴ 40

0

Entering edit mode

Load in Excel or libreoffice. Filter by column values.

ADD REPLY • link 7.7 years ago by cpad0112 21k

1

Entering edit mode

Excel is not an acceptable data analysis tool.

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

2

Entering edit mode

I was about to write it is not excel that you should use. Remember vcf4.0 is a specific tab delimited format. If you do it with awk means you are breaking that format and then you need to reformat the tab output to again the vcf4.0 format. The best way is to use what Santosh said. -SelectVariants and -select handle from GATK. It works on vcf4.0 format and can do your operations. Since the file you are talking about is a GATK output

ADD REPLY • link 7.7 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

GATK lets you filter on the genotype properties in the format field, yes, I've done that already. So some of my format fields have filter flags indicating lowGQ or lowDP based on the criteria I specified. As for Excel, I have 140 samples and the files are very large and stored on a remote computer. Even if I downloaded them I'd have a hard time opening them in excel. In fact, I don't think there's enough space to fit them on my laptop.

ADD REPLY • link 7.7 years ago by rb14sp ▴ 40

score 1 · Answer 1 · 2017-05-08

1

Entering edit mode

7.7 years ago

Santosh Anand 5.8k

Why kill the mosquito with a sword? You can simply use awk for that. See example 5 below. http://www.thegeekstuff.com/2010/01/awk-introduction-tutorial-7-awk-print-examples

Or can use snpSift, which can do more sophisticated filtering http://snpeff.sourceforge.net/SnpSift.html

ADD COMMENT • link 7.7 years ago by Santosh Anand 5.8k

1

Entering edit mode

Fan of awk as always but if am not wrong GATK has handles or let's say functions that can perform such filtering and preserve the vcf4.0 format. In awk I guess if you convert into the required criterion you will again reformat the data into vcf4.0 format from the awk output which will be tab delimited but yes if you need to make a proper awk command which skips you entry lines and also reprint them and perform the operations that do your required query it will be fine. So GATK handles will be better suited here. You don't even need R

ADD REPLY • link 7.7 years ago by ivivek_ngs ★ 5.2k

2

Entering edit mode

Yes, you are right - SelectVariants can also do the job preserving even the VCF format .

http://gatkforums.broadinstitute.org/gatk/discussion/1255/using-jexl-to-apply-hard-filters-or-select-variants-based-on-annotation-values

Though my feeling is that the OP has already tab-delimited data, with which he wants to do some downstream analysis in R.

ADD REPLY • link 7.7 years ago by Santosh Anand 5.8k

0

Entering edit mode

Certainly that was my motivation ;)

ADD REPLY • link 7.7 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

Yes, I was pretty clear about the tab delimited part. I have variants in tables. GATK SelectVariants does not do what I want. I'm talking about listing samples based on values in the genotype field. For sites that PASS filter, which are the only sites written to output by VariantstoTables, I just want the samples with DP>=10 and GQ>=20. Again, it's all in tab delimited format as described above. For each row (i.e., variant ) get the samples that pass the conditions. For each variant I will then have the chrom, pos, id, ref, alt, ac, ANN, and a list of samples with the variant.

ADD REPLY • link 7.7 years ago by rb14sp ▴ 40

0

Entering edit mode

Why kill the mosquito with a sword?

This is applicable to this also: Good practices while dealing with very large datasets

ADD REPLY • link 7.7 years ago by venu 7.1k