Hi, I'm new to R and have variant tables from the GATK VariantsToTable walker that I need to parse, and it's giving me a headache. The format is as follows: CHR\tPOS\tID\tREF\tALT\tAC\tANN\tSAMPLE1_GQ\tSAMPLE1_DP\tSAMPLE2_GQ\tSAMPLE2_DP ... and so on for all my samples.
All I'd like to do is, for each row, determine all the samples with GQ >= 20 and DP >= 10 and write a new tab-delimited file with each row as follows:
CHR\tPOS\tID\tREF\tALT\tAC\tANN\tFIRST_SAMPLE_PASSING_FILTER\tSECOND_SAMPLE_PASSING_FILTER\t .. and so on.
This should be simple, but my R is really bad at the moment and I'm in a rush, so if someone can lend me a hand I'd appreciate it. Thanks - Robert
Load it in Excel or LibreOffice and filter by column values.
Excel is not an acceptable data analysis tool.
I was about to write that Excel is not what you should use. Remember that VCF 4.0 is a specific tab-delimited format. If you do it with awk, you break that format and then need to reformat the tab output back into VCF 4.0. The best way is to use what Santosh said: the -SelectVariants and -select handle from GATK. It works on the VCF 4.0 format and can do your operations. Since the file you are talking about is a GATK output
GATK lets you filter on the genotype properties in the format field; yes, I've done that already. So some of my format fields have filter flags indicating lowGQ or lowDP based on the criteria I specified. As for Excel, I have 140 samples, and the files are very large and stored on a remote computer. Even if I downloaded them, I'd have a hard time opening them in Excel. In fact, I don't think there's enough space to fit them on my laptop.
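For what it's worth, the per-row filter described in the question can also be done as a plain streaming pass over the table, which avoids loading a large file into memory on the remote machine. Below is a minimal sketch in Python rather than R. It assumes the layout given in the question: seven fixed columns (CHR through ANN) followed by SAMPLE_GQ and SAMPLE_DP columns, with a header line. The function names (`sample_columns`, `filter_table`) are made up for illustration, and treating non-numeric values such as NA or "." as failing the filter is my assumption, not something stated in the thread.

```python
N_FIXED = 7  # assumed fixed columns: CHR, POS, ID, REF, ALT, AC, ANN


def _to_number(text):
    """Return text as a float, or None for missing values such as 'NA' or '.'."""
    try:
        return float(text)
    except ValueError:
        return None


def sample_columns(header):
    """Map each sample name to the column indexes of its _GQ and _DP fields."""
    cols = {}
    for idx, name in enumerate(header[N_FIXED:], start=N_FIXED):
        for suffix in ("_GQ", "_DP"):
            if name.endswith(suffix):
                sample = name[: -len(suffix)]
                cols.setdefault(sample, {})[suffix[1:]] = idx
    # Keep only samples that have both a GQ and a DP column.
    return {s: c for s, c in cols.items() if "GQ" in c and "DP" in c}


def filter_table(infile, outfile, min_gq=20, min_dp=10):
    """Write the fixed columns plus the names of samples passing both thresholds.

    Reads one line at a time, so memory use does not grow with file size.
    The output has no header line because each row can list a different
    number of samples.
    """
    header = next(infile).rstrip("\r\n").split("\t")
    cols = sample_columns(header)
    for line in infile:
        row = line.rstrip("\r\n").split("\t")
        passing = []
        for sample, c in cols.items():
            gq = _to_number(row[c["GQ"]])
            dp = _to_number(row[c["DP"]])
            # Missing values (assumption) count as failing the filter.
            if gq is not None and dp is not None and gq >= min_gq and dp >= min_dp:
                passing.append(sample)
        outfile.write("\t".join(row[:N_FIXED] + passing) + "\n")
```

To run it on a real file, open the input and output as text files and pass both handles to `filter_table`. Note that this works on the VariantsToTable output only; it does not produce VCF, so the point above about keeping VCF 4.0 intact still applies if you need VCF downstream.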