Hadley Wickham, the author of ggplot and many other popular R packages, has recently published a very good paper regarding the principles of tidy data. This article introduces a new library called tidyr, and also proposes a standard for formatting and organizing data before data analysis.
I personally think that the principles proposed in the article are very good, and that they help a lot in data analysis. Some of these are already adopted by many ggplot2/plyr users, as you need a data frame in a long format in order to produce most of the plots.
My question is whether it would make sense to apply these principles to bioinformatics. In particular, if we look at the VCF format, it fails at least two of the rules mentioned in the paper:
- "3.1. Column headers are values, not variable names" (because individuals are encoded on distinct columns)
- "3.2. Multiple variables stored in one column" (because each genotype column contains the status of one or more alleles, plus its coverage etc...
For example, let's take the example from the 4.0 specs of VCF:
> vcf = read.table('vcf_example.vcf', header=T, comment.char='')
> vcf
X.CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
1 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
2 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
5 20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
If we were to convert this data to a "tidyr" format, we could use the following:
> vcf %>%
gather(individual, value, -c(X.CHROM:FORMAT)) %>%
separate(value, into=strsplit('GT:GQ:DP:HQ', ':')[[1]], ':', extra='drop') %>%
separate('GT', into=c('allele1', 'allele2'), '[|/]') %>%
gather(allele, genotype, -c(X.CHROM:individual, GQ:HQ)) %>%
arrange(X.CHROM, POS, ID, individual) %>%
select(-INFO, -FORMAT, -FILTER) %>% # let's omit this for better visualization
subset(ID!='microsat1') # let's omit this for better visualization
X.CHROM POS ID REF ALT QUAL individual GQ DP HQ allele genotype
20 14370 rs6054257 G A 29 NA00001 48 1 51,51 allele1 0
20 14370 rs6054257 G A 29 NA00001 48 1 51,51 allele2 0
20 14370 rs6054257 G A 29 NA00002 48 8 51,51 allele1 1
20 14370 rs6054257 G A 29 NA00002 48 8 51,51 allele2 0
20 14370 rs6054257 G A 29 NA00003 43 5 .,. allele1 1
20 14370 rs6054257 G A 29 NA00003 43 5 .,. allele2 1
20 17330 . T A 3 NA00001 49 3 58,50 allele1 0
20 17330 . T A 3 NA00001 49 3 58,50 allele2 0
20 17330 . T A 3 NA00002 3 5 65,3 allele1 0
20 17330 . T A 3 NA00002 3 5 65,3 allele2 1
20 17330 . T A 3 NA00003 41 3 <NA> allele1 0
20 17330 . T A 3 NA00003 41 3 <NA> allele2 0
20 1110696 rs6040355 A G,T 67 NA00001 21 6 23,27 allele1 1
20 1110696 rs6040355 A G,T 67 NA00001 21 6 23,27 allele2 2
20 1110696 rs6040355 A G,T 67 NA00002 2 0 18,2 allele1 2
20 1110696 rs6040355 A G,T 67 NA00002 2 0 18,2 allele2 1
20 1110696 rs6040355 A G,T 67 NA00003 35 4 <NA> allele1 2
20 1110696 rs6040355 A G,T 67 NA00003 35 4 <NA> allele2 2
20 1230237 . T . 47 NA00001 54 7 56,60 allele1 0
20 1230237 . T . 47 NA00001 54 7 56,60 allele2 0
20 1230237 . T . 47 NA00002 48 4 51,51 allele1 0
20 1230237 . T . 47 NA00002 48 4 51,51 allele2 0
20 1230237 . T . 47 NA00003 61 2 <NA> allele1 0
20 1230237 . T . 47 NA00003 61 2 <NA> allele2 0
(note that I omitted a few columns for visualization purpose and that I should also separate the HQ and INFO columns, but I think the concept is clear)
On one hand, this tidy format seems a lot more redundant than what is needed. Lot of information is repeated multiple times, and this file would require a lot of disk space to be stored (although maybe an efficient compression or a data base would reduce the problem).
On the other hand, this tidy format may be easier to use. To me, it seems easier to get the status of a given allele in a given individual, as I just need to find the corresponding row. Another advantage is that it would be easier to add new information to the file, like the population group of each individual, just by adding a new column.
> vcf.tidy %>% mutate(genotype.nt = ifelse(genotype==0, as.character(REF), as.character(ALT)), pop=ifelse(individual %in% c('NA00001', 'NA00002'), 'CHB', 'YRI'))
X.CHROM POS ID REF ALT QUAL individual GQ DP HQ allele gen gen.nt pop
20 14370 rs6054257 G A 29 NA00001 48 1 51,51 allele1 0 G CHB
20 14370 rs6054257 G A 29 NA00001 48 1 51,51 allele2 0 G CHB
20 14370 rs6054257 G A 29 NA00002 48 8 51,51 allele1 1 A CHB
20 14370 rs6054257 G A 29 NA00002 48 8 51,51 allele2 0 G CHB
20 14370 rs6054257 G A 29 NA00003 43 5 .,. allele1 1 A YRI
20 14370 rs6054257 G A 29 NA00003 43 5 .,. allele2 1 A YRI
20 17330 . T A 3 NA00001 49 3 58,50 allele1 0 T CHB
20 17330 . T A 3 NA00001 49 3 58,50 allele2 0 T CHB
20 17330 . T A 3 NA00002 3 5 65,3 allele1 0 T CHB
20 17330 . T A 3 NA00002 3 5 65,3 allele2 1 A CHB
20 17330 . T A 3 NA00003 41 3 <NA> allele1 0 T YRI
20 17330 . T A 3 NA00003 41 3 <NA> allele2 0 T YRI
20 1110696 rs6040355 A G,T 67 NA00001 21 6 23,27 allele1 1 G,T CHB
20 1110696 rs6040355 A G,T 67 NA00001 21 6 23,27 allele2 2 G,T CHB
20 1110696 rs6040355 A G,T 67 NA00002 2 0 18,2 allele1 2 G,T CHB
20 1110696 rs6040355 A G,T 67 NA00002 2 0 18,2 allele2 1 G,T CHB
20 1110696 rs6040355 A G,T 67 NA00003 35 4 <NA> allele1 2 G,T YRI
20 1110696 rs6040355 A G,T 67 NA00003 35 4 <NA> allele2 2 G,T YRI
20 1230237 . T . 47 NA00001 54 7 56,60 allele1 0 T CHB
20 1230237 . T . 47 NA00001 54 7 56,60 allele2 0 T CHB
20 1230237 . T . 47 NA00002 48 4 51,51 allele1 0 T CHB
20 1230237 . T . 47 NA00002 48 4 51,51 allele2 0 T CHB
20 1230237 . T . 47 NA00003 61 2 <NA> allele1 0 T YRI
20 1230237 . T . 47 NA00003 61 2 <NA> allele2 0 T YRI
What do you think? Will we all convert to tidy VCF in the far future?
Yes, a format more geared towards tidy data would be highly beneficial for the field.