How are people efficiently storing methylation data derived from bisulfite sequencing?
The results spit out from using BSmap's associated tools include a value for each cytosine in the genome, in something like the below format. There's clearly some redundancy here, which makes it highly compressible, but we're still looking at gigabytes of files in yet another poorly-documented format.
Has the community coalesced around a format that has decent tool chains attached? The sequence context, in particular, seems useful, but not easily slammed into a bed file. VCF, as always, seems flexible enough to handle this, but might be overkill. Is there something else (maybe even binary) that I'm not finding with a search? Thoughts appreciated.
Example output here:
chr pos strand context ratio eff_CT_count C_count CT_count rev_G_count rev_GA_count CI_lower CI_upper
22 16050004 + ATCTG 0.000 1.00 0 1 0 0 0.000 0.793
22 16050013 + GTCCC 0.000 3.00 0 3 0 0 0.000 0.562
22 16050014 + TCCCA 0.000 3.00 0 3 0 0 0.000 0.562
Update: So far the consensus on twitter seems to be that there's been no standard (or even well-defined) format. Would love to hear if anyone knows differently
Twitter's correct, there's no real standard. I hear someone is working on a VCF variant, but who knows when that'll be finished (not to mention that that will end up being the N+1th format for this type of data). Frankly, I think most of us use modified bedGraph files, like those produced by bismark, bison and PileOMeth...but those often don't include context.
Is there any (large) file to test somewhere?
I'm a newcomer to BS-Seq - why is storing the context is important?
Also, why should we store C->G and CT->GA events separately?
Why not just have chr, pos, #C, #G ?
I'm not a methylation expert either, but my understanding is that whether context is important depends completely on the types of questions you're asking. The immediately adjacent base is certainly important (CpG/CHH/CHG contexts) and farther flanking bases may be important clues about mechanisms of methylation, drug binding, etc
Yeah for analysis you'd want extra fields :)
Still, for storage of the data in its most compact form, I'd have thought chr/pos/C/G is all you need. Perhaps you could even go to just chr/pos/CpG/GpC to throw away all the C's that aren't part of a CpG.
I would be interested to know how many C's not part of a CpG are not converted, and if that's due to real methylation of non-CpG Cs or incomplete bisulphite conversion.
Still, anything beyond that is redundant id have thought. Ratios, strand, etc, can all be calculated on-the-fly quicker than reading the next row off the file, so why make your file twice as big :D
The non-CpG methylation rate is species-dependent. For most mammals, you'd expect low single-digit percentages, though. I hear that this is different in plants, but I've never worked on them. The bedGraph-variant format that I mentioned above is essentially what you mentioned. Typically, you only store a single context in a given file. This still has an extra column, though, since often people want to collapse metrics into CpG level rather than single-C (to get higher power, since methylation tends to be symmetric).
Thanks Devon :)
So that might suggest that methylated non-CpG Cytosines are informative..?
I wonder if they cluster at certain genomic regions, or if they even have the same function as methylated CpGs. I need to research that a bit more because my understanding is a pretty poor.
About the symmetry though, that's a very good point. For a CpG you'd probably want a format that measures the methylation ratio of the forward and reverse strand on the same row, like:
where
_F
is the methylation (or lack of) at pos, and_R
is the methylation (or lack of) at pos+1. The CpG is a single 'thing', and so I guess it makes more sense to make that a single row than two. I originally thought this is what was going on in the OP"s example with the "rev" columns (standing for reverse strand), but then I realised it can't be that since row 2/3 are talking about the same CpG just at pos and pos+1, so rev must stand for reverted or something.