Question

Efficiently storing methylation data from BS-sequencing

0

Entering edit mode

9.4 years ago

Chris Miller 22k

How are people efficiently storing methylation data derived from bisulfite sequencing?

The results spit out from using BSmap's associated tools include a value for each cytosine in the genome, in something like the below format. There's clearly some redundancy here, which makes it highly compressible, but we're still looking at gigabytes of files in yet another poorly-documented format.

Has the community coalesced around a format that has decent tool chains attached? The sequence context, in particular, seems useful, but not easily slammed into a bed file. VCF, as always, seems flexible enough to handle this, but might be overkill. Is there something else (maybe even binary) that I'm not finding with a search? Thoughts appreciated.

Example output here:

chr  pos       strand  context  ratio  eff_CT_count  C_count  CT_count  rev_G_count  rev_GA_count  CI_lower  CI_upper
22   16050004  +       ATCTG    0.000  1.00          0        1         0            0             0.000     0.793
22   16050013  +       GTCCC    0.000  3.00          0        3         0            0             0.000     0.562
22   16050014  +       TCCCA    0.000  3.00          0        3         0            0             0.000     0.562

Update: So far the consensus on twitter seems to be that there's been no standard (or even well-defined) format. Would love to hear if anyone knows differently

bsseq bisulfilte-sequencing methylation • 3.1k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.4 years ago by Chris Miller 22k

1

Entering edit mode

Twitter's correct, there's no real standard. I hear someone is working on a VCF variant, but who knows when that'll be finished (not to mention that that will end up being the N+1th format for this type of data). Frankly, I think most of us use modified bedGraph files, like those produced by bismark, bison and PileOMeth...but those often don't include context.

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

0

Entering edit mode

Is there any (large) file to test somewhere?

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I'm a newcomer to BS-Seq - why is storing the context is important?

Also, why should we store C->G and CT->GA events separately?

Why not just have chr, pos, #C, #G ?

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.4 years ago by John 13k

1

Entering edit mode

I'm not a methylation expert either, but my understanding is that whether context is important depends completely on the types of questions you're asking. The immediately adjacent base is certainly important (CpG/CHH/CHG contexts) and farther flanking bases may be important clues about mechanisms of methylation, drug binding, etc

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.4 years ago by Chris Miller 22k

1

Entering edit mode

Yeah for analysis you'd want extra fields :)

Still, for storage of the data in its most compact form, I'd have thought chr/pos/C/G is all you need. Perhaps you could even go to just chr/pos/CpG/GpC to throw away all the C's that aren't part of a CpG.

I would be interested to know how many C's not part of a CpG are not converted, and if that's due to real methylation of non-CpG Cs or incomplete bisulphite conversion.

Still, anything beyond that is redundant id have thought. Ratios, strand, etc, can all be calculated on-the-fly quicker than reading the next row off the file, so why make your file twice as big :D

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.4 years ago by John 13k

0

Entering edit mode

The non-CpG methylation rate is species-dependent. For most mammals, you'd expect low single-digit percentages, though. I hear that this is different in plants, but I've never worked on them. The bedGraph-variant format that I mentioned above is essentially what you mentioned. Typically, you only store a single context in a given file. This still has an extra column, though, since often people want to collapse metrics into CpG level rather than single-C (to get higher power, since methylation tends to be symmetric).

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.4 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks Devon :)

So that might suggest that methylated non-CpG Cytosines are informative..?

I wonder if they cluster at certain genomic regions, or if they even have the same function as methylated CpGs. I need to research that a bit more because my understanding is a pretty poor.

About the symmetry though, that's a very good point. For a CpG you'd probably want a format that measures the methylation ratio of the forward and reverse strand on the same row, like:

chr | pos | C_F | C_R | G_F | G_R

where _F is the methylation (or lack of) at pos, and _R is the methylation (or lack of) at pos+1. The CpG is a single 'thing', and so I guess it makes more sense to make that a single row than two. I originally thought this is what was going on in the OP"s example with the "rev" columns (standing for reverse strand), but then I realised it can't be that since row 2/3 are talking about the same CpG just at pos and pos+1, so rev must stand for reverted or something.

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.4 years ago by John 13k