Question

Why are 6mA datasets always 41 bp long?

2.7 years ago

Peter • 0

Hi all,

I have read many articles about 6mA methylation prediction. I am confused about why most open datasets, such as those of Chen et al. (Chen et al., 2019) and Lv (6mA-RicePred), consist of 41 bp sequences with the methylated site at the center. Why didn't they collect longer sequences?

Thank you

methylation 6mA • 786 views

updated 2.7 years ago by acvill ▴ 350 • written 2.7 years ago by Peter • 0

score 1 · Accepted Answer · 2022-10-04

The paper you reference (Chen et al. 2019) includes the following excerpt, emphasis mine:

In order to construct a high-quality benchmark dataset, the following two procedures were performed. First, according to the Methylome Analysis Technical Note, a score of 30 is the default threshold for calling a nucleotide as modified. Hence, the sites with a modification score of <30 were filtered out. Second, a dataset containing many redundant samples with high similarity has the low statistical representativeness. A computational model, if trained and tested by such a biased benchmark dataset, might yield overestimated accuracy. To get rid of redundancy and minimize the bias, the CD-HIT software (Fu et al., 2012) with the cutoff threshold of 60% was used to remove those sequences with high sequence similarity. After following these two procedures, we obtained 880 positive samples. Preliminary tests indicated that the best predictive results were achieved when the sequence length is 41 bp.

So it seems that 41 bp was chosen empirically: it gave the best predictive results in the authors' preliminary tests, presumably as a trade-off between sequence context and overfitting of the model.
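If you want to try other lengths yourself, here is a minimal Python sketch of how a site-centered window could be cut from a longer sequence. A 41 bp window corresponds to 20 bp of flank on each side of the candidate adenine. The function name `extract_window`, the `flank` parameter, and the toy sequence are my own illustration, not code from Chen et al. or 6mA-RicePred, and I am assuming 0-based coordinates and that sites too close to a sequence end are simply skipped.

```python
def extract_window(sequence, center, flank=20):
    """Return the (2 * flank + 1)-bp window centered on the 0-based
    index `center`, or None if the window would run past either end.

    Hypothetical helper for illustration; flank=20 gives 41 bp.
    """
    start = center - flank
    end = center + flank + 1  # slice end is exclusive
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


# Toy sequence (made up): a single A at index 20, padded on both sides.
toy = "C" * 20 + "A" + "G" * 20 + "TTTT"

window = extract_window(toy, 20)
print(len(window), window[20])
```

Changing `flank` (for example 10, 20, 30) and retraining on each set is one way to reproduce the kind of preliminary length comparison the paper describes, assuming you have access to the genome and the site coordinates.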