Why are 6mA data always 41 bp long?
2.2 years ago
Peter • 0

Hi all,

I have read many articles about 6mA methylation prediction. I am confused about why most open datasets, such as those of Chen et al. (2019) and Lv (6mA-RicePred), consist of 41 bp sequences with the methylation label at the center. Why didn't the authors collect longer sequences?

Thank you

methylation 6mA • 638 views
2.2 years ago
acvill ▴ 350

The paper you reference (Chen et al. 2019) includes the following excerpt, emphasis mine:

In order to construct a high-quality benchmark dataset, the following two procedures were performed. First, according to the Methylome Analysis Technical Note, a score of 30 is the default threshold for calling a nucleotide as modified. Hence, the sites with a modification score of <30 were filtered out. Second, a dataset containing many redundant samples with high similarity has the low statistical representativeness. A computational model, if trained and tested by such a biased benchmark dataset, might yield overestimated accuracy. To get rid of redundancy and minimize the bias, the CD-HIT software (Fu et al., 2012) with the cutoff threshold of 60% was used to remove those sequences with high sequence similarity. After following these two procedures, we obtained 880 positive samples. Preliminary tests indicated that the best predictive results were achieved when the sequence length is 41 bp.

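So it seems that 41 bp, i.e. the candidate site plus 20 nt of flanking sequence on each side, was chosen empirically: it gave the best predictive results in the authors' preliminary tests, presumably as a trade-off between sequence context and overfitting of the model.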
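To make the window size concrete, here is a minimal sketch of how a 41 bp sample centered on a site could be cut from a longer sequence. This is not the authors' pipeline: the function name `extract_window`, the 0-based `site` index, and the toy sequence are all assumptions for illustration. Changing `flank` is how you would try longer or shorter windows yourself.

```python
from typing import Optional


def extract_window(seq: str, site: int, flank: int = 20) -> Optional[str]:
    """Return the (2 * flank + 1)-nt window centered on a 0-based site index.

    Hypothetical helper for illustration; returns None when the site is too
    close to either end of the sequence to give a full-length window.
    """
    start = site - flank
    end = site + flank + 1  # slice end is exclusive
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


# Made-up toy sequence with a single adenine at index 25.
toy = "C" * 25 + "A" + "G" * 30
window = extract_window(toy, site=25)
print(len(window), window[20])  # 41 A
```

With `flank=20` the labeled adenine always lands at index 20 of the window, which matches the "label at the center" layout of the 41 bp datasets.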
