Comparing methylation data - Data cleaning and efficient code question
8.7 years ago

I have data_1, a text (.txt) file on the mm9 assembly with the columns chr (chromosome number), stable_id, start, end and methylation.

I have data_2, a bigWig file on the mm10 assembly with the columns seqnames, ranges, strand and methylation score. It has over 10 million rows.

I want to compare data_1$start and data_1$end with data_2$ranges, and compute the average methylation score and the number of CpG islands.

These are the steps I followed, which I believe are a long route:

  1. Converted data_1 to the 'chrN:start-end' format and exported it as a CSV file.
  2. Uploaded this CSV file to the UCSC Genome Browser LiftOver tool and converted it from mm9 to mm10. The output was a BED file.
  3. Replaced the start and end columns of data_1 with the new start and end coordinates from the lifted-over BED file.
  4. Compared the start and end of data_1 with data_2. This is where I am stuck: it takes a lot of time to process in R. Is there a simpler way than what I followed?
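For reference, steps 1 and 3 can be sketched in base R roughly as follows. This is only an illustration on invented toy values: the file name regions_mm9.txt and the `lifted` vector are hypothetical, and the sketch assumes the LiftOver output lists the regions in the same 'chrN:start-end' format, in the same order, with none dropped. Regions that fail to convert would break that assumption and would have to be matched by an identifier instead.

```r
# Toy stand-in for data_1; column names follow the description above,
# all values are invented for illustration.
data_1 <- data.frame(
  chr = c("chr1", "chr2"),
  stable_id = c("id_a", "id_b"),
  start = c(3000000L, 4500000L),
  end = c(3000500L, 4500800L),
  methylation = c(0.8, 0.3),
  stringsAsFactors = FALSE
)

# Step 1: build 'chrN:start-end' strings, one region per line.
regions <- sprintf("%s:%d-%d", data_1$chr, data_1$start, data_1$end)
writeLines(regions, "regions_mm9.txt")  # hypothetical output file name

# Step 3: parse lifted-over 'chrN:start-end' strings back into columns.
# `lifted` stands in for the lines of the LiftOver output (invented values);
# it is assumed to have one entry per row of data_1, in the same order.
lifted <- c("chr1:3010000-3010500", "chr2:4510000-4510800")
parts <- do.call(rbind, strsplit(lifted, "[:-]"))
data_1$chr <- parts[, 1]
data_1$start <- as.integer(parts[, 2])
data_1$end <- as.integer(parts[, 3])
```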

I am new to the field, so please explain in steps.

genome sequencing R • 1.9k views
8.7 years ago
PoGibas 5.1k

Welcome to Biostars.
Please see my answer: A: findOverlaps function in R
Here I use foverlaps from the data.table package. It is fast and should give you what you want. If there are still problems, please edit your question and we will help.
Basically, you want to:

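library(data.table)
# Both objects must be data.tables with matching chr, start and end columns
setkey(data_1, chr, start, end)
setkey(data_2, chr, start, end)
# Each returned row pairs a data_1 interval with an overlapping data_2 interval
foverlaps(data_1, data_2)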
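To get from the overlaps to a per-region summary, the joined table can be aggregated by region. The following is a minimal sketch on invented toy data, not a tested pipeline: it assumes data_2 has already been imported from the bigWig file and converted to a data.table with chr, start and end columns, and the column name `score` is a placeholder. `n_sites` simply counts the overlapping data_2 rows, which may or may not be the count the question is after.

```r
library(data.table)

# Toy data; all values are invented for illustration.
# data_1: regions of interest (already lifted to mm10 in this sketch).
data_1 <- data.table(
  chr = c("chr1", "chr1", "chr2"),
  stable_id = c("id_a", "id_b", "id_c"),
  start = c(100L, 500L, 100L),
  end = c(200L, 600L, 200L)
)
# data_2: methylation scores with one interval per row.
data_2 <- data.table(
  chr = c("chr1", "chr1", "chr1", "chr2"),
  start = c(120L, 150L, 550L, 900L),
  end = c(121L, 151L, 551L, 901L),
  score = c(0.2, 0.6, 0.9, 0.5)
)

setkey(data_1, chr, start, end)
setkey(data_2, chr, start, end)

# nomatch = NULL drops data_1 regions that overlap no data_2 row.
hits <- foverlaps(data_1, data_2, nomatch = NULL)

# One row per data_1 region: mean score and number of overlapping rows.
summary_dt <- hits[, .(mean_score = mean(score), n_sites = .N),
                   by = .(chr, stable_id)]
print(summary_dt)
```

Regions without any overlap (id_c here) are absent from the summary. Leaving `nomatch` at its default keeps them, with NA in the data_2 columns.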

Just a friendly suggestion: don't name objects like data_1; use data1 instead. See Google's R Style Guide.
