Question

Analyzing big data from Hi-C: is there a way to find significant interactions?

0

9.4 years ago

Baptiste ▴ 90

Hey everyone.

I have just started an internship in bioinformatics and I have to deal with big data from Hi-C. I want to analyze my data with R.

The data look like this: chrom start end count

I want to build a matrix in which each bin is filled with its count, then select the significant interactions and, if possible, plot a heat map.

This works at low resolution (500 kb, 100 kb), but when I run my code at high resolution (10 kb, 5 kb), problems occur and R cannot handle data of that size.

So I tried using a sparse matrix, but I can't run all of my code on it; I have to convert it to a regular matrix.

If you have a solution, or have already dealt with this kind of problem, let me know.

If you have a method for finding significant interactions at high resolution, that would be great. =)

Thank you very much,
Baptiste

R HiC • 4.2k views

updated 22 months ago by jocelyn.petitto ▴ 20 • written 9.4 years ago by Baptiste ▴ 90

2

Can you be a bit more specific than "R doesn't want to compute"? Are you running out of memory? Or getting an error message? (which one?) Is it a problem with a package from CRAN? Or your own code? Or is it a problem with the data itself? What kind of operation are you trying to apply?

updated 2.0 years ago by Ram 44k • written 9.4 years ago by Jean-Karim Heriche 27k

0

Hey

Thank you for your reply.

So this is my code:

file1: the raw Hi-C data (chrom start end count)
file2: the file used to normalize the raw data
binSize: the resolution, e.g. 500 kb = 5e5
dimension: the size of the matrix

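library(Matrix)  # provides sparseMatrix() and forceSymmetric()
# Bin indices are 1-based: position / binSize + 1
MyMatrix <- sparseMatrix(i = file1$V1 / binSize + 1, j = file1$V2 / binSize + 1, x = file1$V3, dims = c(dimension, dimension))
vector <- file2$V1
MatrixVector <- vector %o% vector  # outer product: a dense dimension x dimension matrix
MatrixNorm <- MyMatrix / MatrixVector
as.matrix(MatrixNorm)
MatrixNorm1 <- as.matrix(forceSymmetric(MatrixNorm))  # I want a symmetric matrix for the heatmap.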

The real problem is not really my code, but how to deal with big data in R and find significant interactions between the two sides of the DNA. I am sorry if my request was not clear.

The real question is: is there a way to find significant interactions at high resolution?

updated 2.0 years ago by Ram 44k • written 9.4 years ago by Baptiste ▴ 90

Ram · Answer 1 · 2015-07-07

3

9.4 years ago

Fidel ★ 2.0k

To solve your problem with large matrices, I recommend doing your analysis per chromosome. This will dramatically reduce the size of the matrix.

Moreover, which method are you using to identify enriched contacts?

Apart from the problem of handling large matrices in R, I would be concerned that, with increased resolution, the statistical power to discern significant contacts is reduced. Be sure that you have sufficient counts per cell in your matrix. This of course depends on the depth of sequencing, the final number of usable reads, and the size of the genome.
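One rough way to check this is sketched below in R. It assumes the binned contacts of one chromosome sit in a data frame with the hypothetical columns `start1`, `start2`, and `count`; the toy values, the chromosome length, and the cutoff of 5 reads are illustrative assumptions, not a recommendation.

```r
# Toy binned contacts for one chromosome; column names and values are hypothetical.
contacts <- data.frame(
  start1 = c(0, 0, 10000, 20000, 20000),
  start2 = c(0, 10000, 10000, 20000, 40000),
  count  = c(12, 3, 20, 15, 1)
)
bin_size   <- 10000   # resolution in bp
chrom_size <- 50000   # chromosome length in bp (assumed known)

n_bins <- ceiling(chrom_size / bin_size)
# Number of distinct bin pairs in the upper triangle, diagonal included.
n_cells <- n_bins * (n_bins + 1) / 2

coverage <- list(
  filled_fraction = nrow(contacts) / n_cells,  # share of cells with any reads
  median_count    = median(contacts$count),    # typical depth of a filled cell
  low_count_cells = sum(contacts$count < 5)    # filled cells with fewer than 5 reads
)
print(coverage)
```

If most filled cells hold only a handful of reads at 5 kb or 10 kb, a coarser resolution may be the safer choice for calling significant contacts.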

updated 5.1 years ago by Ram 44k • written 9.4 years ago by Fidel ★ 2.0k

0

Hi Fidel,

Thank you for your reply.

Yes, I forgot to specify that I only work with intrachromosomal interactions, and on only one chromosome.

To identify enriched contacts, I use "quantile" to find a threshold, then I apply this threshold to select the values above it.

Yes, this is a real problem because there are a lot of "NaN" values (which means it did not converge) and I have to deal with that. Unfortunately, I can't replace NaN with 0.

updated 5.1 years ago by Ram 44k • written 9.4 years ago by Baptiste ▴ 90

0

I work with Python and so far I haven't had a problem with the matrix size. Maybe you could try Python.

Have you checked the methods to compute long-range contacts by Job Dekker (Sanyal et al., Nature 2012), Victor Corces (Hou et al., Mol. Cell 2012), Bing Ren (same as Corces's) (Jin et al., Nature 2013) and Lieberman-Aiden (Rao et al., 2014)?

updated 2.0 years ago by Ram 44k • written 9.4 years ago by Fidel ★ 2.0k

Ram · Answer 2 · 2015-10-09

You might want to try Python + NumPy for this.

Though, why do you need to hold the entire genome-wide matrix in memory? Do you need the trans data as well? Or can you do as Fidel suggests and perform your calculations on each chromosome separately? Can you do your calculations in blocks/chunks?

In fact, the matrix format, while useful for visualization, is not ideal as a data structure. What about subsetting by genomic distance? You can then remove the n most distant diagonals from the matrix, and in sparse format they take up no memory at all.
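As a rough sketch of that idea in R: the normalization and the distance cutoff can be applied to the (i, j, count) table itself, so the dense outer product is never built. The data frame, its column names, the `norm_vec` values, and `max_dist` are all hypothetical; dropping pairs whose normalization factor is NaN is one possible choice, not the only one.

```r
library(Matrix)

# Toy contacts as 1-based bin indices (i <= j) and raw counts; names are hypothetical.
contacts <- data.frame(
  i     = c(1, 1, 2, 3, 1),
  j     = c(1, 2, 3, 4, 5),
  count = c(10, 4, 6, 8, 2)
)
norm_vec <- c(1, 2, 1, NaN, 0.5)  # one normalization factor per bin
max_dist <- 2                     # keep pairs at most 2 bins apart
n_bins   <- length(norm_vec)

# Normalize each non-zero entry directly: count / (v[i] * v[j]).
contacts$norm <- contacts$count / (norm_vec[contacts$i] * norm_vec[contacts$j])

# Drop pairs beyond the distance cutoff and pairs touching a bin
# whose normalization factor is not usable (NaN here).
keep <- abs(contacts$j - contacts$i) <= max_dist & is.finite(contacts$norm)
kept <- contacts[keep, ]

# Only the upper triangle is stored; symmetric = TRUE gives a symmetric
# sparse matrix without creating a dense n_bins x n_bins object.
m <- sparseMatrix(i = kept$i, j = kept$j, x = kept$norm,
                  dims = c(n_bins, n_bins), symmetric = TRUE)
print(m)
```

Memory then scales with the number of retained non-zero pairs rather than with the square of the number of bins.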

Some recent papers perform local peak calling, which in effect allows each submatrix to be 'peak-called' independently, and thus reduces memory requirements and allows you to compute in parallel! You do need to think carefully, though, about what you hope to achieve when calling peaks and what your definition of a 'peak/loop' actually means: global and local peak calling will produce vastly different results.

Also be aware of the distance bias in any interaction data. Loci close in the linear genome will also be close in the 3D genome and will have the strongest interaction signals. Depending on how you implement your peak calling, you may have to normalize for genomic distance before performing any quantile-based peak calling.
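A minimal R sketch of one simple distance correction (observed over expected per diagonal), assuming contacts stored as 1-based bin indices with i <= j. The names, the toy values, and the 0.8 quantile are hypothetical, and this is not the method of any of the papers mentioned in this thread.

```r
# Toy contacts as 1-based bin indices with i <= j; names and values are hypothetical.
contacts <- data.frame(
  i     = c(1, 2, 3, 1, 2, 1),
  j     = c(2, 3, 4, 3, 4, 4),
  count = c(10, 20, 30, 4, 8, 3)
)
n_bins <- 4

contacts$dist <- contacts$j - contacts$i

# Expected count at each distance: total counts on that diagonal divided by
# the number of cells on it, so empty (unlisted) cells count as zeros.
diag_sum   <- tapply(contacts$count, contacts$dist, sum)
diag_cells <- n_bins - as.integer(names(diag_sum))
expected   <- setNames(as.numeric(diag_sum) / diag_cells, names(diag_sum))

# Observed / expected removes the average decay of signal with distance.
contacts$obs_exp <- contacts$count / unname(expected[as.character(contacts$dist)])

# Quantile threshold on the distance-corrected values (0.8 is arbitrary).
threshold  <- quantile(contacts$obs_exp, 0.8)
candidates <- contacts[contacts$obs_exp > threshold, ]
print(candidates)
```

Thresholding `obs_exp` instead of the raw counts keeps the selection from being dominated by the first few diagonals.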

Ram · Answer 3 · 2015-07-07

0

9.4 years ago

Asaf 10k

Have you tried the Bioconductor packages GOTHiC and/or HiTC?

Regardless of these packages, you can bin your data according to the restriction enzyme recognition sites, which should reduce its complexity (if it's not already binned).

updated 2.0 years ago by Ram 44k • written 9.4 years ago by Asaf 10k

0

Hey Asaf, thanks for your reply.

The problem is that I don't have the mapped data. I tried to find other software, such as:

SeqMonk
HOMER
HiClib
HiBrowse

But these tools only accept mapped data as input, and converting my data to go that way seems like a lot of work.

What do you think?

updated 5.1 years ago by Ram 44k • written 9.4 years ago by Baptiste ▴ 90

1

If you have chromosome, start, end, and count... that is the RESULT of having mapped data. That is not raw Hi-C data. Raw data would be the read fragments. Very generalized steps: (1) read fragments, (2) aligned to a genome, (3) bins with counts <- this is what it sounds like you have.
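To make step 3 concrete, here is a small R sketch that turns aligned read-pair positions into bin counts. The `pairs` data frame, its column names, and the positions are made up for illustration; real pipelines also filter and deduplicate reads, which is left out here.

```r
# Toy aligned read pairs on one chromosome: positions of the two mates in bp.
# Column names and values are hypothetical.
pairs <- data.frame(
  pos1 = c(1200, 3500, 12000, 14999, 27000),
  pos2 = c(8000, 15000, 18000, 11000, 29000)
)
bin_size <- 10000

# Assign each mate to the start coordinate of its bin, smaller coordinate first.
bin1 <- pmin(pairs$pos1, pairs$pos2) %/% bin_size * bin_size
bin2 <- pmax(pairs$pos1, pairs$pos2) %/% bin_size * bin_size

# Count the read pairs that fall into each pair of bins.
binned <- aggregate(count ~ bin1 + bin2,
                    data = data.frame(bin1, bin2, count = 1), FUN = sum)
print(binned)
```

The resulting table has the same shape as a binned count file: one row per bin pair with its read count.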

22 months ago by jocelyn.petitto ▴ 20