Hello world,
So I'm trying to find out whether certain regions of interest interact, based on these published results. I am no specialist when it comes to Hi-C data, or bioinformatics clearly, but I do have a strong theoretical background and have watched and read tutorials on how to process raw reads. So, when I finally felt confident enough to process the reads, I went to the accession and picked up these files, which were described in the paper as raw reads, but they obviously are not. I tried loading this file into SeqMonk, but it wouldn't take it, presumably because the file is in an unusual format.
I want to know what these columns are; I've asked the authors but got no solid reply. Also, how do I modify this file for better visualization?
Self described as:
Library strategy: Hi-C
Hi-C reads were aligned using Bowtie 0.12.7 with default parameters and “-m 1”
PCR duplicate reads were removed
GC content, mappability, and fragment length effects were normalized as described in Hou et al., Molecular Cell 48, 471-484 (2012).
Genome_build: dm3
Supplementary_files_format_and_content: Hi-C processed files are in a modified bed format. Each row lists the chromosome and the start and end coordinates of two interacting bins as well as the normalized interaction frequency between these two bins
There are no headers on this file and to me this is not a traditional .hic file. What do you interpret these columns to be?! Also apparently in column "I" 0 = + strand, 16 = - strand...
It is best to upload images to a free image hosting provider (e.g. https://imgbb.com/ ) and then include the http links in your post.
As much as we appreciate the humor in the title, a brief description of your problem would be more useful and more appropriate.
Can you provide a link for the paper which corresponds to this data? Did this file come from supplementary materials or GEO/SRA? I assume the Molecular Cell reference is only describing the method used in that paper?
FWIW: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1551439
Raw fastq data can be found at ENA.
Thank you for your support genomax and Sean Davis.
I'm guessing you guys suggest I start from the raw reads instead of trying to decipher this formatting?
The paper referred to above is their own. Take a look at it and the supplementary files. They may be more useful than the file above.
Depends on your use case. Processing Hi-C data is pretty compute-intensive in many cases (many reads, multiple steps). If your goal is to step off from where the paper left off, using their processed data will be simplest. If you already have a Hi-C pipeline (or want to go through the process of developing one), then starting with raw reads seems a great way to go.
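If you do go the processed-data route, here is a minimal Python sketch of how one might load such a file and pull out bin pairs that link two regions of interest. It assumes the layout given in the GEO description (chromosome, start, and end of bin 1, then chromosome, start, and end of bin 2, then the normalized interaction frequency), i.e. seven whitespace-separated columns with half-open BED-style coordinates. That column order is an assumption, not something confirmed by the authors, and the function names and example rows are made up for illustration. If the downloaded file has a different number of columns (the 0/16 strand-like column suggests it may), this layout does not apply as-is.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open like BED


class BinPair(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    freq: float


def parse_bin_pairs(lines: Iterable[str]) -> Iterator[BinPair]:
    """Parse rows in the ASSUMED 7-column layout; skip rows that do not fit."""
    for line in lines:
        fields = line.split()
        if len(fields) != 7 or line.startswith("#"):
            continue  # blank line, comment, or a row with another layout
        c1, s1, e1, c2, s2, e2, freq = fields
        yield BinPair(c1, int(s1), int(e1), c2, int(s2), int(e2), float(freq))


def overlaps(chrom: str, start: int, end: int, region: Region) -> bool:
    """True if the bin shares at least one base with the region."""
    r_chrom, r_start, r_end = region
    return chrom == r_chrom and start < r_end and end > r_start


def interactions_between(
    pairs: Iterable[BinPair], region_a: Region, region_b: Region
) -> List[BinPair]:
    """Return bin pairs with one bin in region_a and the other in region_b."""
    hits = []
    for p in pairs:
        a_then_b = overlaps(p.chrom1, p.start1, p.end1, region_a) and overlaps(
            p.chrom2, p.start2, p.end2, region_b
        )
        b_then_a = overlaps(p.chrom1, p.start1, p.end1, region_b) and overlaps(
            p.chrom2, p.start2, p.end2, region_a
        )
        if a_then_b or b_then_a:
            hits.append(p)
    return hits


# Made-up example rows, NOT taken from the GEO file.
example_rows = [
    "chr2L\t0\t10000\tchr2L\t50000\t60000\t3.5",
    "chr2L\t10000\t20000\tchr3R\t0\t10000\t1.25",
    "",
]
example_pairs = list(parse_bin_pairs(example_rows))
example_hits = interactions_between(
    example_pairs, ("chr2L", 5000, 8000), ("chr2L", 55000, 56000)
)
```

For a real file you would pass an open file handle to `parse_bin_pairs` instead of the example list. The check is symmetric, so it does not matter which of your two regions appears as the first bin in the file.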
Yeah, this has given me more of a headache than doing it myself would. I'll start from scratch. Thanks for your help.