Question

ENCODE Hi-C FASTQ files are SINGLE end

8 months ago

hamarillo ▴ 80

Hi!

Does anyone know why ENCODE's Hi-C raw data (FASTQ files) are single-end?

See for example here: https://www.encodeproject.org/files/ENCFF846THP/

and the corresponding Sequence Read Archive Entry: https://www.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR25419778&display=metadata

There's only one read per entry, whereas Hi-C data is typically paired-end. ENCODE's guidelines for Hi-C data (here) say .fastq files are "G-zipped reads, paired-ended or single ended, stranded or unstranded." So I have to assume that both ends are contained in the single .fastq file, but I have no other clues, and I don't know how to tell which part of the (I assume) chimeric read corresponds to which end.

I mean, even the ENCODE hic-pipeline shows FASTQ file PAIRS in its examples (see here).

I'd appreciate advice from someone who's worked with ENCODE single-end Hi-C data.

Thanks!

fastq layout ENCODE Hi-C • 927 views

updated 8 months ago by Yihang • 0 • written 8 months ago by hamarillo ▴ 80

score 1 · Answer 1 · 2024-03-12

I got this reply from ENCODE's help desk:

Thank you for the interest in ENCODE data.

The HiC experiments you are looking at have been produced using Ultima Genomics platform (https://www.ultimagenomics.com/) that generates the raw data as single ended reads.

Aiden lab has adjusted the code of the pipeline we used to uniformly process HiC data to work with single end reads as well as paired end reads.

https://github.com/ENCODE-DCC/hic-pipeline

So it's kind of making you use their pipeline ¯\_(ツ)_/¯

I figure that if I wanted to do it myself for some reason, the key would be to remove Ultima Genomics' sequences from the raw reads and then align with BWA.
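For the alignment half of that idea, here is a rough sketch of how a single-end `bwa mem` call could be assembled from Python. The file names `ref.fa` and `reads.fastq.gz` and the helper names are placeholders I made up, and it assumes the reference has already been indexed with `bwa index`. The `-5` flag (take the split alignment with the smallest coordinate as primary) is commonly used for Hi-C; I have not confirmed which flags the ENCODE pipeline passes for single-end input. Trimming is not covered here.

```python
import shlex
import subprocess

def bwa_mem_single_end_cmd(reference, fastq, threads=4):
    """Build a bwa mem command for ONE FASTQ file (single-end input).

    Chimeric reads are reported by bwa mem as a primary alignment plus
    supplementary alignments (SAM flag 0x800).
    """
    return ["bwa", "mem", "-5", "-t", str(threads), reference, fastq]

def run_alignment(reference, fastq, out_sam, threads=4):
    """Run the command and write SAM to out_sam (requires bwa on PATH)."""
    cmd = bwa_mem_single_end_cmd(reference, fastq, threads)
    with open(out_sam, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)

# Placeholder paths, only to show what the command looks like.
cmd = bwa_mem_single_end_cmd("ref.fa", "reads.fastq.gz")
print(shlex.join(cmd))
```

Only the command construction runs here; `run_alignment` needs bwa installed and an indexed reference.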

score 0 · Answer 2 · 2024-03-30

Hi,

I am now looking at the same dataset, and I am curious how they convert a single-end read into a read pair after alignment. My understanding is that two chimeric alignments of a read actually represent the two ends, and that the original alignment of the whole read is not useful. However, I find some reads with more than two chimeric alignments, and in those cases I don't know which two alignments are the two ends. Besides, I am not sure whether we should choose two chimeric alignments such that one is forward-aligned and the other is reverse-aligned.
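To make the question concrete, here is a minimal sketch of one possible heuristic: group the alignments of each read, order them by where they start along the original read, and take the outermost two as the "ends". This is an assumption for illustration, not what the ENCODE pipeline is confirmed to do, and the function names are made up. Strand is recorded but not used as a filter.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def read_offset(flag, cigar):
    """Start of the aligned segment on the ORIGINAL read (0-based).

    CIGAR is written in reference orientation, so for reverse-strand
    alignments (flag 0x10) the trailing clip is the leading one on the read.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if flag & 0x10:
        ops = ops[::-1]
    offset = 0
    for n, op in ops:
        if op not in "SH":
            break
        offset += n
    return offset

def segments_by_read(sam_lines):
    """Map read name -> list of (offset, chrom, pos, strand, mapq), sorted by offset."""
    reads = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        f = line.rstrip("\n").split("\t")
        flag = int(f[1])
        if flag & 0x4 or flag & 0x100:
            continue  # skip unmapped and secondary; keep primary + supplementary
        strand = "-" if flag & 0x10 else "+"
        reads[f[0]].append((read_offset(flag, f[5]), f[2], int(f[3]), strand, int(f[4])))
    for segs in reads.values():
        segs.sort()
    return dict(reads)

def outer_pair(segs):
    """Heuristic (an assumption): first and last segment along the read."""
    return (segs[0], segs[-1]) if len(segs) >= 2 else None

# Toy SAM records for two 100 bp reads (invented for illustration).
example = [
    "r1\t0\tchr1\t1000\t60\t40M60S\t*\t0\t0\t*\t*",
    "r1\t2064\tchr5\t5000\t60\t50M50H\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t60\t30M70S\t*\t0\t0\t*\t*",
    "r2\t2048\tchr2\t300\t60\t30H30M40H\t*\t0\t0\t*\t*",
    "r2\t2048\tchr3\t400\t60\t60H40M\t*\t0\t0\t*\t*",
]
pairs = {name: outer_pair(segs) for name, segs in segments_by_read(example).items()}
```

For reads with three or more segments, taking the outer two throws away the middle segment, which may itself be a real contact; whether to drop such reads, keep them, or split them into several pairs is a design choice this sketch does not settle.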

Happy to discuss if you have any ideas on that. Thanks!