Entering edit mode
4.3 years ago
dcheng1
•
0
I merged multiple fastq files using the command below:
cat *.fastq.gz > merge.fastq.gz
The merged fastq file was succefully aligned to reference genome. But I received the following error when calling peaks using SICER: IndexError: list index out of range.
Peak calling using single fastq file works well, so the problem should from the merged files. Why merging fastq files will cause IndexError?
Please show all code. We cannot see your screen. You may also want to check out
epic2
, a high-performance reimplementation of SICER that is by far less painful to use and faster.Thanks for your response! See the code below.
Is there actually something in
merge_sort.bam
. The sort command writes tomerge_sort.bam
twice, once from-o
and once from stdout. Would not surprise me if the file was corrupted. What is the output ofsamtools flagstat merge_sort.bam
? Really try epic2, it accepts bam, no need for this fiddling with BED files.Yes,
merge_sort.bam
is a very large file. The output ofsamtools flagstat merge_sort.bam
is :And thanks for your suggestion, I will try epic2.
Almost a billion reads, wow, is that intended? For ChIP-seq? Be sure to run a duplicate marking algorithm before calling your peaks. At that depth it is almost certain that you have a lot of duplicates. I do not remember whether epic2 has a duplicate removal algorithm built-in.
Yes, it is. That's because I merged over 50 fastq files into a single file
cat *.fastq.gz > merge.fastq.gz
. SICER can remove redundant reads, so epic2 probably also can do duplicate removal.Was data from the same sample?
No, it's from differernt samples. I'm doing so because I want to get a list of peaks that include peaks of all samples.
I think you need to use something like DROMPAplus (just as an example) to analyze multiple samples. Concatenating all samples into a single file does not sound like the correct way to do this. A manual is available here.
Okay, thanks for your information, I will look into that.
There are multiple ways to do that. First, catting all files together and then call peaks is acceptable if you plan to use the peak set for differential analysis. The paper on the
csaw
tool describes in great detail several peak calling strategies. Alternatively, use a peak caller that uses replicates, such asPePr
(requires an input sample) orGenrich
to get peaks that are reproducible across replicates. This obviously is only a good idea when samples belong to the same treatment group. Third, do the latter for each group separately, then merge peaks. Many ways to do that, and it depends n which question you want to answer. If these are independent replicates you want to discard replicates per file but then keep those reads that are duplicates across files since these cannot be PCR redundancies. I would spend some time reading literature and previously posts on ChIP-seq before deciding for a pipeline. Then choose the one suitable to answer your scientific question.Yep, my goal is to generate a peak set for differential analysis. I've trid merging peaks using the code below. But I ended up getting super-long peaks. And this introduced many false postives to the downstream differential analysis. I will look into the csaw paper you mentioned. Thanks!