Merging multiple fastq files and having IndexError when calling peaks using SICER
0
0
Entering edit mode
4.3 years ago
dcheng1 • 0

I merged multiple fastq files using the command below: cat *.fastq.gz > merge.fastq.gz

The merged fastq file was succefully aligned to reference genome. But I received the following error when calling peaks using SICER: IndexError: list index out of range.

Peak calling using single fastq file works well, so the problem should from the merged files. Why merging fastq files will cause IndexError?

ChIP-Seq NGS • 1.5k views
ADD COMMENT
0
Entering edit mode

Please show all code. We cannot see your screen. You may also want to check out epic2, a high-performance reimplementation of SICER that is by far less painful to use and faster.

ADD REPLY
0
Entering edit mode

Thanks for your response! See the code below.

trim_galore merge.fastq.gz 
(bowtie2 -p 16 -x /home/Genome/mm10/mm10 -U merge_trimmed.fq.gz -S merge.sam) 2> merge_alignsum.log;

samtools view -bq 1 merge.sam > merge.bam;
samtools sort merge.bam -o merge_sort.bam > merge_sort.bam;
sambamba view -h -t 2 -f bam -F "[XS] == null and not unmapped  and not duplicate" merge_sort.bam  > merge_filter.bam
samtools index  merge_filter.bam

bedtools bamtobed -i merge_filter.bam > merge_filter.bed;
bedtools subtract -a merge_filter.bed -b /home/Genome/mm10/mm10-blacklist.v2.bed.gz > merge_no_blacklist.bed;

SICER.sh ./ merge_no_blacklist.bed input_no_blacklist.bed ./ mm10 1 1000 250 0.8 5000 1e-3
ADD REPLY
0
Entering edit mode

Is there actually something in merge_sort.bam. The sort command writes to merge_sort.bam twice, once from -o and once from stdout. Would not surprise me if the file was corrupted. What is the output of samtools flagstat merge_sort.bam? Really try epic2, it accepts bam, no need for this fiddling with BED files.

ADD REPLY
0
Entering edit mode

Yes, merge_sort.bam is a very large file. The output of samtools flagstat merge_sort.bam is :

839574239 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
839574239 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

And thanks for your suggestion, I will try epic2.

ADD REPLY
0
Entering edit mode

Almost a billion reads, wow, is that intended? For ChIP-seq? Be sure to run a duplicate marking algorithm before calling your peaks. At that depth it is almost certain that you have a lot of duplicates. I do not remember whether epic2 has a duplicate removal algorithm built-in.

ADD REPLY
0
Entering edit mode

Yes, it is. That's because I merged over 50 fastq files into a single file cat *.fastq.gz > merge.fastq.gz . SICER can remove redundant reads, so epic2 probably also can do duplicate removal.

ADD REPLY
0
Entering edit mode

Was data from the same sample?

ADD REPLY
0
Entering edit mode

No, it's from differernt samples. I'm doing so because I want to get a list of peaks that include peaks of all samples.

ADD REPLY
0
Entering edit mode

I think you need to use something like DROMPAplus (just as an example) to analyze multiple samples. Concatenating all samples into a single file does not sound like the correct way to do this. A manual is available here.

ADD REPLY
0
Entering edit mode

Okay, thanks for your information, I will look into that.

ADD REPLY
0
Entering edit mode

There are multiple ways to do that. First, catting all files together and then call peaks is acceptable if you plan to use the peak set for differential analysis. The paper on the csaw tool describes in great detail several peak calling strategies. Alternatively, use a peak caller that uses replicates, such as PePr (requires an input sample) or Genrich to get peaks that are reproducible across replicates. This obviously is only a good idea when samples belong to the same treatment group. Third, do the latter for each group separately, then merge peaks. Many ways to do that, and it depends n which question you want to answer. If these are independent replicates you want to discard replicates per file but then keep those reads that are duplicates across files since these cannot be PCR redundancies. I would spend some time reading literature and previously posts on ChIP-seq before deciding for a pipeline. Then choose the one suitable to answer your scientific question.

ADD REPLY
0
Entering edit mode

Yep, my goal is to generate a peak set for differential analysis. I've trid merging peaks using the code below. But I ended up getting super-long peaks. And this introduced many false postives to the downstream differential analysis. I will look into the csaw paper you mentioned. Thanks!

cat *.bed > consensus_peaks.bed
sortBed -i consensus_peaks.bed > consensus_peaks_sort.bed;
mergeBed -i consensus_peaks_sort.bed > consensus_peaks_merged.bed
ADD REPLY

Login before adding your answer.

Traffic: 1208 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6