Hi, I have a some ChIP-seq data sets (FASTQ files) that I want to prepare for peak calling. I did align the reads using bowtie2 -x <bowtie_index> -1 <input_fastq_file> -2 <paired_input_fastq_file> -S <output_file.sam>
, filtered the SAM file, converted to BAM and removed the PCR duplicates. I do need the files to be in BED format, but reading through the documentation of the peak calling tools that I want to use, they specify that the input files should be in BEDPE format if I want to perform the analysis on pair-ended reads. The question is: do I still need to convert to BEDPE if I already used BOWTIE2 in the pair-ended "mode"? I mean, I have the information about the mates in the BAM files, but when I convert to BEDPE I get a bunch of ****WARNING: Query ....... is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
The output of samtools flagstat <input_file.bam>
on one of the files passed through the aforementioned pipeline is:
47362874 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
47362874 + 0 mapped (100.00% : N/A)
47362874 + 0 paired in sequencing
24186824 + 0 read1
23176050 + 0 read2
47074699 + 0 properly paired (99.39% : N/A)
47148896 + 0 with itself and mate mapped
213978 + 0 singletons (0.45% : N/A)
44868 + 0 with mate mapped to a different chr
44695 + 0 with mate mapped to a different chr (mapQ>=5)
Thank you for your help.
Hi Alexander. Thanks for the detailed and useful information. Reading through the
bedtools bamtobed
documentation I see indeed that they specify that if you want to use the-bedpe
option, the BAM file should be sorted by read name. I was usingsamtools sort -o <output_file>
without the-n
option. I agree the BEDPE format looks bulky. Nevertheless, my question is still the same: do I really need to create BEDPE files if in the alignment process I used bowtie2 in the pair-ended "mode" or the normal BED format is enough for the peak calling? Thanks!A BEDPE file created from the bowtie2 paired BAM file is a necessary intermediate from which you can create your BED file, as suggested above.
Hi Ian, thanks for your reply. So, I actually have to create BEDPE files from the output of bowtie then. Isn't bowtie doing the 'right" thing with the pair-ended files while creating the BAM file and putting both mates as entries in the file? (BTW, I do not need to create a BED from the BEDPE)
Lets make it simple: Which peak caller are you using? I am asking because it highly depends on how the program reads in your data, so essentially if it is smart enough to recognize a paired-end format or not. You use MACS?
I am using both MACS and JAMM. Regarding pair-ended reads, MACS is smart enough to detect them. For JAMM, I need BEDPE (or BED if it is formatted to contain all pair-end information in the first 3 columns..). I want to use JAMM also because supposedly it's able to identify narrower peaks than MACS (i.e., discriminate between two peaks being close to each other and not consider them as only one peak).
OK. I've performed a bunch of tests, but still no decent conclusion has been reached. If I use single-end reads as input in MACS2, from both
BAM
andBED
formats I get identical resulting peak files. If I use pair-ended reads inBAMPE
andBEDPE
I get different resulting peak files using the same data set. Line count peak file:using
BAMPE
: 58330using "classic"
BED
created from BAMPE: 90685using
BEDPE
created withsamtools sort -l 0 -n -O bam <input_bampe_file> | bedtools bamtobed -bedpe -i - > <output_bedpe_file>
: 67720using
BEDPE
created withsamtools sort -l 0 -n -O bam <input_bampe_file> | samtools view -bf 0x2 | bedtools bamtobed -bedpe -i - > <output_bedpe_file>
(which keeps only the properly-paired reads) : 67711using not the standard
BEDPE
format, butcut -f 1,2,6 <input_bedpe_file> | sort -k1,1 -k2,2n -k3,3n > <output_bed_file>
: 60998:/ reading through the doc of MACS2 and it looks like (still not clear enough for me) they use a special BEDPE format which should be in line with what Alexander suggested and I used in the last test: "The BEDPE format is a simplified and more flexible BED format, which only contains the first three columns defining the chromosome name, left and right position of the fragment from Paired-end sequencing. Note, this is NOT the same format used by BEDTOOLS, and BEDTOOLS version of BEDPE is actually not in a standard BED format. " Any idea why?
I still did not manage to get same results between the two formats as it was the case for single-end reads when passing BAM and BED
What 'd' values do you get? The mean fragment length for single end reads is determined by cross-correlation, and for paired-end the mean of the known fragment lengths is used. You could use the same value of 'd' by using --nomodel --extsize 200.
Hi Ian, sorry for the late reply. I will give it a try and come back with an answer in the following days. Meanwhile, I've discovered that some reads were mapped on different chromosomes (i.e., 1st read on chr1 for instance and the corresponding paired read on chr16). That, I guess it's
bowtie
. I have removed those lines and will perform new test as soon as I have some time.Hi Ian, I have tried with the suggested parameter settings, but to no success. I have just accepted the fact that it does not produce the same results using BAMPE and BEDPE. I have contacted the developers/supporting team several times about this issue and other encountered on the way, but they did not bother to reply.