prepare pair-ended reads for peak calling
1
1
Entering edit mode
7.7 years ago
Marius ▴ 30

Hi, I have a some ChIP-seq data sets (FASTQ files) that I want to prepare for peak calling. I did align the reads using bowtie2 -x <bowtie_index> -1 <input_fastq_file> -2 <paired_input_fastq_file> -S <output_file.sam>, filtered the SAM file, converted to BAM and removed the PCR duplicates. I do need the files to be in BED format, but reading through the documentation of the peak calling tools that I want to use, they specify that the input files should be in BEDPE format if I want to perform the analysis on pair-ended reads. The question is: do I still need to convert to BEDPE if I already used BOWTIE2 in the pair-ended "mode"? I mean, I have the information about the mates in the BAM files, but when I convert to BEDPE I get a bunch of ****WARNING: Query ....... is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. The output of samtools flagstat <input_file.bam> on one of the files passed through the aforementioned pipeline is:

47362874 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
47362874 + 0 mapped (100.00% : N/A)
47362874 + 0 paired in sequencing
24186824 + 0 read1
23176050 + 0 read2
47074699 + 0 properly paired (99.39% : N/A)
47148896 + 0 with itself and mate mapped
213978 + 0 singletons (0.45% : N/A)
44868 + 0 with mate mapped to a different chr
44695 + 0 with mate mapped to a different chr (mapQ>=5)

Thank you for your help.

ChIP-Seq pair-ended bedtools • 4.0k views
ADD COMMENT
2
Entering edit mode
7.7 years ago
ATpoint 85k

Your file seems to be sorted by coordinate, not by name. Bedtools bamtobed required input files being name-sorted, e.g. by samtools sort with the -n option. You can combine the process in a simple pipeline:

samtools sort -l 0 -n -O bam input.bam | bedtools bamtobed -bedpe  -i - > out.bedpe

I personally find the bedpe format a bit bulky. In case you want a normal bed with the paired-end information, you can use the Unix cut command to create a conventional bed from a bedpe, simply using the end of the reverse mate read as end coordinate:

cut -f 1,2,6,7,8,9 in.bedpe | gsort -k1,1 -k2,2n -k3,3n > out.bed

or as a pipe directly from bedtools:

samtools sort -l 0 -n -O bam input.bam | bedtools bamtobed -bedpe  -i - |  cut -f 1,2,6,7,8,9  | gsort -k1,1 -k2,2n -k3,3n > out.bed

This is useful for applications like bedtools intersect, when you want to use the paired-end information for intersection operations.

ADD COMMENT
0
Entering edit mode

Hi Alexander. Thanks for the detailed and useful information. Reading through the bedtools bamtobed documentation I see indeed that they specify that if you want to use the -bedpe option, the BAM file should be sorted by read name. I was using samtools sort -o <output_file> without the -n option. I agree the BEDPE format looks bulky. Nevertheless, my question is still the same: do I really need to create BEDPE files if in the alignment process I used bowtie2 in the pair-ended "mode" or the normal BED format is enough for the peak calling? Thanks!

ADD REPLY
0
Entering edit mode

A BEDPE file created from the bowtie2 paired BAM file is a necessary intermediate from which you can create your BED file, as suggested above.

ADD REPLY
0
Entering edit mode

Hi Ian, thanks for your reply. So, I actually have to create BEDPE files from the output of bowtie then. Isn't bowtie doing the 'right" thing with the pair-ended files while creating the BAM file and putting both mates as entries in the file? (BTW, I do not need to create a BED from the BEDPE)

ADD REPLY
0
Entering edit mode

Lets make it simple: Which peak caller are you using? I am asking because it highly depends on how the program reads in your data, so essentially if it is smart enough to recognize a paired-end format or not. You use MACS?

ADD REPLY
0
Entering edit mode

I am using both MACS and JAMM. Regarding pair-ended reads, MACS is smart enough to detect them. For JAMM, I need BEDPE (or BED if it is formatted to contain all pair-end information in the first 3 columns..). I want to use JAMM also because supposedly it's able to identify narrower peaks than MACS (i.e., discriminate between two peaks being close to each other and not consider them as only one peak).

ADD REPLY
0
Entering edit mode

OK. I've performed a bunch of tests, but still no decent conclusion has been reached. If I use single-end reads as input in MACS2, from both BAM and BED formats I get identical resulting peak files. If I use pair-ended reads in BAMPE and BEDPE I get different resulting peak files using the same data set. Line count peak file:

using BAMPE: 58330

using "classic" BED created from BAMPE: 90685

using BEDPE created with samtools sort -l 0 -n -O bam <input_bampe_file> | bedtools bamtobed -bedpe -i - > <output_bedpe_file>: 67720

using BEDPE created with samtools sort -l 0 -n -O bam <input_bampe_file> | samtools view -bf 0x2 | bedtools bamtobed -bedpe -i - > <output_bedpe_file> (which keeps only the properly-paired reads) : 67711

using not the standard BEDPE format, but cut -f 1,2,6 <input_bedpe_file> | sort -k1,1 -k2,2n -k3,3n > <output_bed_file>: 60998

:/ reading through the doc of MACS2 and it looks like (still not clear enough for me) they use a special BEDPE format which should be in line with what Alexander suggested and I used in the last test: "The BEDPE format is a simplified and more flexible BED format, which only contains the first three columns defining the chromosome name, left and right position of the fragment from Paired-end sequencing. Note, this is NOT the same format used by BEDTOOLS, and BEDTOOLS version of BEDPE is actually not in a standard BED format. " Any idea why?

I still did not manage to get same results between the two formats as it was the case for single-end reads when passing BAM and BED

ADD REPLY
0
Entering edit mode

What 'd' values do you get? The mean fragment length for single end reads is determined by cross-correlation, and for paired-end the mean of the known fragment lengths is used. You could use the same value of 'd' by using --nomodel --extsize 200.

ADD REPLY
0
Entering edit mode

Hi Ian, sorry for the late reply. I will give it a try and come back with an answer in the following days. Meanwhile, I've discovered that some reads were mapped on different chromosomes (i.e., 1st read on chr1 for instance and the corresponding paired read on chr16). That, I guess it's bowtie. I have removed those lines and will perform new test as soon as I have some time.

ADD REPLY
0
Entering edit mode

Hi Ian, I have tried with the suggested parameter settings, but to no success. I have just accepted the fact that it does not produce the same results using BAMPE and BEDPE. I have contacted the developers/supporting team several times about this issue and other encountered on the way, but they did not bother to reply.

ADD REPLY

Login before adding your answer.

Traffic: 2458 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6