Can't convert paired end BAM to bed using bedtools
1
4
Entering edit mode
9.4 years ago
colin.kern ★ 1.1k

I'm trying to convert a bam file with paired end read alignments to bed format, which is required by the F-Seq peak calling program. The bedtools bamToBed program is complaining that reads are marked as paired, but their mate doesn't appear next to it in the bam file. This is after running samtools sort -n on the bam file, which is what bedtools says to do to fix this. The first 10 lines from the bam are below.

HISEQ-2500-1:79:C62NNANXX:4:1101:1216:7741      163     gi|358485510|ref|NC_006089.3|   88359766        60      36M     =     88359773  43      AGTTGAAAAAATGTGCTCAGGCTTCCTATTGGTTAA BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    X0:i:1  X1:i:0  MD:Z:36 PG:Z:MarkDuplicates     XG:i:0  AM:i:37 NM:i:0  SM:i:37      XM:i:0  XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1216:7741      83      gi|358485510|ref|NC_006089.3|   88359773        60      36M     =     88359766  -43     AAAATGTGCTCAGGCTTCCTATTGGTTAAGTGAAAC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB    X0:i:1  X1:i:0  MD:Z:36 PG:Z:MarkDuplicates     XG:i:0  AM:i:37 NM:i:0  SM:i:37      XM:i:0  XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1216:12449     163     gi|358485505|ref|NC_006094.3|   33794145        60      36M     =     33794149  40      CAAATGGCTTTGGACTGCTTCTGCCAAATCCCAAGC BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    X0:i:1  X1:i:0  MD:Z:36 PG:Z:MarkDuplicates     XG:i:0  AM:i:37 NM:i:0  SM:i:37      XM:i:0  XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1216:12449     83      gi|358485505|ref|NC_006094.3|   33794149        60      36M     =     33794145  -40     TGGCTTTGGACTGCTTCTGCCAAATCCCAAGCCCAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB    X0:i:1  X1:i:0  MD:Z:36 PG:Z:MarkDuplicates     XG:i:0  AM:i:37 NM:i:0  SM:i:37      XM:i:0  XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1216:21049     99      gi|358485510|ref|NC_006089.3|   145399305       60      36M     =     145399308 39      CCACTGATATTTACTCTTCCTAAGCTCTAACTACAG BBBBBFBFFFFFFFFFFFFFFF<FFFFFBFFFFFFF    X0:i:1  X1:i:0  MD:Z:36 PG:Z:MarkDuplicates     XG:i:0  AM:i:37 NM:i:0  SM:i:37      XM:i:0  XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1216:21049     147     gi|358485510|ref|NC_006089.3|   145399308       60      36M     =     145399305 -39     CTGATATTTACTCTTCCTAAGCTCTAACTACAGCTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB    X0:i:1  X1:i:0  MD:Z:36 PG:Z:MarkDuplicates     XG:i:0  AM:i:37 NM:i:0  SM:i:37      XM:i:0  XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1217:14270     163     gi|358485497|ref|NC_006102.3|   9271537 60      36M     =       9271612 110     GGTTTGGACCACTGCTTTGGTTGTTTATGAGCAGAG <<BBBFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFF    X0:i:1  X1:i:0  MD:Z:36 PG:Z:MarkDuplicates     XG:i:0  AM:i:37 NM:i:0  SM:i:37 XM:i:0       XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1217:14270     83      gi|358485497|ref|NC_006102.3|   9271612 60      1S35M   =       9271537 -110    CCAATGCCACAGAAAGGAACGTTCCTGTTTTGCCTA #FFFFF/FBFB<FFFFFFFFFFB<FFFFFFF<//BB    X0:i:1  X1:i:0  XC:i:35 MD:Z:35 PG:Z:MarkDuplicates     XG:i:0  AM:i:37 NM:i:0  SM:i:37      XM:i:0  XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1217:18107     99      gi|358485509|ref|NC_006090.3|   33014530        60      36M     =     33014585  91      TAATTGAAAGTCTCAAATAGTATGTGACATGCCGTT BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    X0:i:1  X1:i:0  MD:Z:36 PG:Z:MarkDuplicates     XG:i:0  AM:i:25 NM:i:0  SM:i:37      XM:i:0  XO:i:0  XT:A:U
HISEQ-2500-1:79:C62NNANXX:4:1101:1217:18107     147     gi|358485509|ref|NC_006090.3|   33014585        60      36M     =     33014530  -91     CTGATGTATGGTGGCTGCTGAAACTGCTTGCAAAAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB    X0:i:1  X1:i:0  MD:Z:28C0T6     PG:Z:MarkDuplicates   XG:i:0    AM:i:25 NM:i:2       SM:i:25 XM:i:2  XO:i:0  XT:A:U
alignment bedtools bam • 16k views
ADD COMMENT
0
Entering edit mode

What is the output of

samtools flagstat YOURFILE.bam
ADD REPLY
0
Entering edit mode
36416130 + 0 in total (QC-passed reads + QC-failed reads)
3036198 + 0 duplicates
36413607 + 0 mapped (99.99%:-nan%)
36416130 + 0 paired in sequencing
18740866 + 0 read1
17675264 + 0 read2
32728595 + 0 properly paired (89.87%:-nan%)
34543247 + 0 with itself and mate mapped
1870360 + 0 singletons (5.14%:-nan%)
379020 + 0 with mate mapped to a different chr
379020 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLY
0
Entering edit mode

Well, according to samtools, most of your reads are properly paired. Maybe bamToBed needs all reads to be properly paired (I never used it)? You may try to filter only properly paired reads and pipe them to bamToBed.

ADD REPLY
0
Entering edit mode

Hi I did the same step too, but the problem persist. I got the bam file from samtools after bowtie pe alignment, getting this results

16801609 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3297663 + 0 mapped (19.63%:-nan%)
16801609 + 0 paired in sequencing
8395654 + 0 read1
8405955 + 0 read2
3297663 + 0 properly paired (19.63%:-nan%)
3297663 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

then did the step that you propose getting this warning

WARNING: Query HWI-ST863:339:C43P1ACXX:4:1206:6989:46429 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

I know that is a warning but the bedpe file is empty. If anyone knows or any advice I really appreciate.

ADD REPLY
0
Entering edit mode

Hello, you have to sort your BAM file by read name first, have a look at my previous answer above to see how to do that.

ADD REPLY
0
Entering edit mode

Hi, I did that you say, step by step:

samtools sort -n AR_PE.bam AR_PE.sorted
samtools fixmate AR_PE.sorted.bam AR_PE.fixed.bam
samtools sort AR_PE.fixed.bam AR_PE.fixed.sorted
bedtools bamtobed -i AR_PE.fixed.sorted.bam -bedpe > AR_PE.bedpe

and tried this too

samtools view -bf 0x2 AR_PE.fixed.sorted.bam | bedtools bamtobed -i stdin -bedpe > AR_PE.bedpe

and get the same warning, but this time is generated a bedpe file, but when I opened in IGV didn't show nothing. I don't know exactly if my bam file is corrupt or what else. The comand to generate the sam file was:

bowtie ref_genome -v 3 -X 20000 -S --best -p 6 -1 AR1.fastq -2 AR2.fastq > AR_PE.sam
ADD REPLY
0
Entering edit mode

You have to leave the BAM file sorted by read name in order to generate a BEDPE file. When you sort by read name it places the alignments from each paired read on consecutive lines, this allows bedtools to scan through the file every two lines and produce a BEDPE record. It is much more efficient then having to keep track of paired reads which could appear anywhere in the BAM file.

ADD REPLY
0
Entering edit mode

Thanks, now is generating well the bedpe without warnings. Now my question is about if should I transform the bedpe to another format as a wig maybe to visualize in IGV, because IGV opened wrongly the bedpe file and I am not acquainted with IGV.

ADD REPLY
0
Entering edit mode

To the best of my knowledge I don't think IGV can display BEDPE files. If you want to look at the coverage either load in the BAM file directly or create a bedGraph file from the BAM file (BEDtools genomecov command can do this)

ADD REPLY
0
Entering edit mode

Hi, thanks for reply. Your answer was very useful. I know that maybe correspond to another post but I need some help. Now I'm trying to get the genomecov as you recommend but have problem with the -g option, I don't know what kind of genome file is required, I read that an UCSC genome is required but my genome is not there. If you know how to make it I would appreciate.

ADD REPLY
0
Entering edit mode

Hi there, a genome file is a tab-delimited file which lists the chromosome name in the first column and the length of the chromosome in the second column. See this Biostars post for details.

ADD REPLY
0
Entering edit mode

Hi I am in a similar situation, but my problem is that the bedpe file is empty while there no warning or error during the bedtools bamtobed.

samtools view -bf 0x2 PE.sorted.fixed.bam | bedtools bamtobed -i stdin -bedpe > PE.bedpe

My PE.sorted.fixed.bam file is already been sorted: @HD VN:1.0 SO:queryname

Do you know what might be the problem?

ADD REPLY
4
Entering edit mode
9.4 years ago
James Ashmore ★ 3.5k

Are you trying to create a BEDPE file or a BED file? If you want a BED file then it shouldn't complain about pairs not being next to each other, unless you are filtering for properly-paired reads. If you want a BEDPE file however make sure you fix the SAM flags after any filtering, such as removal of PCR duplicates or low-quality alignments. These actions remove reads but don't update the SAM flags. Try the following:

# Sort by read name
samtools sort -n reads.bam reads.sorted

# Update/fix SAM flags
samtools fixmate reads.sorted.bam reads.fixed.bam

# Sort by coordinates
samtools sort reads.fixed.bam reads.fixed.sorted

# convert all reads, or...
bedtools bamtobed -i reads.fixed.sorted.bam -bedpe > reads.bedpe

# ...convert properly-paired reads
samtools view -bf 0x2 reads.fixed.sorted.bam | bedtools bamtobed -i stdin -bedpe > reads.bedpe
ADD COMMENT
0
Entering edit mode

I'm trying to create a BEDPE file. I did the steps you posted but my reads.bedpe is only ~5 million lines, where my bam file has ~33 million properly paired reads. So I think there are a lot of alignments missing. If I just created a BED file, I lose the information about paired ends? I imagine this is going to affect the results of F-Seq.

ADD REPLY
0
Entering edit mode

Could you double check your read counts by re-running the samtools flagstat command on the fixed BAM file? Also remember that the BEDPE format lists each pair as a single line, so according to the number of lines you've quoted that means you have ~5 million pairs, which means you have ~10 million reads which form pairs.

ADD REPLY
0
Entering edit mode

Ok, the read counts from the fixed BAM file match with the bamtobed output. But now I'm wondering why the number got reduced by so much.

36416130 + 0 in total (QC-passed reads + QC-failed reads)
3036198 + 0 duplicates
36413607 + 0 mapped (99.99%:-nan%)
36416130 + 0 paired in sequencing
18740866 + 0 read1
17675264 + 0 read2
10044380 + 0 properly paired (27.58%:-nan%)
10885970 + 0 with itself and mate mapped
25527637 + 0 singletons (70.10%:-nan%)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLY
1
Entering edit mode

You have ~3 million PCR duplicates, and ~25 million singletons (reads which mapped but their mate did not). Only ~10 million actually map as you would expect in pairs which are within a reasonable distance apart (Bowtie2 usually sets a default of 500bp). I'd suggest investigating why you have so many singletons. Possible that you removed one read from a pair from the previous filtering steps?

ADD REPLY

Login before adding your answer.

Traffic: 2548 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6