bedtools subtract still overlaps by 1bp?
1
0
Entering edit mode
5.8 years ago
Anand Rao ▴ 640

I need to extract exactly 50bp upstream and downstream of a class of features dispersed across the genome, concatenate every pair to make a chimeric DNA sequence, up to 100bp in length.

Another biostars forum page suggested using a "slop | subtract | getfasta" pipeline of bedtools.

slopBed -s -i orig.bed -g genome.fai -b 50 > SLOP50.bed

slop works OK but

subtractBed -s -a <slop.bed> -b <original.bed>

yields output that appears to overlap by 1bp across all lines.

And at subtract tools page - https://bedtools.readthedocs.io/en/latest/content/tools/subtract.html the examples also show this overlap by 1bp.

$ cat A.bed
chr1  10   20
chr1  100  200

$ cat B.bed
chr1  0    30
chr1  180  300

$ bedtools subtract -a A.bed -b B.bed
chr1  100  180

But shouldn't the output, in this example, ideally be

bedtools subtract -a A.bed -b B.bed
chr1  100  179

Or am I misunderstanding something?

If not, then my chimera will have 2 additional nt - 1 each from each flanking end, which is not desirable.

So, is there a simple workaround or add-on to achieve what I have set out to do - make my 2 * 50bp chimeric DNA sequence? Thanks!

This is cross posted at https://github.com/arq5x/bedtools2/issues/680. So feel free to answer here or there. Thanks!

Actual bed files line snippets are shown below

ORIG

Chr_01  342230  360778  Aco_TE_F_1-0,15:5   15:5    +   HSv1.1  TE  .   ID=Aco_TE_F_1-0,15:5
Chr_01  1494010 1496609 Aco_TE_R_1-0,10:7   10:7    -   HSv1.1  TE  .   ID=Aco_TE_R_1-0,10:7

SLOP50

Chr_01  342180  360828  Aco_TE_F_1-0,15:5   15:5    +   HSv1.1  TE  .   ID=Aco_TE_F_1-0,15:5
Chr_01  1493960 1496659 Aco_TE_R_1-0,10:7   10:7    -   HSv1.1  TE  .   ID=Aco_TE_R_1-0,10:7

SUBTRACT = SLOP50 - ORIGINAL

Chr_01  342180  342230  Aco_TE_F_1-0,15:5   15:5    +   HSv1.1  TE  .   ID=Aco_TE_F_1-0,15:5
Chr_01  360778  360828  Aco_TE_F_1-0,15:5   15:5    +   HSv1.1  TE  .   ID=Aco_TE_F_1-0,15:5
Chr_01  1493960 1494010 Aco_TE_R_1-0,10:7   10:7    -   HSv1.1  TE  .   ID=Aco_TE_R_1-0,10:7
Chr_01  1496609 1496659 Aco_TE_R_1-0,10:7   10:7    -   HSv1.1  TE  .   ID=Aco_TE_R_1-0,10:7
bedtools getfasta slop subtract • 2.6k views
ADD COMMENT
3
Entering edit mode
5.8 years ago

Or am I misunderstanding something?

I guess you are missing, that the intervals in bed files are 0-based half-open intervals. This means the positions starts counting by 0, the start coordinate is included but not the end coordinate.

Let's have again a look on your example.

The interval chr1 100 200 in A.bed doesn't no include the position 200. But the interval chr1 180 300 in B.bed includes position 200. This is why the result after subtracting interval B from interval A is correct with chr1 100 180. Because this description mean, that position 180 is the first one not included in the interval.

fin swimmer

ADD COMMENT
0
Entering edit mode

Ah! half-open interval - OK, I missed that. Thank you, for that useful and illustrative example!

So for my purposes, would getFasta from the subtractBed results give the correctly extracted fasta sequence, because getfasta can distinguish between gff3 input Vs. bed input? Is gff3 fomat 1-based, and full-open intervals?

Ultimately I want to insure that the 50bp flanks, when they are concatenated, do not have more or less bases.Your advice?

ADD REPLY
1
Entering edit mode

Hello again,

gff3 is 1-based and uses full-closed intervalls, meaning the start and end coordinate are included.

So if in interval in bed is chr1 100 200 it would be in gff3 chr1 101 200.

bedtools getfasta can handle both filetypes.

fin swimmer

ADD REPLY
0
Entering edit mode

Crystal clear, thank you again for those simple, yet elegant examples...getfasta worked great with my bed files.

ADD REPLY

Login before adding your answer.

Traffic: 1640 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6