I need to extract exactly 50bp upstream and downstream of a class of features dispersed across the genome, concatenate every pair to make a chimeric DNA sequence, up to 100bp in length.
Another biostars forum page suggested using a "slop | subtract | getfasta" pipeline of bedtools.
slopBed -s -i orig.bed -g genome.fai -b 50 > SLOP50.bed
slop works OK but
subtractBed -s -a <slop.bed> -b <original.bed>
yields output that appears to overlap by 1bp across all lines.
And at subtract tools page - https://bedtools.readthedocs.io/en/latest/content/tools/subtract.html the examples also show this overlap by 1bp.
$ cat A.bed
chr1 10 20
chr1 100 200
$ cat B.bed
chr1 0 30
chr1 180 300
$ bedtools subtract -a A.bed -b B.bed
chr1 100 180
But shouldn't the output, in this example, ideally be
bedtools subtract -a A.bed -b B.bed
chr1 100 179
Or am I misunderstanding something?
If not, then my chimera will have 2 additional nt - 1 each from each flanking end, which is not desirable.
So, is there a simple workaround or add-on to achieve what I have set out to do - make my 2 * 50bp chimeric DNA sequence? Thanks!
This is cross posted at https://github.com/arq5x/bedtools2/issues/680. So feel free to answer here or there. Thanks!
Actual bed files line snippets are shown below
ORIG
Chr_01 342230 360778 Aco_TE_F_1-0,15:5 15:5 + HSv1.1 TE . ID=Aco_TE_F_1-0,15:5
Chr_01 1494010 1496609 Aco_TE_R_1-0,10:7 10:7 - HSv1.1 TE . ID=Aco_TE_R_1-0,10:7
SLOP50
Chr_01 342180 360828 Aco_TE_F_1-0,15:5 15:5 + HSv1.1 TE . ID=Aco_TE_F_1-0,15:5
Chr_01 1493960 1496659 Aco_TE_R_1-0,10:7 10:7 - HSv1.1 TE . ID=Aco_TE_R_1-0,10:7
SUBTRACT = SLOP50 - ORIGINAL
Chr_01 342180 342230 Aco_TE_F_1-0,15:5 15:5 + HSv1.1 TE . ID=Aco_TE_F_1-0,15:5
Chr_01 360778 360828 Aco_TE_F_1-0,15:5 15:5 + HSv1.1 TE . ID=Aco_TE_F_1-0,15:5
Chr_01 1493960 1494010 Aco_TE_R_1-0,10:7 10:7 - HSv1.1 TE . ID=Aco_TE_R_1-0,10:7
Chr_01 1496609 1496659 Aco_TE_R_1-0,10:7 10:7 - HSv1.1 TE . ID=Aco_TE_R_1-0,10:7
Ah! half-open interval - OK, I missed that. Thank you, for that useful and illustrative example!
So for my purposes, would getFasta from the subtractBed results give the correctly extracted fasta sequence, because getfasta can distinguish between gff3 input Vs. bed input? Is gff3 fomat 1-based, and full-open intervals?
Ultimately I want to insure that the 50bp flanks, when they are concatenated, do not have more or less bases.Your advice?
Hello again,
gff3
is 1-based and uses full-closed intervalls, meaning the start and end coordinate are included.So if in interval in
bed
ischr1 100 200
it would be in gff3chr1 101 200
.bedtools getfasta
can handle both filetypes.fin swimmer
Crystal clear, thank you again for those simple, yet elegant examples...getfasta worked great with my bed files.