Question

How to Bedtools intersect two different sorted BED files

0

Entering edit mode

4.7 years ago

caro-ca ▴ 20

Hey! I'm trying to determine the absence of transposon elements on a non-reference genome by looking at split reads. For that, I mapped the reference genome against a pair of reads using STAR. Consequently, I sorted and indexed the resulting bam file and converted split junction reads with Bam2hints. The output of Bam2hints was a gff file that was converted to a bed file by using gff2bed

gff2bed < myfile.gff > my_sorted_file.bed

Finally, to count if there are hints/junctions spanning a TEs I tried to use Bedtools intersect:

bedtools intersect -a TE_file.bed -b my_sorted_file.bed

(TE_file.bed is the bed file containing the transposon element coordinates)

These 2 Bed files look like this:
- TE_file.bed

 chrI    22231   22552   chrI_s1+_TY1    .       +       reannotate      transposable_element    .       ID=chrI_s1+_TY1
    chrI    138753  138990  chrI_s2-_TY1    .       -       reannotate      transposable_element    .       ID=chrI_s2-_TY1
    chrI    160105  160237  chrI_st1-_TY1   .       -       reannotate      transposable_element    .       ID=chrI_st1-_TY1
    chrI    160238  166163  chrI_u1-_TY1    .       -       reannotate      transposable_element    .       ID=chrI_u1-_TY1
    chrI    182613  182953  chrI_s4+_TY3    .       +       reannotate      transposable_element    .       ID=chrI_s4+_TY3

-my_sorted_file.bed

 2micron 1674    4928    .       0       .       b2h     intron  .       pri=4;src=E
chrI    5028    15053   .       0       .       b2h     intron  .       pri=4;src=E
chrI    6306    7155    .       0       .       b2h     intron  .       mult=2;pri=4;src=E
chrI    6826    16996   .       0       .       b2h     intron  .       pri=4;src=E
chrI    6827    9072    .       0       .       b2h     intron  .       mult=2;pri=4;src=E
chrI    10926   13026   .       0       .       b2h     intron  .       pri=4;src=E
chrI    12053   24485   .       0       .       b2h     intron  .       mult=2;pri=4;src=E
chrI    12061   21076   .       0       .       b2h     intron  .       pri=4;src=E

A piece of the output looks like this:

***** WARNING: File Gallone_hints.bed has inconsistent naming convention for record:
chrI    5028    15053   .   0   .   b2h intron  .   pri=4;src=E

chrI    22231   22552   chrI_s1+_TY1    .   +   reannotate  transposable_element    ID=chrI_s1+_TY1
chrI    138753  138990  chrI_s2-_TY1    .   -   reannotate  transposable_element    ID=chrI_s2-_TY1
chrI    160105  160237  chrI_st1-_TY1   .   -   reannotate  transposable_element    ID=chrI_st1-_TY1
chrI    160238  166163  chrI_u1-_TY1    .   -   reannotate  transposable_element    ID=chrI_u1-_TY1

How can I solve this problem? Which columns should I remove, reorder? Thank you in advance.

bedtools • 1.5k views

ADD COMMENT • link updated 4.7 years ago by ATpoint 85k • written 4.7 years ago by caro-ca ▴ 20

score 2 · Accepted Answer · 2020-02-28

2

Entering edit mode

4.7 years ago

ATpoint 85k

I guess the issue is that some chromosomes start with a character and some start with a number. The simplest would be temporarily add any dummy prefix to the chromosome column so that all chromosomes start with a character. Then after the intersect simply remove it again:

bedtools intersect \
  -a <(awk 'OFS="\t" {$1="DUMMYNAME"$1} {print'} file1.bed) \
  -b <(awk 'OFS="\t" {$1="DUMMYNAME"$1} {print'} file2.bed) \
  | awk 'FS=OFS="\t" {gsub("DUMMYNAME","");print}' > output.bed

ADD COMMENT • link 4.7 years ago by ATpoint 85k

0

Entering edit mode

Thank you for your reply. Since there was one chromosome starting with a number I just added a character to it. Your code did not work for me so I did the adding manually. This is the stdout:

bedtools intersect -a Gallone_hints.bed -b reference_TE_locations.bed 
    ***** WARNING: File Gallone_hints.bed has inconsistent naming convention for record:
    c2micron    1674    4928    .   0   .   b2h intron  .   pri=4;src=E

    chrI    22231   22552   .   0   .   b2h intron  .   mult=2;pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   mult=3;pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   mult=3;pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   mult=6;pri=4;src=E
    chrXVI  856550  856747  .   0   .   b2h intron  .   mult=3;pri=4;src=E
    chrXVI  850624  856550  .   0   .   b2h intron  .   mult=3;pri=4;src=E
    chrXVI  856550  856881  .   0   .   b2h intron  .   mult=22;pri=4;src=E
    chrXVI  850624  856550  .   0   .   b2h intron  .   mult=22;pri=4;src=E
    chrXVI  933066  933400  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  933538  933675  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  933066  933400  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  933538  933675  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  933538  933675  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  937875  937992  .   0   .   b2h intron  .   pri=4;src=E
    ***** WARNING: File Gallone_hints.bed has inconsistent naming convention for record:
    c2micron    1674    4928    .   0   .   b2h intron  .   pri=4;src=E

ADD REPLY • link 4.7 years ago by caro-ca ▴ 20

1

Entering edit mode

The making it chr2micron, so have the same prefix everywhere, so chr.

ADD REPLY • link 4.7 years ago by ATpoint 85k

0

Entering edit mode

Thank you for your help and time! It worked.

ADD REPLY • link 4.7 years ago by caro-ca ▴ 20