Question

condition bedtools on name column

0

Entering edit mode

8.8 years ago

burnsro ▴ 20

Bedtools accepts files with formats like 'chr start end name score strand'

However, is it possible to take these columns into account (e.g name column) when using bedtools intersect or bedtools merge.

For example I have a bed file like so, with a 'name' column for transposable elements

FileA:

scaffold_1  913038  914038  DNA_transposon  
scaffold_1  925018  926018  LTR

File B

scaffold_1  1026586 1027586 Mariner 
scaffold_1  925211  935660  DNA_transposon  
scaffold_1  925880  926300  LTR

but I would only like bedtools to intersect the 2nd line from FileA with the 3rd line from FileB, ignoring the 2nd line in FileB because it has "DNA_transposon" and not "LTR" in column4.

I can't find something like this in the manual file for bedtools. Knowing how to do it in bedtools would be helpful because it has the option to compare one bed file with many others (options -a and -b), in addition to many other options.

bedtools • 4.4k views

ADD COMMENT • link updated 8.8 years ago by Alex Reynolds 36k • written 8.8 years ago by burnsro ▴ 20

score 0 · Answer 1 · 2016-08-02

0

Entering edit mode

8.8 years ago

EagleEye 7.6k

1) Why don't you grep 'LTR' lines from both the files

grep -w "LTR" FileA.bed > FileA_greppedLTR.bed

grep -w "LTR" FileB.bed > FileB_greppedLTR.bed

2) Use bed tools on grepped files

ADD COMMENT • link 8.8 years ago by EagleEye 7.6k

0

Entering edit mode

True, it would just be a lot of "grepped" files to make for each unique name in column4. This example I made is just simplification.

ADD REPLY • link 8.8 years ago by burnsro ▴ 20

0

Entering edit mode

EDIT: (The -F is the delimiter but it's not needed for tab-separate, see below). While this doesn't help not make lots of files, you could just do awk -F '{print > "file_"$4}' file.bed and that'll make you a new file for each unique string in that column, then you've not got to spit each on individually.

ADD REPLY • link 8.8 years ago by Daniel ★ 4.0k

0

Entering edit mode

By the way. I need some clarification, the AWK command you mentioned does it actually work ?

ADD REPLY • link 8.8 years ago by EagleEye 7.6k

0

Entering edit mode

oops, I just realised I removed the file delimiter but not the flag (you don't need the -F if it's tab-separated). But other than that, yeah it works as this:

awk '{print > "file_"$4}' file.bed

ADD REPLY • link 8.8 years ago by Daniel ★ 4.0k

0

Entering edit mode

Great Awesome.......

ADD REPLY • link 8.8 years ago by EagleEye 7.6k

score 0 · Answer 2 · 2016-08-02

0

Entering edit mode

8.8 years ago

Pierre Lindenbaum 166k

using sqlite3

$ awk 'BEGIN {printf("create table if not exists TA (C TEXT,S int,E int, T text); BEGIN TRANSACTION;\n");} { printf("insert into TA(C,S,E,T) values (\"%s\",%s,%s,\"%s\");\n",$1,$2,$3,$4);} END {printf("commit;\n");}' fileA | sqlite3 db.sqlite
$ awk 'BEGIN {printf("create table if not exists TB (C TEXT,S int,E int, T text); BEGIN TRANSACTION;\n");} { printf("insert into TB(C,S,E,T) values (\"%s\",%s,%s,\"%s\");\n",$1,$2,$3,$4);} END {printf("commit;\n");}' fileB | sqlite3 db.sqlite
$ sqlite3 db.sqlite 'select * from TA,TB where TA.C=TB.C and TA.T=TB.T and NOT(TA.E <= TB.S or TB.E <= TA.S)'

ADD COMMENT • link 8.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

It would work. I am only hesitant to use something other than bedtools because bedtools has many useful option such as -c to count the number of files that has an interval, -f for matching only a percentage of an interval, -wa and -wb etc.

ADD REPLY • link 8.8 years ago by burnsro ▴ 20

score 0 · Answer 3 · 2016-08-02

0

Entering edit mode

8.8 years ago

EagleEye 7.6k

Sorry I could not able to add as a reply to your comment because of the format of the message. I am writing it in the answer section

Example input (BED tab-delimited):

scaffold_1 1026586 1027586 Mariner

scaffold_1 925211 935660 DNA_transposon

scaffold_1 925880 926300 LTR

scaffold_1 11111 22324 DNA_transposon

scaffold_1 1026586 1027586 Mariner

scaffold_1 21432 21444 Mariner

SCRIPT ( splitFile.sh ) :

file=$1

cols=$(cat $file | cut -f4 | awk '!x[$0]++' | tr '\n' ' ' | sed -e 's/(.*) /\1/g');

for name in $cols

{

IFS='/' read -a file_name <<< "$file";

fname=$(echo $file_name | cut -f 1 -d '.')

grep -w "$name" $file > ${fname}_${name}.bed

}

Run script:

./splitFile.sh FileB.bed

Output files:

1) FileB_DNA_transposon.bed

scaffold_1 925211 935660 DNA_transposon

scaffold_1 11111 22324 DNA_transposon

2) FileB_LTR.bed

scaffold_1 925880 926300 LTR

3) FileB_Mariner.bed

scaffold_1 1026586 1027586 Mariner

scaffold_1 21432 21444 Mariner

Run this script for FileA.bed and then compare appropriate pairs using BedTools. This is just a suggestion (idea), you can modify according to your needs.

ADD COMMENT • link 8.8 years ago by EagleEye 7.6k

0

Entering edit mode

Yes I understood what you meant by using grep in the first comment, but it would lead to a lot of pairwise comparisons to make especially with many bed files to compare. With bedtools you can use one file to compare with many at once, thats why I wrote it in the question.

ADD REPLY • link 8.8 years ago by burnsro ▴ 20

0

Entering edit mode

Now the question is not clear. You wanted to compare File A and File B rows having 'same/matching' column#4 using bedtools, that's what I understood.

ADD REPLY • link 8.8 years ago by EagleEye 7.6k

0

Entering edit mode

File A and File B are a general example, but ideally I'd like to scale up to many FileAs and FileBs. I agree your way works its just a lot of intermediates to generate.

ADD REPLY • link 8.8 years ago by burnsro ▴ 20

score 0 · Answer 4 · 2016-08-02

Generating intermediate files is unnecessary. Doing pairwise comparisons is unnecessary.

The bedops toolkit supports multiple standard input streams, so for one "condition" or ID field, you can use a process substitution to generate a stream of subsetted elements, wherever you would have otherwise specified a path to a regular file ("filename"):

$ bedops --whatever-set-operation <(awk '$4 == "DNA_transposon"' A.bed) <(awk '$4 == "DNA_transposon"' B.bed) > answer.DNA_transposon.bed

More generally:

$ id="DNA_transposon" && bedops --whatever-set-operation <(awk -vid=$id '$4==id' A.bed) <(awk -vid=$id '$4==id' B.bed) > answer.$id.bed

From here, it is easy to deal with automating the even more general case of multiple IDs.

Just set up a loop that walks through a list of unique IDs from one of the inputs via cut, sort anduniq:

$ for id in `cut -f4 A.bed | sort | uniq`; do bedops --whatever-set-operation <(awk -vid=$id '$4==id' A.bed) <(awk -vid=$id '$4==id' B.bed) > answer.$id.bed; done

If you want a list of IDs from two or more files, just use cat to concatenate them to build the ID list:

$ for id in `cat A.bed B.bed | cut -f4 | sort | uniq`; do bedops --whatever-set-operation <(awk -vid=$id '$4==id' A.bed) <(awk -vid=$id '$4==id' B.bed) > answer.$id.bed; done

Thanks to default support for Unix input streams in bedops tools and to functionality in the bash shell, all of these approaches generate no intermediate (regular) files and should run pretty quickly.