6.2 years ago
dzisis1986
I used bedtools intersect like this:
bedtools intersect -a test.bed -b fragments.bed -wa -wb -loj > test.txt
fragments.bed is a file with 3 columns (chr, start, end). test.bed is a file with 4 columns that includes counts (chr, start, end, count).
The result of the intersect looks like this:
chr19 61030687 61040876 0 -1 -1 0
chr19 61040883 61041418 0 -1 -1 0
chr19 61041425 61041896 0 -1 -1 0
chr19 61041903 61042676 0 -1 -1 0
chr19 61042683 61044693 0 -1 -1 0
chr19 61044700 61045007 0 -1 -1 0
chr19 61045014 61048846 0 -1 -1 0
chr19 61048853 61051147 0 -1 -1 0
chr19 61051154 61051161 0 -1 -1 0
chr19 61051168 61055066 0 -1 -1 0
chr19 61055073 61059079 chr19 61057150 61059534 3
chr19 61059086 61065281 chr19 61057150 61059534 3
chr19 61065288 61065491 0 -1 -1 0
chr19 61065498 61069950 0 -1 -1 0
chr19 61069957 61070313 0 -1 -1 0
chr19 61070320 61071203 0 -1 -1 0
chr19 61071210 61074042 0 -1 -1 0
chr19 61074049 61076962 0 -1 -1 0
chr19 61076969 61078370 0 -1 -1 0
chr19 61078377 61084129 chr19 61079739 61080558 10
chr19 61084136 61085306 0 -1 -1 0
chr19 61110306 61112208 0 -1 -1 0
chr19 61130752 61131999 0 -1 -1 0
chr19 61132006 61139461 0 -1 -1 0
chr19 61139468 61142499 0 -1 -1 0
chr19 61142506 61144492 0 -1 -1 0
chr19 61144499 61144577 0 -1 -1 0
chr19 61144584 61147571 chr19 61146043 61148013 8
chr19 61147578 61147680 chr19 61146043 61148013 8
chr19 61147687 61148346 chr19 61146043 61148013 8
chr19 61148353 61149397 0 -1 -1 0
chr19 61149404 61149653 0 -1 -1 0
chr19 61149660 61150034 0 -1 -1 0
This is not what I want, because it contains duplicate counts that are not in the original file. I would like to filter it to get something like this:
chr19 61030687 61040876 0 -1 -1 0
chr19 61040883 61041418 0 -1 -1 0
chr19 61041425 61041896 0 -1 -1 0
chr19 61041903 61042676 0 -1 -1 0
chr19 61042683 61044693 0 -1 -1 0
chr19 61044700 61045007 0 -1 -1 0
chr19 61045014 61048846 0 -1 -1 0
chr19 61048853 61051147 0 -1 -1 0
chr19 61051154 61051161 0 -1 -1 0
chr19 61051168 61055066 0 -1 -1 0
chr19 61055073 61059079 chr19 61057150 61059534 3
chr19 61059086 61065281 0 -1 -1 0
chr19 61065288 61065491 0 -1 -1 0
chr19 61065498 61069950 0 -1 -1 0
chr19 61069957 61070313 0 -1 -1 0
chr19 61070320 61071203 0 -1 -1 0
chr19 61071210 61074042 0 -1 -1 0
chr19 61074049 61076962 0 -1 -1 0
chr19 61076969 61078370 0 -1 -1 0
chr19 61078377 61084129 chr19 61079739 61080558 10
chr19 61084136 61085306 0 -1 -1 0
chr19 61110306 61112208 0 -1 -1 0
chr19 61130752 61131999 0 -1 -1 0
chr19 61132006 61139461 0 -1 -1 0
chr19 61139468 61142499 0 -1 -1 0
chr19 61142506 61144492 0 -1 -1 0
chr19 61144499 61144577 0 -1 -1 0
chr19 61144584 61147571 chr19 61146043 61148013 8
chr19 61147578 61147680 0 -1 -1 0
chr19 61147687 61148346 0 -1 -1 0
chr19 61148353 61149397 0 -1 -1 0
chr19 61149404 61149653 0 -1 -1 0
chr19 61149660 61150034 0 -1 -1 0
Any help doing this in R or Python? Thanks.
Do you have any experience in R or Python to try something by yourself?
Yes, I do. I tried some manipulations, but I can't see how to leave the first 3 columns untouched and filter only columns 4-6, while keeping the remaining rows with 0 and -1 as they are!
Please edit your post and show us what you did, even if it does not work.
What do you mean by:
This is correct: the 2 fragments below overlap with the test record, because
61057150 < 61059079 < 61059534
and 61057150 < 61059086 < 61059534
The two fragments are in fragments.bed.
The record they overlap is in test.bed.
For this specific result, what do you want as output?
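To make the overlap argument above concrete, here is a minimal Python sketch of the test. It assumes standard BED coordinates (0-based, half-open), and the `overlaps` helper and variable names are made up for illustration. The coordinates are copied from the output in the question.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Return True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

# The record carrying the count 3 in the question's output.
count_record = (61057150, 61059534)

# Three consecutive intervals from the first three columns of the output.
intervals = [
    (61055073, 61059079),
    (61059086, 61065281),
    (61065288, 61065491),
]

hits = [iv for iv in intervals if overlaps(*iv, *count_record)]
print(hits)  # [(61055073, 61059079), (61059086, 61065281)]
```

Two intervals overlap the same record, so bedtools reports that record once for each of them.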
I suggest you specify the ultimate goal and optimize the bedtools command. As Bastien has pointed out, there's no error in the bedtools output and I have the feeling that using a different set of bedtools options may get you what you want. In order to help with that, we would need to know what exactly it is you're looking for and how you would decide what the "wrong" output line was.
For example this one -- which line would you keep and why?
Hello dzisis1986,
please show us the exact command you used and what your input files look like.
Thanks.
fin swimmer
Could you also paste the first lines of test.bed and fragments.bed, please?
test.bed can be a file like this:
and fragments.bed is a file like this:
This is probably because a single record from test.bed overlaps two ranges in fragments.bed. For example, the record from test.bed (a single line: chr19 61057150 61059534 3) overlaps two ranges in fragments.bed (two lines: chr19 61055073 61059079 and chr19 61059086 61065281). Hence you see lines that look like duplicates (which, in fact, they are not). Or I got your issue wrong.
Yes, this is the reason, but I don't want this duplicate. That is why I want to remove it manually after the intersect.
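One way to do that manual filtering in Python, as a minimal sketch only: keep each overlapped record on the first row where it appears and reset later repeats to the `0 -1 -1 0` placeholder used in the output above. The function name `keep_first_overlap` is hypothetical, and the sketch assumes 7 whitespace-separated columns with rows in the order bedtools wrote them.

```python
PLACEHOLDER = ("0", "-1", "-1", "0")  # columns 4-7 of a row without overlap

def keep_first_overlap(lines):
    """Keep columns 4-7 only the first time a given record appears; blank out repeats."""
    seen = set()
    result = []
    for line in lines:
        fields = line.split()
        left, right = fields[:3], tuple(fields[3:7])
        if right != PLACEHOLDER:
            if right in seen:
                right = PLACEHOLDER  # repeated record: treat the row as "no overlap"
            else:
                seen.add(right)
        result.append("\t".join(left + list(right)))
    return result

# Three rows taken from the output in the question.
rows = [
    "chr19 61051168 61055066 0 -1 -1 0",
    "chr19 61055073 61059079 chr19 61057150 61059534 3",
    "chr19 61059086 61065281 chr19 61057150 61059534 3",
]
filtered = keep_first_overlap(rows)
print("\n".join(filtered))
```

For a real file, the same function can be fed an open file handle instead of the `rows` list. Note that this drops the information that the second interval also overlaps the record.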
What is the expected output, dzisis1986? Like this:
instead of:
If you don't want to lose the information and at the same time want to remove the apparent duplicate lines, you can do this:
I would go with this in case you need to reconstruct in future:
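As an illustration of the idea (not necessarily the code originally posted), here is a minimal Python sketch that writes each overlapped record once and collapses every interval it overlaps into a comma-separated list, so nothing is lost and the rows can be reconstructed later. The function name `collapse_by_record` is hypothetical, the 7-column layout is taken from the question, and rows without an overlap are skipped here by assumption.

```python
PLACEHOLDER = ("0", "-1", "-1", "0")  # columns 4-7 of a row without overlap

def collapse_by_record(lines):
    """Group rows by the record in columns 4-7 and list the intervals it overlaps."""
    groups = {}  # dicts keep insertion order, so records stay in file order
    for line in lines:
        fields = line.split()
        record = tuple(fields[3:7])
        if record == PLACEHOLDER:
            continue  # assumption: rows without overlap are not reported
        groups.setdefault(record, []).append(f"{fields[1]}-{fields[2]}")
    return ["\t".join(record + (",".join(ivs),)) for record, ivs in groups.items()]

# Three rows taken from the output in the question.
rows = [
    "chr19 61144584 61147571 chr19 61146043 61148013 8",
    "chr19 61147578 61147680 chr19 61146043 61148013 8",
    "chr19 61147687 61148346 chr19 61146043 61148013 8",
]
collapsed = collapse_by_record(rows)
print("\n".join(collapsed))
```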
If, for any reason, you want to count how many ranges each record overlaps, you can use the following code:
PS: posted as a subpost because the previous post was getting long and confusing.
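A minimal Python sketch of such a count (again, not necessarily the code originally posted), assuming the 7-column output shown in the question and using a hypothetical function name `count_overlaps`:

```python
from collections import Counter

PLACEHOLDER = ("0", "-1", "-1", "0")  # columns 4-7 of a row without overlap

def count_overlaps(lines):
    """Count how many rows (intervals) each record in columns 4-7 appears on."""
    counts = Counter()
    for line in lines:
        record = tuple(line.split()[3:7])
        if record != PLACEHOLDER:
            counts[record] += 1
    return counts

# Rows taken from the output in the question.
rows = [
    "chr19 61078377 61084129 chr19 61079739 61080558 10",
    "chr19 61084136 61085306 0 -1 -1 0",
    "chr19 61144584 61147571 chr19 61146043 61148013 8",
    "chr19 61147578 61147680 chr19 61146043 61148013 8",
    "chr19 61147687 61148346 chr19 61146043 61148013 8",
]
overlap_counts = count_overlaps(rows)
for record, n in overlap_counts.items():
    print("\t".join(record), n, sep="\t")
```

Any record with a count above 1 is one that shows up as an apparent duplicate in the intersect output.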
Do you want a single fragment in each test bin?
If you do something like this:
you will lose information about this fragment:
If you just want to remove duplicates, sort followed by uniq will do the job on Linux.
E.g.
Sorry, I misunderstood the question.