bedtools intersect error?
5.8 years ago
star ▴ 350

I would like to intersect two files using bedtools intersect. The first file is MACS2 output (narrowPeak), which I converted to BED format using:

cat A.narrowPeak | sort -k1,1 -k2,2n | cut -f 1-4,7,9 | sed -n '/^[0-9,X]/Ip' | sed 's/^/chr/' > A.bed

The second file is a CSV containing genome coordinates, which I saved in BED format from R using:

write.table(b,  file="/path/b.bed", quote=F, sep="\t", row.names=F, col.names=F)

I then sorted it using:

sort -k1,1 -k2,2n  b.bed > sorted_b.bed

Then I ran the intersect:

 bedtools intersect -a /path/a.bed -b /path/sorted_b.bed -wao -f 0.8 > a_b_intersect.bed

but I got this error: ***** ERROR: illegal character ' ' found in integer conversion of string "10966144 ". Exiting. My output contains only chromosome 1.

The first three lines of each original file:

A.narrowPeak:

1   1624245 1624472 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_1   44  .   4.68168 8.44714 4.48990 36

 1  2143559 2143864 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_3   72  .   4.36351 11.59222    7.25891 182

1   2144136 2145751 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_4   165 .   8.28860 21.84743    16.59296    367

b.csv

chr start   end

chr16   86430087    86430726

chr16   80372593    80373755

chr16   78510608    78511944
ChIP-Seq bedtools intersect • 6.3k views

There is a whitespace character after the position 10966144.

Without knowing what A.narrowPeak or your CSV looks like, it is hard to say where this comes from.


Thanks for your reply; I have updated my post. However, I do not have position 10966144. A.narrowPeak contains 12999 peaks and a.csv contains 1555091 positions.


"but I do not have position 10966144."

What does grep tell you here?

$ grep "10966144" /path/a.bed
$ grep "10966144" /path/sorted_b.bed

You can simply remove all spaces from a file with sed (the pattern matches space characters only, so the tab separators are kept):

$ sed 's/ //g' input > output

Or if you want to overwrite the original file:

$ sed -i 's/ //g' input
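
Note that the sed commands above delete every space, including any inside a name column. A minimal sketch of a gentler alternative, assuming a tab-separated file, that trims only the leading and trailing spaces of each field. The file names demo_input.bed and demo_trimmed.bed and the data are made up for illustration:

```shell
# Build a small tab-separated demo file with stray spaces (hypothetical data).
printf ' chr1\t100\t200 \n' > demo_input.bed
printf 'chr1\t10966144 \t10966500\n' >> demo_input.bed

# Trim leading/trailing spaces in every tab-separated field; tabs are untouched.
awk 'BEGIN { FS = OFS = "\t" }
     { for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i); print }' demo_input.bed > demo_trimmed.bed

# Show the cleaned file.
cat demo_trimmed.bed
```

Spaces in the middle of a field are left alone, so this only helps when the stray spaces sit at the edges of a column.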

Thanks, it worked. But now I get this error: ***** ERROR: too many digits/characters for integer conversion in string . Exiting...


According to this post, this error appears if you have duplicate entries in one of your files. Sort your files again after you have removed the whitespace, but this time use -u to remove duplicates:

$ sort -u -k1,1 -k2,2n input > output

Thanks @finswimmer, I did that, but I still get the same error.


This explanation makes it clearer what's going on here.

Your BED files are malformed. There are lines where the second or third column doesn't contain a valid coordinate. This will print the lines where the second or third column doesn't consist of one or more digits:

$ awk -v FS="\t" '$2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/' a.bed
$ awk -v FS="\t" '$2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/' b.bed
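
To locate the offending lines in the original file, the same test can also print the line number. A small sketch on made-up data; demo_check.bed and demo_bad_lines.txt are hypothetical file names:

```shell
# Hypothetical demo data: line 2 has a date-like start, line 3 has an empty end column.
printf 'chr1\t100\t200\n' > demo_check.bed
printf 'chr22\t9-Jan\t42462950\n' >> demo_check.bed
printf 'chr2\t242335250\t\n' >> demo_check.bed

# Prefix each line whose start or end is not purely digits with its line number (NR).
awk -v FS="\t" '$2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { print NR ": " $0 }' demo_check.bed > demo_bad_lines.txt

# Show the flagged lines.
cat demo_bad_lines.txt
```

With the line numbers in hand you can go back to the source table and see how those rows were produced.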

Thanks. I ran it, and there are some malformed entries in one of my files:

chr2        242335250   
chr22   9-Jan   42462950    
chr5    15-Feb  132225350
chr7    20-Mar  35932850    
chr7    23-Feb  35917850

How can I ignore them?


Did you open and save this file in Excel?


Yes, unfortunately. My data came from an Excel file, and I think Excel changed the values.


Thanks for your help; I was able to fix it. I can now run the intersect on some of my data without any error, but for some files I get: Error: Sorted input specified, but the file /path/sorted_a.bed has the following out of order record chr10 1000824 1003242 sun_2016


OK, let's do some more awk-voodoo.

The following code checks whether the sequence name contains spaces and whether the start and end positions consist only of digits. Invalid lines are written to bad.bed. All other lines are sorted, deduplicated, and written to good.bed:

$ awk -v FS="\t" '$1 ~ / / || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { print $0 > "bad.bed"; next; } {print $0|"sort -u -k1,1 -k2,2n -k3,3n > good.bed"}' input.bed
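
Afterwards it can be worth confirming that a file really is in the order the sort keys ask for. A rough sketch using made-up file names and data. Setting LC_ALL=C so that the chromosome column is compared bytewise, independent of your locale, is an assumption on my part about what you want here:

```shell
# Hypothetical unsorted demo file.
printf 'chr10\t1000824\t1003242\tsun_2016\n' > demo_unsorted.bed
printf 'chr1\t500\t900\tpeak_a\n' >> demo_unsorted.bed
printf 'chr1\t100\t200\tpeak_b\n' >> demo_unsorted.bed

# Sort by chromosome name, then numerically by start and end.
LC_ALL=C sort -k1,1 -k2,2n -k3,3n demo_unsorted.bed > demo_sorted.bed

# sort -c only checks the order and exits non-zero at the first out-of-order line.
if LC_ALL=C sort -c -k1,1 -k2,2n -k3,3n demo_sorted.bed; then
    echo "demo_sorted.bed is sorted"
else
    echo "demo_sorted.bed is NOT sorted"
fi
```

If both input files are sorted with the same command and locale, their chromosome order should at least be consistent with each other.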

"Position" here does not mean a line number; it's one of your coordinates.

Just use grep "10966144 " on your two bed files.


The error message is quite clear, no? Try to read them carefully, they often tell you exactly what is wrong.
