How to grep words containing a hyphen/dash from file A in one column of file B
7.0 years ago
Ming Lu ▴ 30

Hi, I have an A.bed file that contains only gene names, like these:

  chr1-1
  chr1-10
  chr1-102
  chr1-106
  chr1-11
  chr1-2
  chr1-3

and I know they all also appear in one column of B.bed:

chr1 startpos endpos chr1-1
chr1 startpos endpos chr1-10
chr1 startpos endpos chr1-102
chr1 startpos endpos chr1-106
chr1 startpos endpos chr1-11
chr1 startpos endpos chr1-2
chr1 startpos endpos chr1-3
chr2 startpos endpos chr2-234
chr12 startpos endpos chr12-23546

However, why does

  cut -f4 B.bed > C.bed # only use the gene name column
  comm -1 -2 A.bed C.bed

find all of them, but

  grep -w -f A.bed B.bed

only finds

  chr1-1
  chr1-2
  chr1-3

I want to use grep because comm cannot show the whole rows of B.bed.

How can I use grep to retrieve all the matching rows in B.bed?

Or, more generally, how can I extract all rows of B.bed whose value in one column matches an entry in another file?

ChIP-Seq • 4.4k views

Are the A and C files sorted?

comm -1 -2 <(sort A.bed) <(sort C.bed)

Yes, both are sorted. comm gives the right result, but grep does not return the right number of rows.


Sorry, should have read it better. Are there special characters in one of the files?

head A.bed | sed -n 'l'
head B.bed | sed -n 'l'

Have you tried the join command?

join -1 4 -2 1 B.bed A.bed
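
One thing to keep in mind with join: it expects both inputs to be sorted on the join field. Below is a minimal sketch of the command above with the sorting made explicit, assuming B.bed is tab-separated with the name in column 4. The sample rows, the scratch directory and the *.sorted file names are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory (not the real files).
tmp=$(mktemp -d)
tab=$(printf '\t')
printf 'chr1-10\nchr1-1\nchr1-2\n' > "$tmp/A.bed"
printf 'chr1\t100\t200\tchr1-1\nchr1\t300\t400\tchr1-10\nchr2\t500\t600\tchr2-234\nchr1\t700\t800\tchr1-2\n' > "$tmp/B.bed"

# Use one collation for sort and join so both agree on the order of names with "-".
export LC_ALL=C
sort "$tmp/A.bed" > "$tmp/A.sorted"                  # sort the IDs
sort -t "$tab" -k4,4 "$tmp/B.bed" > "$tmp/B.sorted"  # sort the rows on column 4

# -o 1.1,1.2,1.3,1.4 prints the B.bed columns in their original order.
joined=$(join -t "$tab" -1 4 -2 1 -o 1.1,1.2,1.3,1.4 "$tmp/B.sorted" "$tmp/A.sorted")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

The rows come out in sorted name order, not in the original order of B.bed.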
7.0 years ago
Ming Lu ▴ 30

I found code that gets all 654 matched lines, without needing to sort first:

 awk -F '\t' 'NR==FNR{a[$1]=$1;next}; ($1 in a){print $0}' a.bed b.bed > new.bed

The output follows b.bed's row order and keeps b.bed's columns.

 awk -F '\t' 'NR==FNR{a[$1]=$0;next}; ($1 in a){print a[$1]}' a.bed b.bed > new.bed

The output follows b.bed's row order but prints a.bed's columns.
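
For the layout in the original question, where the name sits in column 4 of B.bed rather than column 1, the same idea can be sketched as below. The file contents and the scratch directory are hypothetical; only the awk line matters, and it assumes a.bed is not empty.

```shell
# Hypothetical sample data: a.bed holds names, b.bed holds the name in column 4.
tmp=$(mktemp -d)
printf 'chr1-1\nchr1-10\n' > "$tmp/a.bed"
printf 'chr1\t100\t200\tchr1-1\nchr1\t300\t400\tchr1-10\nchr1\t500\t600\tchr1-102\n' > "$tmp/b.bed"

# First file: remember every name as an array key.
# Second file: print the whole row when its 4th field is one of those keys.
matched=$(awk -F '\t' 'NR==FNR { seen[$1]; next } $4 in seen' "$tmp/a.bed" "$tmp/b.bed")
printf '%s\n' "$matched"
rm -rf "$tmp"
```

Because the comparison is an exact match on the whole field, chr1-1 cannot match chr1-10 or chr1-102, and no sorting is needed.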

7.0 years ago
mittu1602 ▴ 200

If it's OK for you to use awk, use the following command:

awk 'FNR==NR{a[$1]=$4;next}{if(a[$1]==""){a[$1]=0};printf "%s%s%s%s%s%s%s%s%s\n",$1,FS,$2,FS,$3,FS,$4,FS,a[$1]}' B.bed A.bed  > result1
7.0 years ago

output:

$ grep -w -f ids.txt test.txt 
chr1    startpos    endpos  chr1-1
chr1    startpos    endpos  chr1-10
chr1    startpos    endpos  chr1-102
chr1    startpos    endpos  chr1-106
chr1    startpos    endpos  chr1-11
chr1    startpos    endpos  chr1-2
chr1    startpos    endpos  chr1-3

$ join  -1 1 -2 4 ids.txt test.txt 
chr1-1 chr1 startpos endpos
chr1-10 chr1 startpos endpos
chr1-102 chr1 startpos endpos
chr1-106 chr1 startpos endpos
chr1-11 chr1 startpos endpos
chr1-2 chr1 startpos endpos
chr1-3 chr1 startpos endpos

input:

$ cat ids.txt 
chr1-1
chr1-10
chr1-102
chr1-106
chr1-11
chr1-2
chr1-3

$ cat test.txt 
chr1    startpos    endpos  chr1-1
chr1    startpos    endpos  chr1-10
chr1    startpos    endpos  chr1-102
chr1    startpos    endpos  chr1-106
chr1    startpos    endpos  chr1-11
chr1    startpos    endpos  chr1-2
chr1    startpos    endpos  chr1-3
chr2    startpos    endpos  chr2-234
chr12   startpos    endpos  chr12-23546

You can modify the join output with

join -1 1 -2 4 -o 2.1,2.2,2.3,0 ids.txt test.txt | tr ' ' '\t'

The tr command replaces the standard white-space with a tab.


join supports TSV output natively. The output of

join -t $'\t' -1 1 -2 4 -o 2.1,2.2,2.3,0 ids.txt test.txt

is the same as that of

join -1 1 -2 4 -o 2.1,2.2,2.3,0 ids.txt test.txt | tr ' ' '\t'

7.0 years ago

Hi, is the number of rows equal in both files? Try:

grep -Fwf A.bed B.bed > Output.txt


They are not equal: A.bed has 645 rows and B.bed has 33024 rows, but every entry in A.bed comes from one column of B.bed.

Could the "-" (dash) be breaking the word boundaries that -w relies on?

I tried your command, but grep -Fwf still does not find the remaining genes.
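
For what it's worth, here is a small check of how -w treats the dash. grep counts only letters, digits and the underscore as word characters, so "-" acts as a word boundary. The sample row and the helper name hits are made up for this sketch.

```shell
# One hypothetical B.bed-style row whose name is chr1-10.
row=$(printf 'chr1\t100\t200\tchr1-10')

# Helper (our own name): does a fixed-string, whole-word pattern hit the row?
hits() {
    if printf '%s\n' "$row" | grep -q -w -F -e "$1"; then
        echo match
    else
        echo no_match
    fi
}

partial=$(hits 'chr1-1')   # followed by "0", a word character: rejected
full=$(hits 'chr1-10')     # ends at the end of the line: accepted
prefix=$(hits 'chr1')      # followed by a tab (and later by "-"): accepted

echo "chr1-1: $partial, chr1-10: $full, chr1: $prefix"
```

So the dash by itself should not make chr1-1 match chr1-10. The risk runs the other way: a bare prefix such as chr1 in the pattern file would match every chr1-* row.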


In your command "comm -1 -3 A.bed C.bed":

-1 suppresses column 1 (lines unique to file 1)
-3 suppresses column 3 (lines that appear in both files)

So with -3 you are actually suppressing the lines that A.bed and C.bed have in common.

Please try "comm -1 -2 A.bed C.bed" instead.
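
To make the three columns concrete, here is a tiny sketch with made-up files that are already sorted (comm needs sorted input). The contents and the scratch directory are hypothetical.

```shell
# Hypothetical, already sorted name lists (C locale order).
tmp=$(mktemp -d)
export LC_ALL=C
printf 'chr1-1\nchr1-10\nchr1-2\n' > "$tmp/A.bed"
printf 'chr1-1\nchr1-2\nchr2-234\n' > "$tmp/C.bed"

only_a=$(comm -2 -3 "$tmp/A.bed" "$tmp/C.bed")  # keep column 1: lines only in A.bed
only_c=$(comm -1 -3 "$tmp/A.bed" "$tmp/C.bed")  # keep column 2: lines only in C.bed
both=$(comm -1 -2 "$tmp/A.bed" "$tmp/C.bed")    # keep column 3: lines in both files

printf 'only in A: %s\nonly in C: %s\nin both:\n%s\n' "$only_a" "$only_c" "$both"
rm -rf "$tmp"
```

Note that comm compares whole lines, which is why it has to run on the extracted name column rather than on the four-column B.bed.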


That was just a typo in my post, not the actual problem.

7.0 years ago
EagleEye 7.6k
grep -w -Ff File2.txt File1.txt > commonFile1File2.txt
7.0 years ago
Ming Lu ▴ 30

First, I changed every "-" to "_" and kept only the column I grep on, but it made no difference.

All 654 rows of moVDR1220.txt should be among the 36551 rows of trytry.txt,

because moVDR1220.txt is the result of

# first convert enhancer.txt to enhancer.bed (move the name column, e.g. chr1-10, from column 1 to column 4)
# then
bedtools intersect -a enhancer.bed -b BBB.bed -wa | cut -f4 > moVDR1220.txt

and trytry.txt is the result of the following (wc -l gives 36551 for each of enhancer.txt, enhancer.bed, trytry.txt and trytry.cdt):

annotatePeaks.pl enhancer.txt hg19 -size 2000 -hist 10 -ghist -d 24hvitd/ 24heth/ > trytry.txt
# keep only the first column; go through a temporary file, because redirecting onto the input can truncate it before it is read
cut -f1 trytry.txt > trytry.tmp && mv trytry.tmp trytry.txt

So grep, join and comm should each return 654 rows.

my data is:

homer $ more moVDR1220.txt|head
chr1_1
chr1_10
chr1_102
chr1_106
chr1_11
chr1_1140
chr1_115
chr1_12
chr1_123
chr1_14
homer$ more trytry.txt|head
Gene
chr1_1
chr1_10
chr1_100
chr1_1000
chr1_10000
chr1_10025
chr1_10028
chr1_10031
chr1_10037
homer$ grep -w -f moVDR1220.txt trytry.txt | wc -l
 180 
homer$ grep -w -f moVDR1220.txt trytry.txt | head
chr1_1
chr1_2
chr1_3
chr1_4 
chr1_5
chr1_6
chr1_75
chr1_76
chr1_8
chr1_9
homer$ join -1 1 -2 1 moVDR1220.txt trytry.txt | wc -l
 389
homer$ join -1 1 -2 1 moVDR1220.txt trytry.txt | head
chr1_1
chr1_10
chr1_102
chr1_106
chr1_11
chr1_1140
chr1_115
chr1_12
chr1_123
chr1_14
homer$ comm -1 -2 moVDR1220.txt trytry.txt| wc -l
 389

I know the problem now: the "-" had no effect; it was a mistake in the bedtools step.

But I still don't know why grep cannot do this kind of thing.
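
The thread never settles why grep returned fewer rows than join. One possible cause, not confirmed here, is invisible trailing whitespace or Windows carriage returns in the ID file: join splits fields on blanks and so ignores a trailing space, whereas grep treats it as part of the pattern. A minimal sketch of that effect and a clean-up step, using made-up file names and contents:

```shell
# Hypothetical files: the second ID has a trailing space, the third a carriage return.
tmp=$(mktemp -d)
printf 'chr1_1\nchr1_10 \nchr1_102\r\n' > "$tmp/ids.txt"
printf 'chr1_1\nchr1_10\nchr1_102\nchr1_1000\n' > "$tmp/names.txt"

# With the damaged patterns, only the clean ID is found.
raw_hits=$(grep -c -F -w -f "$tmp/ids.txt" "$tmp/names.txt")

# Strip trailing whitespace (spaces, tabs, carriage returns) from every pattern.
sed 's/[[:space:]]*$//' "$tmp/ids.txt" > "$tmp/ids.clean.txt"
clean_hits=$(grep -c -F -w -f "$tmp/ids.clean.txt" "$tmp/names.txt")

echo "before clean-up: $raw_hits, after clean-up: $clean_hits"
rm -rf "$tmp"
```

Running sed -n 'l' on the ID file, as suggested earlier in the thread, would show such characters as a space before the closing $ or as \r.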
