Question

extracting a column by grepping particular rows

0

Entering edit mode

7.3 years ago

vinayjrao ▴ 260

Hi, I have two files -

file_a

gene_name sample_1 sample_2 sample_3

gene_1 count_1count_1 count_1

gene_2 count_2 count_2 count_2

gene_5 count_3 count_3 count_3

gene_6 count_4 count_4 count_4

file_b

gene_name start end gene_length

gene_1 start_1 end_1 length_1

gene_2 start_2 end_2 length_2

gene_3 start_3 end_3 length_3

gene_4 start_4 end_4 length_4

gene_5 start_5 end_5 length_5

gene_6 start_6 end_6 length_6

gene_7 start_7 end_7 length_7

I want to get the gene lengths for all the genes present in file_a.

I tried using grep, but I don't think there is a way to grep column 1 of file_a with file_b

I also tried grep by taking only the first 2 columns, but it didn't work. Is there a simpler way?

grep awk • 1.9k views

ADD COMMENT • link updated 7.3 years ago by cpad0112 21k • written 7.3 years ago by vinayjrao ▴ 260

score 1 · Answer 1 · 2017-12-26

output:

$ join -1 1 -2 1 test1.txt test2.txt -t $'\t'
gene_name   sample_1    sample_2    sample_3    start   end gene_length
gene_1  count_1 count_1 count_1 start_1 end_1   length_1
gene_2  count_2 count_2 count_2 start_2 end_2   length_2
gene_5  count_3 count_3 count_3 start_5 end_5   length_5
gene_6  count_4 count_4 count_4 start_6 end_6   length_6

To get gene length only:

$ join -1 1 -2 1 test1.txt test2.txt -o 1.1,1.2,1.3,1.4,2.4  -t $'\t' 
gene_name   sample_1    sample_2    sample_3    gene_length
gene_1  count_1 count_1 count_1 length_1
gene_2  count_2 count_2 count_2 length_2
gene_5  count_3 count_3 count_3 length_5
gene_6  count_4 count_4 count_4 length_6

input:

$ cat test1.txt 
gene_name   sample_1    sample_2    sample_3
gene_1  count_1 count_1 count_1
gene_2  count_2 count_2 count_2
gene_5  count_3 count_3 count_3
gene_6  count_4 count_4 count_4

$ cat test2.txt 
gene_name   start   end gene_length
gene_1  start_1 end_1   length_1
gene_2  start_2 end_2   length_2
gene_3  start_3 end_3   length_3
gene_4  start_4 end_4   length_4
gene_5  start_5 end_5   length_5
gene_6  start_6 end_6   length_6
gene_7  start_7 end_7   length_7

score 0 · Answer 2 · 2017-12-26

0

Entering edit mode

7.3 years ago

Charles Plessy ★ 2.9k

If your data is not too large, a Unix shell command may be enough:

Here I create files containing your example data, assuming you really meant that it is space-separated.

cat > counts.txt <<_END_
gene_name sample_1 sample_2 sample_3
gene_1 count_1count_1 count_1
gene_2 count_2 count_2 count_2
gene_5 count_3 count_3 count_3
gene_6 count_4 count_4 count_4
_END_

cat > lengths.txt <<_END_
gene_name start end gene_length
gene_1 start_1 end_1 length_1
gene_2 start_2 end_2 length_2
gene_3 start_3 end_3 length_3
gene_4 start_4 end_4 length_4
gene_5 start_5 end_5 length_5
gene_6 start_6 end_6 length_6
gene_7 start_7 end_7 length_7
_END_

_(By the way, providing example data in such a way facilitates the task of those who try to answer your question. Have a try next time !)_

Then, I drop the first (header) line of your first table (sed 1d), I cut the first column, I pass it to grep via the standard input as a list of patterns to match (--file -), I specify that the matches must cover the whole word (otherwise, gene_1 would also match gene_12, etc.), I filter the second table and cut its third column, which contains the lengths.

sed 1d counts.txt |
  cut -f1 -d ' ' |
  grep --file - --word-regexp lengths.txt |
  cut -f4 -d ' '

Beware that it will all break apart if there are variations in the format of your tables.

ADD COMMENT • link 7.3 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Dear Charles,

Thanks for your answer. My data is tab-separated, but I shall try the suggestion and update you.

-edit-

I tried the suggestion, but wc -l file_a gives 24424, while wc -l of the output gives 23133 lines.

file_b was downloaded from ucsc (known_gene, latest version). Could there be entries missing in it?

Thanks.

ADD REPLY • link 7.3 years ago by vinayjrao ▴ 260

0

Entering edit mode

You can also try following command

awk '{print $1 }' file_a | grep --file - file_b | awk '{print $1"\t"$NF}'

ADD REPLY • link 7.3 years ago by MSM55 ▴ 160

0

Entering edit mode

Dear Suraj,

Thanks for the suggestion, but could you please explain what the command does, because wc -l gave me 36184, while wc -l file_a gives 24424.