extracting a column by grepping particular rows
2
0
Entering edit mode
7.0 years ago
vinayjrao ▴ 260

Hi, I have two files -

file_a

gene_name sample_1 sample_2 sample_3

gene_1 count_1count_1 count_1

gene_2 count_2 count_2 count_2

gene_5 count_3 count_3 count_3

gene_6 count_4 count_4 count_4

file_b

gene_name start end gene_length

gene_1 start_1 end_1 length_1

gene_2 start_2 end_2 length_2

gene_3 start_3 end_3 length_3

gene_4 start_4 end_4 length_4

gene_5 start_5 end_5 length_5

gene_6 start_6 end_6 length_6

gene_7 start_7 end_7 length_7

I want to get the gene lengths for all the genes present in file_a.

I tried using grep, but I don't think there is a way to grep column 1 of file_a with file_b

I also tried grep by taking only the first 2 columns, but it didn't work. Is there a simpler way?

grep awk • 1.7k views
ADD COMMENT
1
Entering edit mode
7.0 years ago

output:

$ join -1 1 -2 1 test1.txt test2.txt -t $'\t'
gene_name   sample_1    sample_2    sample_3    start   end gene_length
gene_1  count_1 count_1 count_1 start_1 end_1   length_1
gene_2  count_2 count_2 count_2 start_2 end_2   length_2
gene_5  count_3 count_3 count_3 start_5 end_5   length_5
gene_6  count_4 count_4 count_4 start_6 end_6   length_6

To get gene length only:

$ join -1 1 -2 1 test1.txt test2.txt -o 1.1,1.2,1.3,1.4,2.4  -t $'\t' 
gene_name   sample_1    sample_2    sample_3    gene_length
gene_1  count_1 count_1 count_1 length_1
gene_2  count_2 count_2 count_2 length_2
gene_5  count_3 count_3 count_3 length_5
gene_6  count_4 count_4 count_4 length_6

input:

$ cat test1.txt 
gene_name   sample_1    sample_2    sample_3
gene_1  count_1 count_1 count_1
gene_2  count_2 count_2 count_2
gene_5  count_3 count_3 count_3
gene_6  count_4 count_4 count_4

$ cat test2.txt 
gene_name   start   end gene_length
gene_1  start_1 end_1   length_1
gene_2  start_2 end_2   length_2
gene_3  start_3 end_3   length_3
gene_4  start_4 end_4   length_4
gene_5  start_5 end_5   length_5
gene_6  start_6 end_6   length_6
gene_7  start_7 end_7   length_7
ADD COMMENT
0
Entering edit mode
7.0 years ago
Charles Plessy ★ 2.9k

If your data is not too large, a Unix shell command may be enough:

Here I create files containing your example data, assuming you really meant that it is space-separated.

cat > counts.txt <<_END_
gene_name sample_1 sample_2 sample_3
gene_1 count_1count_1 count_1
gene_2 count_2 count_2 count_2
gene_5 count_3 count_3 count_3
gene_6 count_4 count_4 count_4
_END_

cat > lengths.txt <<_END_
gene_name start end gene_length
gene_1 start_1 end_1 length_1
gene_2 start_2 end_2 length_2
gene_3 start_3 end_3 length_3
gene_4 start_4 end_4 length_4
gene_5 start_5 end_5 length_5
gene_6 start_6 end_6 length_6
gene_7 start_7 end_7 length_7
_END_

_(By the way, providing example data in such a way facilitates the task of those who try to answer your question. Have a try next time !)_

Then, I drop the first (header) line of your first table (sed 1d), I cut the first column, I pass it to grep via the standard input as a list of patterns to match (--file -), I specify that the matches must cover the whole word (otherwise, gene_1 would also match gene_12, etc.), I filter the second table and cut its third column, which contains the lengths.

sed 1d counts.txt |
  cut -f1 -d ' ' |
  grep --file - --word-regexp lengths.txt |
  cut -f4 -d ' '

Beware that it will all break apart if there are variations in the format of your tables.

ADD COMMENT
0
Entering edit mode

Dear Charles,

Thanks for your answer. My data is tab-separated, but I shall try the suggestion and update you.

-edit-

I tried the suggestion, but wc -l file_a gives 24424, while wc -l of the output gives 23133 lines.

file_b was downloaded from ucsc (known_gene, latest version). Could there be entries missing in it?

Thanks.

ADD REPLY
0
Entering edit mode

You can also try following command

awk '{print $1 }' file_a | grep --file - file_b | awk '{print $1"\t"$NF}'

ADD REPLY
0
Entering edit mode

Dear Suraj,

Thanks for the suggestion, but could you please explain what the command does, because wc -l gave me 36184, while wc -l file_a gives 24424.

ADD REPLY
0
Entering edit mode

Dear vinayjrao, Can you please upload the part of file and share link with me.

ADD REPLY
0
Entering edit mode

Do you see the same numbers when following the very good answer of cpad0112 ?

ADD REPLY

Login before adding your answer.

Traffic: 2754 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6