Extract gene names from GOrilla output file
4.4 years ago
rrapopor ▴ 40

Hello, I have a file with GOrilla results, but I don't have the input data used to create it. This is the file:

GO Term Description     P-value FDR q-value     Enrichment      N       B       n       b       Genes
GO:0001580      detection of chemical stimulus involved in sensory perception of bitter taste   4.40E-20        6.67E-16   10.14   20504   48      1053    25      [Tas2r105  -  taste receptor, type 2, member 105, Tas2r119  -  taste receptor, type 2, member 119, Tas2r136  -  taste receptor, type 2, member 136, Tas2r122  -  taste receptor, type 2, member 122, Tas2r117  -  taste receptor, type 2, member 117, Tas2r123  -  taste receptor, type 2, member 123, Tas2r115  -  taste receptor, type 2, member 115, Tas2r129  -  taste receptor, type 2, member 129, Tas2r130  -  taste receptor, type 2, member 130, Tas2r125  -  taste receptor, type 2, member 125, Tas2r121  -  taste receptor, type 2, member 121, Tas2r124  -  taste receptor, type 2, member 124, Tas2r120  -  taste receptor, type 2, member 120, Tas2r113  -  taste receptor, type 2, member 113, Tas2r114  -  taste receptor, type 2, member 114, Tas2r109  -  taste receptor, type 2, member 109, Tas2r110  -  taste receptor, type 2, member 110, Tas2r106  -  taste receptor, type 2, member 106, Tas2r107  -  taste receptor, type 2, member 107, Tas2r102  -  taste receptor, type 2, member 102, Tas2r116  -  taste receptor, type 2, member 116, Tas2r104  -  taste receptor, type 2, member 104, Tas2r103  -  taste receptor, type 2, member 103, Tas2r131  -  taste receptor, type 2, member 131, Tas2r140  -  taste receptor, type 2, member 140]

I want a list containing only the gene names, extracted from the brackets in the 11th column. For example:

Tas2r105
Tas2r119
Tas2r117
...

I tried this code:

awk -F'[][]' '{print $2}' gorilla_master_nox.txt | grep -oP '(?<=,).*?(?=-)'

But I don't get the desired results. I would appreciate any help.

Thank you

gO gene sed awk grep • 1.3k views

If the gene nomenclature pattern is fixed, you can use:

$ grep -Po '\w{3}\d\w\d{3}' test.txt 

Tas2r105
Tas2r119
Tas2r136
Tas2r122
Tas2r117
...

If you want to tighten the expression, you can use [A-Z][a-z]{2}[0-9][a-z][0-9]{3}. However, if you are not sure the pattern occurs only in gene names, restrict the search to the gene column first: cut -f11 test.txt | grep -Po '[A-Z][a-z]{2}[0-9][a-z][0-9]{3}'
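A minimal sketch of the same idea that does not depend on a hard-coded column number: it looks up the column titled Genes in the header row and then prints whatever precedes each double space, hyphen, double space separator. The sample file, the function name extract_genes, and the assumption that the file is tab-separated with a header row are illustrative and not taken from the real gorilla_master_nox.txt. Going by the header shown in the question, Genes looks like the tenth tab-separated field, which may be why cut -f11 returns nothing, but that is a guess without the actual file.

```shell
# Hypothetical sample mimicking the layout in the question
# (tab-separated, shortened gene list).
sample=$(mktemp)
printf 'GO Term\tDescription\tP-value\tFDR q-value\tEnrichment\tN\tB\tn\tb\tGenes\n' > "$sample"
printf 'GO:0001580\tbitter taste\t4.40E-20\t6.67E-16\t10.14\t20504\t48\t1053\t25\t[Tas2r105  -  taste receptor, type 2, member 105, Tas2r119  -  taste receptor, type 2, member 119]\n' >> "$sample"

# extract_genes FILE: print one gene symbol per line from the "Genes" column.
extract_genes() {
  awk -F'\t' '
    NR == 1 {
      # locate the Genes column by its header name
      for (i = 1; i <= NF; i++) if ($i == "Genes") col = i
      next
    }
    col {
      s = $col
      gsub(/^\[/, "", s)            # drop the opening bracket
      gsub(/\]$/, "", s)            # drop the closing bracket
      # a symbol is the run of non-space, non-comma characters that sits
      # directly before "space space hyphen space space"
      while (match(s, /[^ ,]+[ ][ ]-[ ][ ]/)) {
        print substr(s, RSTART, RLENGTH - 5)   # strip the 5-character separator
        s = substr(s, RSTART + RLENGTH)        # continue after this match
      }
    }
  ' "$1"
}

extract_genes "$sample"
```

On the sample above this prints Tas2r105 and Tas2r119, one per line. Pipe the output through sort -u if duplicates across GO terms should be collapsed.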


Thanks! The gene nomenclature pattern is not fixed, and cut -f11 didn't work.

4.4 years ago

Here are a grep (with Perl regex) solution and a pure Perl solution, if you want to get all genes within the brackets on each line. I sort the results and keep only unique values using sort -u, so feel free to remove that part if you don't want that behavior.

grep with Perl enabled.

grep -oP '\w+(?=\s{2}-)' gorilla_master_nox.txt | sort -u

A (marginally slower) Perl one-liner.

perl -nle 'print "$1" while /(\w+)\s{2}-/g' gorilla_master_nox.txt | sort -u

The first 5 genes of the results using either method.

Tas2r102
Tas2r103
Tas2r104
Tas2r105
Tas2r106

I am aware that the first part of the grep in the OP tightens the expression. However, the sequence double space, hyphen, double space might be unique in the file. Given that, the grep expression can be modified:

$ grep -oP '\w+(?=\s{2}-\s{2})' test.txt 
Tas2r105
Tas2r119
Tas2r136
Tas2r122
Tas2r117
Tas2r123
...

If it is not unique, we can use cut -f11 test.txt | grep -oP '\w+(?=\s{2}-\s{2})'


Thanks. To shorten it even more: double space-hyphen-double space gave me the same results as double space-hyphen on the example data, so I used the latter.


Thank you! For the entire file I got different results from the grep and the Perl solutions. The Perl solution worked perfectly for me :)
