Question

Extract a word from inside the text

0


6.1 years ago

mostafarafiepour ▴ 180

Hi all,

I have a text file like the one shown below, and I want to extract the gene names.

For example:

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

from the input below:

ID=id18056;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule
ID=id18065;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=TAGLN3;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=SMG6;product=nectin-3;protein
ID=id18057;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=ERICH1;product=nectin cell adhesion molecule
ID=id18066;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=DLGAP2;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=PPP2R2B;product=nectin-3;protein

What is the best way to do this?

awk R regex • 2.4k views

ADD COMMENT • link updated 6.1 years ago by cpad0112 21k • written 6.1 years ago by mostafarafiepour ▴ 180

0


6.1 years ago

cpad0112 21k

$ sed 's/.*gene=\(\w\+\);.*/\1/g' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
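
One caveat with the sed substitution above: a line that has no gene= attribute does not match the pattern and is printed unchanged. A minimal sketch that prints only the lines where the substitution succeeded, using sed -n together with the p flag (the function name genes_sed is made up for illustration, and the second sample line is an assumed example of a line without a gene attribute):

```shell
# Hypothetical helper: print gene names only for lines that contain ;gene=
genes_sed() {
    # -n suppresses default printing; the trailing p prints only substituted lines
    sed -n 's/.*;gene=\([^;]*\).*/\1/p' "$@"
}

# The second line has no gene= attribute and is therefore skipped
printf '%s\n' \
    'ID=id1;gbkey=mRNA;gene=TAGLN3;product=x' \
    'ID=id2;gbkey=region' | genes_sed
```

This sticks to basic regular expressions, so it does not need -r.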

ADD COMMENT • link 6.1 years ago by cpad0112 21k

1


with awk:

$ awk -F'gene=|;prod' '{print $2}' test.txt

or

$ awk 'gsub(/.*gene=|;product.*/,"")' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
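
Both awk commands above rely on product= following directly after the gene= attribute. A sketch that does not depend on attribute order, assuming the attributes are always separated by ; as in the question (extract_genes is a hypothetical name):

```shell
# Hypothetical helper: print the value of every gene= attribute, one per line.
extract_genes() {
    awk -F';' '{
        for (i = 1; i <= NF; i++)        # walk over the ;-separated attributes
            if ($i ~ /^gene=/)           # keep only the gene attribute
                print substr($i, 6)      # drop the 5-character "gene=" prefix
    }' "$@"
}

# Example with two lines shaped like the input in the question
printf '%s\n' \
    'ID=id1;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule' \
    'ID=id2;gene=SMG6;gbkey=CDS' | extract_genes
```

Called with a file name (extract_genes test.txt) it reads that file instead of standard input.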

ADD REPLY • link 6.1 years ago by cpad0112 21k

0


sed -r is your friend :-)

sed -r 's/.*gene=(\w+);.*/\1/g' test.txt

Although you may wish to add a ; before gene and omit the .* after the second ; :-)

ADD REPLY • link 6.1 years ago by Ram 44k

1


I think you have posted about the -r option several times, and I keep forgetting to use it. Thanks, RamRS.

ADD REPLY • link 6.1 years ago by cpad0112 21k

0


Not several, maybe just once more. Once you go -r, you never go back. It's like grep -E. So handy and convenient, makes you wonder why plain grep even exists :-)

ADD REPLY • link 6.1 years ago by Ram 44k

zx8754 · Accepted Answer · 2018-11-16

3


6.1 years ago

Pierre Lindenbaum 164k

To extract gene names:

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 
NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

To extract unique gene names

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
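
As a small variation, sort -u gives the same result as sort | uniq in a single step. A sketch assuming the ;-separated format from the question (the function name unique_genes and the demo file are made up for illustration):

```shell
# Hypothetical helper: list each gene name in the file "$1" exactly once.
unique_genes() {
    tr ';' '\n' < "$1" | grep '^gene=' | cut -d '=' -f2 | sort -u
}

# Small demo file in which SMG6 appears twice
demo=$(mktemp)
printf '%s\n' \
    'ID=id1;gbkey=mRNA;gene=SMG6;product=x' \
    'ID=id2;gbkey=CDS;gene=ERICH1;product=y' \
    'ID=id3;gbkey=CDS;gene=SMG6;product=z' > "$demo"

unique_genes "$demo"
rm -f "$demo"
```

Note that the result is sorted alphabetically, so the original order of the genes in the file is lost.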

ADD COMMENT • link updated 6.1 years ago by zx8754 12k • written 6.1 years ago by Pierre Lindenbaum 164k

0


Many thanks for all the answers.

They were all great.

ADD REPLY • link 6.1 years ago by mostafarafiepour ▴ 180

0


Now I've extracted the gene names, but there is a problem: a gene may appear at several positions, so its name is repeated several times.

Any suggestions?

ADD REPLY • link 6.1 years ago by mostafarafiepour ▴ 180

0


sort | uniq


ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 164k

0


Sorry, how do I use sort | uniq?

Do you mean I should add it to the previous script, like this?

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 sort | uniq

ADD REPLY • link 6.1 years ago by mostafarafiepour ▴ 180

4


mostafarafiepour, with all due respect: Invest time and search for these absolutely basic answers yourself. This is a bioinformatics Q&A community, intended to help with bioinformatics-related problems, not a basic Unix learning platform. You are lucky people actually answer these kinds of questions. Again, with respect, but if you are already stuck with these most simple things, I am worried that you will run into some severe trouble once analysis gets beyond executing basic Unix scripts. Learn the basics first, plenty of open-source material online on this.

ADD REPLY • link 6.1 years ago by ATpoint 86k

score 1 · Accepted Answer · 2018-11-16

1


6.1 years ago

ahmad mousavi ▴ 800

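Hi,

Use this code:

# suppose df holds your lines as a character vector (e.g. read with readLines())
df <- gsub(".*gene=", "", df)  # remove everything up to and including gene=
df <- gsub("[*].*", "", df)    # remove everything from the first * onward

Or split the lines using ** as the delimiter.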

ADD COMMENT • link 6.1 years ago by ahmad mousavi ▴ 800

0


I modified the text file. There is no longer a ** before and after the gene name.

ADD REPLY • link 6.1 years ago by mostafarafiepour ▴ 180

score 1 · Accepted Answer · 2018-11-16

1


6.1 years ago

lakhujanivijay 5.9k

Super fast and easy with grep and a regular expression:

grep -P '(?<=\*\*gene=)\w+(?=\*\*)' -o gene.txt

where gene.txt is your file name

Output

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

Explanation

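-P Interpret the pattern as a Perl-compatible regular expression (PCRE)

(?<=...) Positive lookbehind: the match must be preceded by ..., which is not included in the output

(?=...) Positive lookahead: the match must be followed by ..., which is not included in the output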

-o Output only what matched
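
The -P option depends on PCRE support, which not every grep build includes. A rough equivalent for the ;-separated input shown in the question, using only plain grep -o and cut, and assuming gene names never contain a ; (genes_portable is a hypothetical name):

```shell
# Hypothetical helper: extract gene names without relying on grep -P.
genes_portable() {
    # -h drops file-name prefixes, -o prints only the matched "gene=NAME" part
    grep -h -o 'gene=[^;]*' "$@" | cut -d '=' -f2
}

printf '%s\n' \
    'ID=cds1149;gbkey=CDS;gene=PPP2R2B;product=nectin-3;protein' | genes_portable
```

Without lookarounds the gene= prefix is part of the match, which is why cut is needed to strip it afterwards.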

ADD COMMENT • link 6.1 years ago by lakhujanivijay 5.9k

0


Excuse me, what is the input file? You only specify the output.

ADD REPLY • link 6.1 years ago by mostafarafiepour ▴ 180

0


gene.txt is the input file. The output is written to standard output (stdout).

ADD REPLY • link 6.1 years ago by lakhujanivijay 5.9k

0


~~Why do you have the ** in the positive lookbehind assertion?~~ And why do you need the positive lookahead assertion?

grep -oP "(?<=gene=)[^;]+" will suffice, no?

EDIT: cpad is correct, we don't even need the [^;], this will suffice: grep -oP '(?<=;gene=)\w+'

EDIT2: Turns out OP changed data after posting a snippet with the **.

ADD REPLY • link 6.1 years ago by Ram 44k

0


I think your script should be changed like this:

grep -P '(?<=\;gene=)\w+(?=\;)' -o gene.txt

ADD REPLY • link 6.1 years ago by mostafarafiepour ▴ 180

0


I get the single quotes and the inclusion of a semi-colon to account for other attributes that may end in gene=, but why include a positive lookahead for a semi-colon?

Also, grep -oP <pattern> <file> is equivalent to grep -P <pattern> -o <file>, as neither -o nor -P is a positional argument.
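
To illustrate the point about other attributes that end in gene=, here is a small sketch with a made-up attribute called pseudogene= (it does not occur in the question's data and is only an assumption for this demo). It needs a grep that supports -P:

```shell
# Made-up line: "pseudogene=" is a hypothetical attribute used only for this demo
line='ID=id1;pseudogene=FAKE1;gene=SMG6;product=x'

# Without the leading ";" the lookbehind also matches inside "pseudogene="
printf '%s\n' "$line" | grep -oP '(?<=gene=)\w+'

# With the leading ";" only the real gene attribute matches
printf '%s\n' "$line" | grep -oP '(?<=;gene=)\w+'
```

The trade-off is that the ;gene= form would miss a gene= attribute sitting at the very start of a line, which does not happen in the sample input.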

ADD REPLY • link 6.1 years ago by Ram 44k

0


or this : grep -oP "(?<=gene=)\w+" test.txt ?

ADD REPLY • link 6.1 years ago by cpad0112 21k

0


Or yes, this. I'd forgotten that \w does not match ;. Thanks, cpad!

ADD REPLY • link 6.1 years ago by Ram 44k