Extract a word from inside the text
6.0 years ago

Hi all,

I have a text file like the one below, and I want to extract the gene names.

for example:

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

from below input:

ID=id18056;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule
ID=id18065;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=TAGLN3;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=SMG6;product=nectin-3;protein
ID=id18057;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=ERICH1;product=nectin cell adhesion molecule
ID=id18066;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=DLGAP2;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=PPP2R2B;product=nectin-3;protein

What is the best way to do this?

awk R regex • 2.3k views
ADD COMMENT
6.0 years ago

To extract gene names:

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 
NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

To extract unique gene names

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
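
If it also helps to know how often each gene occurs, the same pipeline can be extended with uniq -c. This is only a rough sketch: the function name count_genes is made up for illustration, and it assumes the ;-separated attribute format shown in the question.

```shell
# Hypothetical helper (name is illustrative): count how often each gene occurs.
# Assumes attributes are separated by ';' and the gene field looks like gene=NAME.
count_genes() {
    tr ';' '\n' < "$1" |
        grep '^gene=' |
        cut -d '=' -f2 |
        sort |
        uniq -c |
        awk '{print $2 "\t" $1}'   # reorder to NAME<TAB>COUNT
}

# Usage: count_genes input.txt
```

The final awk step only normalises the uniq -c output into a tab-separated name and count, since the padding uniq -c prints can differ between systems.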
ADD COMMENT

Many thanks for all the answers.

All answers were great

ADD REPLY

I have now extracted the gene names, but there is a problem: a gene may appear at several positions, so its name is repeated several times.

Any suggestions?

ADD REPLY
sort | uniq

.
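
For anyone who wants the duplicates removed but the original order of the genes kept, sort | uniq will not do that, because sort reorders the lines. A minimal sketch using a common awk idiom is below. The function name unique_genes_in_order is hypothetical, and the ;-delimited format from the question is assumed.

```shell
# Hypothetical helper (name is illustrative): print each gene once,
# in order of first appearance instead of sorted order.
unique_genes_in_order() {
    tr ';' '\n' < "$1" |
        grep '^gene=' |
        cut -d '=' -f2 |
        awk '!seen[$0]++'   # true only the first time a given line is seen
}

# Usage: unique_genes_in_order input.txt
```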

ADD REPLY

Sorry, how do I use sort | uniq?

Do you mean to add it to the previous script?

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
ADD REPLY

mostafarafiepour, with all due respect: Invest time and search for these absolutely basic answers yourself. This is a bioinformatics Q&A community, intended to help with bioinformatics-related problems, not a basic Unix learning platform. You are lucky people actually answer these kinds of questions. Again, with respect, but if you are already stuck with these most simple things, I am worried that you will run into some severe trouble once analysis gets beyond executing basic Unix scripts. Learn the basics first, plenty of open-source material online on this.

ADD REPLY
6.0 years ago
ahmad mousavi ▴ 800

Hi

Use this code:

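# suppose df is a character vector holding the lines of your table
df <- gsub(".*gene=", "", df)   # drop everything up to and including "gene="
df <- gsub("[*].*", "", df)     # drop everything from the first * onwards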

Or split the lines using ** as the delimiter.

ADD COMMENT

I have modified the text file. The gene field is no longer surrounded by **.

ADD REPLY
6.0 years ago

This is fast and easy with grep and a regular expression:

grep -P '(?<=\*\*gene=)\w+(?=\*\*)' -o gene.txt

where gene.txt is your file name

Output

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

Explanation

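-P Interpret the pattern as a Perl-compatible regular expression (PCRE), which is what makes lookarounds available

(?<=...) Positive lookbehind: the match must be preceded by this text, which is not part of the output

(?=...) Positive lookahead: the match must be followed by this text, which is not part of the output

-o Output only what matched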
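
Not every grep supports -P (the BSD grep shipped with macOS does not, for example). A rough equivalent without lookarounds is sketched below: it matches the whole ;gene=NAME piece with plain grep -o and then strips the key with cut. The function name genes_no_pcre is made up, and the sketch assumes the ;-delimited version of the data with gene= never being the first attribute on a line.

```shell
# Hypothetical helper (name is illustrative): extract gene names without PCRE.
# 'grep -o' prints only the matched text, e.g. ";gene=SMG6";
# 'cut' then keeps everything after the first '='.
# The leading ';' stops the match from starting inside another key.
genes_no_pcre() {
    grep -o ';gene=[^;]*' "$1" | cut -d '=' -f2-
}

# Usage: genes_no_pcre gene.txt
```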

ADD COMMENT

Excuse me, what is the input file? You only specify the output.

ADD REPLY

gene.txt is the input file. The output is written to standard output (stdout).

ADD REPLY

Why do you have the ** in the positive lookbehind assertion? And why do you need the positive lookahead assertion?

grep -oP "(?<=gene=)[^;]+" will suffice, no?

EDIT: cpad is correct, we don't even need the [^;], this will suffice: grep -oP '(?<=;gene=)\w+'

EDIT2: Turns out OP changed data after posting a snippet with the **.

ADD REPLY

I think your command should be changed like this:

grep -P '(?<=\;gene=)\w+(?=\;)' -o gene.txt
ADD REPLY

I get the single quotes and the inclusion of a semi-colon to account for other attributes that may end in gene=, but why include a positive lookahead for a semi-colon?

Also, grep -oP <pattern> <file> is equivalent to grep -P <pattern> -o <file>, as neither -o nor -P is a positional argument.

ADD REPLY

Or this: grep -oP "(?<=gene=)\w+" test.txt ?

ADD REPLY

Or yes, this. I'd forgotten that \w does not match ;. Thanks, cpad!

ADD REPLY
6.0 years ago
$ sed 's/.*gene=\(\w\+\);.*/\1/g' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
ADD COMMENT

with awk:

$ awk -F'gene=|;prod' '{print $2}' test.txt

or

$ awk 'gsub(/.*gene=|;product.*/,"")' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
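
Both awk one-liners above rely on product coming right after gene. If the attribute order is not guaranteed, a slightly longer sketch is to split on ; and look for the gene key explicitly. The function name genes_by_key is hypothetical, and the sketch assumes that attributes never contain a literal ; inside their values.

```shell
# Hypothetical helper (name is illustrative): scan every ';'-separated
# attribute and print the value of the one that starts with "gene=".
# Lines without a gene attribute produce no output.
genes_by_key() {
    awk -F';' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^gene=/)
                print substr($i, 6)   # skip the 5 characters of "gene="
    }' "$1"
}

# Usage: genes_by_key test.txt
```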
ADD REPLY

sed -r is your friend :-)

sed -r 's/.*gene=(\w+);.*/\1/g' test.txt

Although you may wish to add a ; before gene and omit the .* after the second ; :-)
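
A sketch of what that variant could look like is below. It adds the ; before gene, and it uses -n with the p flag so that lines without a gene attribute are skipped instead of being printed unchanged. It also uses -E rather than -r, since -E is accepted by both GNU and BSD sed. The function name genes_sed is made up for illustration.

```shell
# Hypothetical helper (name is illustrative): print only the gene name
# from lines that contain a ";gene=NAME" attribute.
#   -n  suppress default printing
#   -E  extended regular expressions (same syntax as GNU sed -r)
#   p   print only lines where the substitution succeeded
genes_sed() {
    sed -n -E 's/.*;gene=([^;]*).*/\1/p' "$1"
}

# Usage: genes_sed test.txt
```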

ADD REPLY

I think you have posted about the -r option several times, and I keep forgetting to use it. Thanks, RamRS.

ADD REPLY

Not several, maybe just once more. Once you go -r, you never go back. It's like grep -E. So handy and convenient, makes you wonder why plain grep even exists :-)

ADD REPLY
