Question

A script for extracting information related to a list of gene names from a file

8.8 years ago

ahmad.iut ▴ 90

Dear Biostars,

I have a text file containing several rows and columns like this:

"Gene Name"      "Gene Id"       "description"         "GO"
A                1               phosphatase           GO:001256
B                2               synthesize            GO:013154
C                3               methylase             GO:000054
D                4               kinase                GO:001254
E                5               oxigenase             GO:001354
F                6               synthesize            GO:001254

In addition, I have another text file just containing one column and several rows like this:

Gene Name
A
D
C
B

I need to extract the rows of file 1 that contain gene names listed in file 2.

Does anybody have any idea how to do that?

PS: I know how to do this in Excel, but it does not work with a very large number of rows.

Thank you

RNA-Seq data-mining gene • 6.3k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by ahmad.iut ▴ 90

Ram · Accepted Answer · 2016-02-09

8.8 years ago

Pierre Lindenbaum 164k

Using Linux:

join -1 1 -2 1 <(sort -k1,1 file1.txt) <(sort -k1,1 file2.txt)  > joined.txt
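A note on the one-liner above: by default `join` splits fields on runs of blanks, which is fine for the sample data but would break a description that contains a space. The following is a minimal sketch, assuming the files are tab-separated; the file names and sample rows are made up for illustration. It avoids process substitution, so it should also run in a plain POSIX shell.

```shell
# Hypothetical sample data, tab-separated; one description contains a space.
TAB=$(printf '\t')
printf 'A\t1\tprotein phosphatase\tGO:001256\n' >  genes.tsv
printf 'C\t3\tmethylase\tGO:000054\n'           >> genes.tsv
printf 'B\t2\tsynthesize\tGO:013154\n'          >> genes.tsv
printf 'C\nA\n' > wanted.txt

# join needs both inputs sorted on the join field, with the same collation.
LC_ALL=C sort -t "$TAB" -k1,1 genes.tsv > genes.sorted.tsv
LC_ALL=C sort wanted.txt                > wanted.sorted.txt

# -t sets the field separator to a tab, so multi-word descriptions stay in one field.
LC_ALL=C join -t "$TAB" -1 1 -2 1 genes.sorted.tsv wanted.sorted.txt > joined.tsv
cat joined.tsv
```

Only the rows for genes A and C are written to `joined.tsv`; the row for B is dropped because B is not in the list.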

Using knime.org:

Load both files (Read File) into two tables and join them using a "Join" node: https://www.knime.org/files/nodedetails/_manipulation_column_column_split_combine_Joiner.html

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Pierre Lindenbaum 164k

Dear Lindenbaum,

The command worked perfectly. Thank you very much

ADD REPLY • link 8.8 years ago by ahmad.iut ▴ 90

Thanks a lot, this is also very useful for my problem. Just one question: is it possible to include the header as well? Adding --header is not working.

ADD REPLY • link 2.5 years ago by User000 ▴ 710

> adding --header is not working.

It should work. Check the order of the input files, and check that the header is the very first line of both files.

ADD REPLY • link 2.5 years ago by Pierre Lindenbaum 164k

Actually you are right, it works, but the header ends up in the middle, not at the top. Is there a way to keep it at the top? Thanks.

ADD REPLY • link 2.5 years ago by User000 ▴ 710

Use sed to change the header into something that will sort to the top. Something like 's/^chromosome/00000chromosome/'
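Another way to keep the header first, sketched here as an alternative to renaming it: take the header line off before sorting, join only the data rows, and write the header back on top. This assumes tab-separated files with exactly one header line each; the file names and sample rows are made up for illustration.

```shell
# Hypothetical sample data with a one-line header in each file.
TAB=$(printf '\t')
printf 'Gene Name\tGene Id\tdescription\tGO\n' >  table.tsv
printf 'D\t4\tkinase\tGO:001254\n'             >> table.tsv
printf 'A\t1\tphosphatase\tGO:001256\n'        >> table.tsv
printf 'C\t3\tmethylase\tGO:000054\n'          >> table.tsv
printf 'Gene Name\nA\nD\n' > names.txt

# Sort only the data rows (everything after line 1) of each file.
tail -n +2 table.tsv | LC_ALL=C sort -t "$TAB" -k1,1 > table.body.tsv
tail -n +2 names.txt | LC_ALL=C sort                 > names.body.txt

# Write the header first, then append the joined data rows.
{
  head -n 1 table.tsv
  LC_ALL=C join -t "$TAB" table.body.tsv names.body.txt
} > joined_with_header.tsv
cat joined_with_header.tsv
```

Because the header never goes through `sort`, it cannot end up in the middle of the output, whatever the column is called.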

ADD REPLY • link 2.5 years ago by Pierre Lindenbaum 164k

8.8 years ago

Benn 8.3k

You can do it with R; the subset function works pretty intuitively.

ADD COMMENT • link 8.8 years ago by Benn 8.3k

In R:

file1<-read.table("file1.txt", sep="\t", header=T)
file2<-read.table("file2.txt", sep="\t", header=T)
Selection<-file1[file1$"Gene name" %in% file2$"Gene Name",]

You don't even have to use the subset function.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Benn 8.3k

Dear Nota,

Thank you for your answer. These commands in R just gave me the headers:

Gene.Name   Gene.Id     description GO

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by ahmad.iut ▴ 90

OK, R replaces the spaces in the header with dots.

So you can use:

Selection<-file1[file1$Gene.Name %in% file2$Gene.Name,]

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Benn 8.3k

Thank you so much Nota, it worked well. The problem was the spaces in the headers (like Gene Name).

ADD REPLY • link 8.8 years ago by ahmad.iut ▴ 90