A script for extracting information related to a list of gene names from a file
8.8 years ago
ahmad.iut ▴ 90

Dear Biostars,

I have a text file containing several rows and columns like this:

"Gene Name"      "Gene Id"       "description"         "GO"
A                1               phosphatase           GO:001256
B                2               synthesize            GO:013154
C                3               methylase             GO:000054
D                4               kinase                GO:001254
E                5               oxigenase             GO:001354
F                6               synthesize            GO:001254

In addition, I have another text file just containing one column and several rows like this:

Gene Name
A
D
C
B

I need to extract the rows of file 1 that contain gene names listed in file 2.

Does anybody have any idea how to do that?

PS: I know how to do this in Excel, but it does not work with a very large number of rows.

Thank you

RNA-Seq data-mining gene • 6.3k views
8.8 years ago

Using Linux:

join -1 1 -2 1 <(sort -k1,1 file1.txt) <(sort -k1,1 file2.txt)  > joined.txt
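As a quick check, the join approach can be tried on the sample data from the question. This is only a sketch under a few assumptions: the files are tab-separated, temporary files stand in for the process substitution so that it runs in a plain POSIX shell, and LC_ALL=C is set so that sort and join agree on ordering. The temporary directory is illustrative.

```shell
#!/bin/sh
export LC_ALL=C                 # sort and join must use the same collation
tab=$(printf '\t')
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data from the question, written as tab-separated text (assumption).
printf '%s\t%s\t%s\t%s\n' \
  'Gene Name' 'Gene Id' description GO \
  A 1 phosphatase GO:001256 \
  B 2 synthesize GO:013154 \
  C 3 methylase GO:000054 \
  D 4 kinase GO:001254 \
  E 5 oxigenase GO:001354 \
  F 6 synthesize GO:001254 > file1.txt
printf '%s\n' 'Gene Name' A D C B > file2.txt

# Sort both files on the first column, then join on that column.
sort -t "$tab" -k1,1 file1.txt > file1.sorted
sort -t "$tab" -k1,1 file2.txt > file2.sorted
join -t "$tab" -1 1 -2 1 file1.sorted file2.sorted > joined.txt

cat joined.txt
```

The rows for A to D come out in sorted order rather than in the order of file2.txt. Because the header line is sorted like any other row, it ends up after the data rows in this run.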

Using knime.org:

Load both files (Read File) into two tables and join them using a "Joiner" node: https://www.knime.org/files/nodedetails/_manipulation_column_column_split_combine_Joiner.html

Dear Lindenbaum,

The command worked perfectly. Thank you very much.

Thanks a lot, this is also very useful for my problem. Just one question: is it possible to include the header as well? Adding --header is not working.

"adding --header is not working."

It should work. Check the order of the input files, and check that the header is the very first line of both files.

Actually you are right, it works, but the header is in the middle, not at the top. Is there a way to keep it at the top? Thanks.

Use sed to change the header into something that sorts to the top. Something like 's/^chromosome/00000chromosome/'.
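The prefix trick can be sketched as follows. It assumes tab-separated files whose header starts with "Gene Name" (as in the original question, rather than "chromosome"), GNU join for the --header option, and LC_ALL=C so that a leading "0" sorts before the gene names. The 00000 prefix is arbitrary and would not help if a gene name sorted before it.

```shell
#!/bin/sh
export LC_ALL=C                 # in the C locale, "0" sorts before letters
tab=$(printf '\t')
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data modeled on the question, tab-separated (assumption).
printf '%s\t%s\t%s\t%s\n' \
  'Gene Name' 'Gene Id' description GO \
  A 1 phosphatase GO:001256 \
  B 2 synthesize GO:013154 \
  C 3 methylase GO:000054 \
  D 4 kinase GO:001254 \
  E 5 oxigenase GO:001354 \
  F 6 synthesize GO:001254 > file1.txt
printf '%s\n' 'Gene Name' A D C B > file2.txt

# Prefix the header so it sorts first, then sort each file on column 1.
sed 's/^Gene Name/00000Gene Name/' file1.txt | sort -t "$tab" -k1,1 > file1.sorted
sed 's/^Gene Name/00000Gene Name/' file2.txt | sort -t "$tab" -k1,1 > file2.sorted

# Join with the header on the first line, then strip the prefix again.
join --header -t "$tab" file1.sorted file2.sorted |
  sed '1s/^00000//' > joined.txt

cat joined.txt
```

With this data, joined.txt starts with the original header line, followed by the rows for A, B, C and D.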

8.8 years ago
Benn 8.3k

You can do it with R; the subset function works pretty intuitively.

In R:

file1<-read.table("file1.txt", sep="\t", header=T)
file2<-read.table("file2.txt", sep="\t", header=T)
Selection<-file1[file1$"Gene name" %in% file2$"Gene Name",]

You don't even have to use the subset function.

Dear Nota,

Thank you for your answer. These commands in R just gave me the headers:

Gene.Name   Gene.Id     description GO

OK, R replaces the spaces in the header with dots.

So you can use:

Selection<-file1[file1$Gene.Name %in% file2$Gene.Name,]

Thank you so much, Nota. It worked well. The problem was the spaces in the headers (like Gene Name).
