Question

extract data from txt files

0

Entering edit mode

9.2 years ago

fufuyou ▴ 110

I want to extract gene expression values from a txt file based on gene names.

For example:

I want to extract A and C gene expression values using gene name file extract it.

A 1500
C 3500

I have written some perl code, but it is not work well. Could you help me modify it or give me a new code?

Thanks,

#!/usr/bin/perl -w
use strict;
open(IN,"file1");
open(NAME,"file2");
open(OUT,">result");
while(<IN>){
  chomp;
  @data=split / /,$_;
  $hash{$data[0]}=$data[1];
}
while(<NAME>){
    chomp;
    print OUT "$_ $hash{$_}\n";
}
close IN;
close NAME;
close OUT;

RNA-Seq • 3.6k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by fufuyou ▴ 110

4

Entering edit mode

you are reinventing the wheel: the solution is sorting both files and using linux join: http://linux.die.net/man/1/join

ADD REPLY • link 9.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks Pierre Lindenbaum and Zaag,

What is wrong of my perl code?

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.2 years ago by fufuyou ▴ 110

1

Entering edit mode

Please note that I vote for Pierre's answer! Nonetheless, I will try to shed some light on your programming experiments.

I am not sure, that

@data=split / /,$_;

works as you expect. I believe / / really tries to match $_ against the regular expression containing a single space. Using some parenthesis for your code, one gets:

@data=split( $_ =~ / /, $_ );

The reg-ex matching returns the result by means of 1 and '' (or something similar) which evaluate to boolean TRUE and FALSE in Perl. Consequently, Perl will evaluate the statement to:

@data=split('', $_);

or

@data=split('1', $_);

depending on the content of $_, i.e., does $_ contain a single space. Finally, the line will be splitted by '1' or '' which is not what you want, right?!

Tip 1:

Try to debug your script using Data::Dumper but be aware that you might get a long list of results:

#!/usr/bin/perl -w
use strict;
use Data::Dumper;

open(IN,"file1");
open(NAME,"file2");
open(OUT,">result");
while(<IN>){
  chomp;
  @data=split / /,$_;
  $hash{$data[0]}=$data[1];
}
print Dumper(\%hash);
while(<NAME>){
    chomp;
    print OUT "$_ $hash{$_}\n";
}
close IN;
close NAME;
close OUT;

Tip 2:

If you really want to do it by yourself, think about loading the file into memory which is smaller (I guess these are the names). Applying your script to very large data sets may eventually crash your computer.

Tip 3:

Really try to go for Pierre's solution ;-)

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.2 years ago by Manuel Landesfeind ★ 1.4k

0

Entering edit mode

Hello fufuyou!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=62682

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 9.2 years ago by Pierre Lindenbaum 164k

Ram · Answer 1 · 2015-09-14

0

Entering edit mode

9.2 years ago

Zaag ▴ 870

In the shell:

for i in `cat file2` ;
do awk '$1eq"$i" {print } ' file1 >> result ;
done ;

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.2 years ago by Zaag ▴ 870

0

Entering edit mode

Using $i like this in awk looks like a bad idea. If both file1 and file2 have too many lines, your solution will take a long time.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

True. Didn't know about linux's join command so thanks for that link.

ADD REPLY • link 9.2 years ago by Zaag ▴ 870

0

Entering edit mode

I use this shell only get the name. I cannot get the gene expression values.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by fufuyou ▴ 110