Question: Can Anyone Suggest Me A Script To Do A Gene Name Rearrangement

0

11.8 years ago

bibb77 ▴ 90

Hi everyone, I'm very new to programming. My task is to transform this genetic data:

  START    END    GENE
   69346001        69366001        SMN2
  140222001       140240001       PCDHA1  PCDHA@  PCDHA2  PCDHA3  PCDHA4  PCDHA5  PCDHA6  PCDHA7  PCDHA8  PCDHA9  PCDHA10

Into this:

START    END    GENE
69346001        69366001        SMN2
140222001       140240001       PCDHA1
140222001       140240001       PCDHA@
140222001       140240001       PCDHA2
140222001       140240001       PCDHA3
140222001       140240001       PCDHA4
140222001       140240001       PCDHA5
140222001       140240001       PCDHA6
140222001       140240001       PCDHA7
140222001       140240001       PCDHA8
140222001       140240001       PCDHA9
140222001       140240001       PCDHA10

So, basically, the script has to write each gene name from column 4 through the last column on a new line, with its respective genomic START and END position. An awk or Perl script would be nice. The input has ca. 250 lines, including lines with 3 or more gene names that need to be rearranged in the way I just showed.

Thanks in advance

programming awk perl gene annotation • 3.6k views

ADD COMMENT • link updated 11.8 years ago by jing ▴ 10 • written 11.8 years ago by bibb77 ▴ 90

score 4 · Answer 1 · 2013-10-07

4

11.8 years ago

swbarnes2 15k

The only way to learn is to try it yourself.

It's not likely that anyone wants to take the time to do this for you. Try it yourself, and if your script doesn't work, you can post it, the error message, explain some of the things you tried to fix it, and you are much more likely to get a helpful response.

People are far more likely to help someone who is trying, rather than someone who sits back and waits for the answer to roll in without lifting a finger.

ADD COMMENT • link 11.8 years ago by swbarnes2 15k

2

If you think I'm not trying to solve the problem myself, you are very wrong. I'm in a hurry, so I just posted the problem, but if you need more information or PROOF that I'm working on it too, this is the awk line that I'm trying to use, so far unsuccessfully:

#! /bin/bash/

for i in {3..18}
do
gawk '{if ($i != "") print $1 "\t" $2 "\t" $i;}' input > output
done

The count starts at 3 because that is the first gene-name column and ends at 18 because the longest line has 18 columns; the loop is meant to "walk" through the columns. But it's clear that I'm missing something, because the output is something else entirely:

69346001 69366001 69346001 69366001 SMN2

140222001 140240001 140222001 140240001 PCDHA1 PCDHA@ PCDHA2 PCDHA3 PCDHA4 PCDHA5 PCDHA6 PCDHA7 PCDHA8 PCDHA9 PCDHA10

ADD REPLY • link 11.8 years ago by bibb77 ▴ 90
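
A note on why the attempt above prints the whole line: inside single quotes the shell does not expand `$i`, so awk sees its own uninitialized variable `i`, and `$i` then means `$0`, the entire record. In addition, `> output` truncates the file on every pass of the loop, and the shebang should read `#!/bin/bash` with no trailing slash. A minimal sketch of a repaired version that keeps the column-by-column loop might look like this. The file names `sample_input.txt` and `sample_output.txt`, the awk variable `col`, and the shortened sample rows are assumptions made for illustration, and `seq` stands in for the `{3..18}` brace expansion only so the sketch also runs under plain `sh`:

```shell
#!/bin/bash
# Scratch files for the demonstration (both names are hypothetical).
infile=sample_input.txt
outfile=sample_output.txt

# Two rows modelled on the question's data, written tab-separated.
printf '69346001\t69366001\tSMN2\n' > "$infile"
printf '140222001\t140240001\tPCDHA1\tPCDHA@\tPCDHA2\n' >> "$infile"

: > "$outfile"    # empty the output file once, before the loop
for i in $(seq 3 18); do
    # -v copies the shell value into the awk variable col;
    # col <= NF skips lines that have no such column.
    awk -v col="$i" 'col <= NF { print $1 "\t" $2 "\t" $col }' "$infile" >> "$outfile"
done
cat "$outfile"
```

Because the loop walks one column at a time over the whole file, the result is grouped by column rather than by input line: with several multi-gene rows, all third-column genes come first, then all fourth-column genes, and so on. Looping over the columns inside awk avoids that reordering and reads the file only once.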

3

Thanks for posting evidence that you have tried to solve the problem. Next time, just include the information in your question - you'll get a much better response.

ADD REPLY • link 11.8 years ago by Neilfws 49k

score 3 · Answer 2 · 2013-10-07

3

11.8 years ago

Devon Ryan 105k

Since your reply to swbarnes2 indicated that you have given this some effort:

BEGIN {
    OFS="\t"
}
{
    for(i=3; i<= NF; i++) {
        print $1, $2, $i
    }
}

Usage: awk -f blah.awk input.txt

I assumed that the original columns were separated by tabs.
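
For a quick check, the same loop can be run as a one-liner on example rows from the question. This is only a sketch: the file names `sample.txt` and `expanded.txt` are made up for the demonstration, the gene list is shortened, and the input is written tab-separated to match the assumption above. With awk's default field separator, runs of spaces and tabs both count as separators on input, so space-aligned columns should split the same way; only the output is forced to tabs by `OFS`.

```shell
# Write the header and two example rows to a scratch file
# (sample.txt is a hypothetical name used only for this demonstration).
printf 'START\tEND\tGENE\n' > sample.txt
printf '69346001\t69366001\tSMN2\n' >> sample.txt
printf '140222001\t140240001\tPCDHA1\tPCDHA@\tPCDHA2\n' >> sample.txt

# Same logic as the script above, inlined: one output line per gene column.
awk 'BEGIN { OFS = "\t" } { for (i = 3; i <= NF; i++) print $1, $2, $i }' sample.txt > expanded.txt
cat expanded.txt
```

The header line has exactly three fields, so it passes through once, unchanged apart from the tab separators.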

ADD COMMENT • link 11.8 years ago by Devon Ryan 105k

1

It worked, thank you so much. But I'm shocked that everyone is playing judge and giving moral advice instead of just asking for more information if it is needed.

ADD REPLY • link 11.8 years ago by bibb77 ▴ 90

1

I don't think that your question is unclear or lacks information; it's simply off topic. I can't see any bioinformatics problem in it.

In the future you can simply reformulate your question like this:

>Input
1   2    A
3   4    B   C   D   E  

>Wanted output
1   2   A
3   4   B
3   4   C
3   4   D
3   4   E

And ask it on http://stackoverflow.com/

ADD REPLY • link 11.8 years ago by PoGibas 5.1k

1

I suggest you not take it personally. We are not being judgemental or moralistic. You need to realise that sites like this one attract a lot of "lazy questions" from casual users who just want someone to do their work for them. It is important that we as a community continually post reminders to discourage such questions and maintain site standards. The first answer did ask for more information and you provided it, which is great.

ADD REPLY • link 11.8 years ago by Neilfws 49k

0

Edit: I guess I should have updated prior to commenting. Neilfws already posted effectively the same thing!

Well, some background is in order regarding the reactions you got. It's very frequently the case both here and on seqanswers (and even the Bioconductor email list, on occasion), that people will ask for "help" with a script when what they actually want is for someone to just write it for them. In most of those cases, the person in question has put no effort into doing it themselves and will end up trying to get the whole community to do the entire analysis, including writing scripts, for him/her. After going through that a few times, a lot of people (myself included) start requiring a certain threshold of displayed effort before we're likely to help people. We probably shouldn't become jaded like that, but it's tough to not let a few bad apples spoil the whole bunch.

ADD REPLY • link 11.8 years ago by Devon Ryan 105k

score 0 · Answer 3 · 2013-10-07

I agree with swbarnes2, but then again it is also good to learn from working code...

use 5.16.0;
use Data::Dumper;

my %data;

while (<DATA>){
    next unless /^\d+/;
    chomp;
    my ($start,$end,$genes) = split /\s+/, $_, 3;
    for (split /\s+/,$genes){
        $data{$_} = {'start' => $start,
                     'end'   => $end};
    }
}
print Dumper(%data);

__DATA__
START    END    GENE
69346001        69366001        SMN2
140222001       140240001       PCDHA1  PCDHA@  PCDHA2  PCDHA3  PCDHA4  PCDHA5  PCDHA6  PCDHA7  PCDHA8  PCDHA9  PCDHA10


#output
$VAR1 = 'PCDHA7';
$VAR2 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR3 = 'PCDHA2';
$VAR4 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR5 = 'PCDHA6';
$VAR6 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR7 = 'PCDHA8';
$VAR8 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR9 = 'SMN2';
$VAR10 = {
    'end' => '69366001',
    'start' => '69346001'
};
$VAR11 = 'PCDHA@';
$VAR12 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR13 = 'PCDHA10';
$VAR14 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR15 = 'PCDHA9';
$VAR16 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR17 = 'PCDHA5';
$VAR18 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR19 = 'PCDHA1';
$VAR20 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR21 = 'PCDHA3';
$VAR22 = {
    'end' => '140240001',
    'start' => '140222001'
};
$VAR23 = 'PCDHA4';
$VAR24 = {
    'end' => '140240001',
    'start' => '140222001'
};