Question

Deleting A String Which Is Part Of A Sequence Id

2

12.9 years ago

Oduro ▴ 20

I have a file with multiple sequences and sequence IDs:

>LR24F01_Bbc10_2_15-53841001-53841229
atgcccgccccccgcgccgcccccccctctctcgct
>MT24F01_Bbc10_2_15-53841001-53841229
atgcccgccccccgcgccgcccaaccctctctcgct
>LP39F01_Bbc10_2_15-53841001-53841229
atgcccgctccccgcgccgcccaaccctctctcgct
...... etc

I would like a simple Perl script or Linux command line that removes the trailing portion of each sequence ID (here, -53841001-53841229), so that my output looks like this:

>LR24F01_Bbc10_2_15
atgcccgccccccgcgccgcccccccctctctcgct
>MT24F01_Bbc10_2_15
atgcccgccccccgcgccgcccaaccctctctcgct
>LP39F01_Bbc10_2_15
atgcccgctccccgcgccgcccaaccctctctcgct
...... etc

Thanks.

sequence format • 2.5k views

ADD COMMENT • link updated 12.9 years ago by Woa ★ 2.9k • written 12.9 years ago by Oduro ▴ 20

3

Just a note for everyone: when lines begin with ">", BioStar formats the text as a blockquote. This can confuse people giving answers, since they don't realise that the sequence was supposed to be in fasta format. If you indent lines with 4 spaces, fasta displays properly.

ADD REPLY • link 12.9 years ago by Neilfws 49k

score 3 · Answer 1 · 2012-01-14

3

12.9 years ago

Frozenwithjoy ▴ 180

Try this 'simple' Linux approach:

cut -d - -f1 old.fa > new.fa

It cuts your file into fields using '-' as the delimiter and then returns the first field.
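One caveat: cut splits every line, so a sequence line that contains gap characters (-) would be truncated as well. Here is a minimal sketch of a header-only variant, assuming awk is available. The sample records and the variable names fasta and result are made up for illustration:

```shell
# Hypothetical sample: two records, the first sequence contains gap characters.
fasta='>LR24F01_Bbc10_2_15-53841001-53841229
atgccc--gcc
>MT24F01_Bbc10_2_15-53841001-53841229
atgcccgcc'

# On header lines, print only the first '-'-delimited field;
# every other line is copied through unchanged.
result=$(printf '%s\n' "$fasta" | awk -F'-' '/^>/ { print $1; next } { print }')
printf '%s\n' "$result"
```

For a real file, the same awk program can be run as awk -F'-' '...' old.fa > new.fa.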

ADD COMMENT • link 12.9 years ago by Frozenwithjoy ▴ 180

score 2 · Answer 2 · 2012-01-13

2

12.9 years ago

boczniak767 ▴ 870

You can replace the first hyphen with a space (I assume that the sequence ID and the sequence are separated by a space character). You'll get a file with three space-delimited fields, from which you can cut the first and third.

sed 's/-/ /' your_file | cut -f1,3 -d" " > result

ADD COMMENT • link 12.9 years ago by boczniak767 ▴ 870

0

This answer is incorrect, but it's not your fault. The sequences were not displayed properly in the original question - they are supposed to be fasta format.

ADD REPLY • link 12.9 years ago by Neilfws 49k

score 2 · Answer 3 · 2012-01-13

2

12.9 years ago

Neilfws 49k

Provided that all of your header lines are of the form given in your question, the following sed should work:

sed 's/-[0-9]*-[0-9]*//g' myfile.fa > mynewfile.fa

That says: "replace hyphen, followed by digits, followed by hyphen, followed by digits, with nothing."
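If the sequences might contain gaps, note that [0-9]* also matches zero digits, so a run such as "--" inside a sequence line would be removed too. A possible stricter variant, sketched under the assumption that your sed supports -E (GNU and BSD sed both do), only touches header lines and requires at least one digit after each hyphen. The sample data and variable names are hypothetical:

```shell
# Hypothetical sample: one record whose sequence contains a gap run.
records='>LP39F01_Bbc10_2_15-53841001-53841229
atgcc--cgct'

# /^>/ restricts the substitution to header lines;
# [0-9]+ needs one or more digits, and $ anchors the match at the end of the line.
cleaned=$(printf '%s\n' "$records" | sed -E '/^>/ s/-[0-9]+-[0-9]+$//')
printf '%s\n' "$cleaned"
```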

ADD COMMENT • link 12.9 years ago by Neilfws 49k

0

A more generic sed -e 's/-.*//' myfile.fa > mynewfile.fa also works (use sed -i to directly modify the source file).

ADD REPLY • link 12.9 years ago by Frédéric Mahé ★ 3.2k

score 0 · Answer 4 · 2012-01-13

0

12.9 years ago

Gjain 5.8k

If you are using perl,

use strict;
use warnings;

# Open the FASTA file named on the command line
open(INFILE, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";

while(my $line=<INFILE>){
    chomp($line);
    next if(($line =~ m/^\#/) or (($line =~ m/^\"/))or ($line eq "")or ($line =~ /^\s*$/)); # skip comments, or empty lines

    #line = >LR24F01_Bbc10_2_15-53841001-53841229

    # Split the line on "-"
    my ($id, $start_coord, $end_coord) = split(/\-/,$line);

    # the result will be
    # id = >LR24F01_Bbc10_2_15 (the leading ">" is kept)
    # start_coord = 53841001
    # end_coord = 53841229

    # the sequence is on the next line (this assumes one sequence line per record)
    my $sequence = <INFILE>;
    chomp $sequence;
    #sequence = atgcccgccccccgcgccgcccccccctctctcgct

    # now print them all together
    print "$id\n$sequence\n";
}

close(INFILE);

I wrote this in detail so that you can understand each step. As Maciej's answer shows, it can be done as a one-line command, but it's good to understand how to break the problem down and apply each step.

I hope this helps.

ADD COMMENT • link 12.9 years ago by Gjain 5.8k

0

See my comment for Maciej, above.

ADD REPLY • link 12.9 years ago by Neilfws 49k

0

Well, this can be adapted for that.

ADD REPLY • link 12.9 years ago by Gjain 5.8k

0

I have modified my code.

ADD REPLY • link 12.9 years ago by Gjain 5.8k

score 0 · Answer 5 · 2012-01-14

0

12.9 years ago

Frédéric Mahé ★ 3.2k

To complete the two sed-based examples, here is a pure Bash solution (based on parameter expansion: ${l%%-*} removes the longest suffix matching -*):

while read l ; do echo "${l%%-*}" ; done < myfile.fa

Edit: Note that this does not work for sequences containing gaps (-), because every line, not just the header, is cut at its first hyphen.
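A possible way to keep the pure Bash approach while leaving sequence lines alone is to trim only the lines that start with ">". This is a rough sketch, not a tested drop-in replacement. The function name trim_headers and the sample data are hypothetical:

```shell
# Hypothetical helper: reads FASTA on stdin, trims header lines only.
trim_headers() {
  while IFS= read -r line; do
    case $line in
      '>'*) printf '%s\n' "${line%%-*}" ;;  # header: drop everything from the first '-'
      *)    printf '%s\n' "$line" ;;        # sequence: copy as is, gaps included
    esac
  done
}

# Hypothetical sample with a gapped sequence.
sample='>MT24F01_Bbc10_2_15-53841001-53841229
atg-ccc-gct'

trimmed=$(printf '%s\n' "$sample" | trim_headers)
printf '%s\n' "$trimmed"
```

With a real file this would be called as trim_headers < myfile.fa > mynewfile.fa.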

ADD COMMENT • link 12.9 years ago by Frédéric Mahé ★ 3.2k

score 0 · Answer 6 · 2012-01-16

If you're sure that hyphens appear only in the unwanted part of the ID for all of the sequences, you can use the following Perl split:

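# Open the FASTA file named on the command line
open(INPUTFILEHANDLE, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";

while(<INPUTFILEHANDLE>){
      chomp;   # strip the newline so each line is printed with exactly one
      if(/^>/){
             # keep only the part of the header before the first hyphen
             my ($id, $unwanted)=split(/\-/,$_,2);
             print $id,"\n";
       }
      else{
             print $_,"\n";
       }
}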