Deleting A String Which Is Part Of A Sequence Id
12.9 years ago
Oduro ▴ 20

I have a file with multiple sequences and their sequence IDs:

>LR24F01_Bbc10_2_15-53841001-53841229
atgcccgccccccgcgccgcccccccctctctcgct
>MT24F01_Bbc10_2_15-53841001-53841229
atgcccgccccccgcgccgcccaaccctctctcgct
>LP39F01_Bbc10_2_15-53841001-53841229
atgcccgctccccgcgccgcccaaccctctctcgct
...... etc

I am hoping someone can help with a simple Perl script or Linux command line to remove the -53841001-53841229 portion of each sequence ID. I want my output to look like this:

>LR24F01_Bbc10_2_15
atgcccgccccccgcgccgcccccccctctctcgct
>MT24F01_Bbc10_2_15
atgcccgccccccgcgccgcccaaccctctctcgct
>LP39F01_Bbc10_2_15
atgcccgctccccgcgccgcccaaccctctctcgct
...... etc

Thanks.

sequence format • 2.5k views

Just a note for everyone: when lines begin with ">", BioStar formats the text as a blockquote. This can confuse people giving answers, since they don't realise that the sequence was supposed to be in fasta format. If you indent lines with 4 spaces, fasta displays properly.

12.9 years ago
Frozenwithjoy ▴ 180

Try this 'simple' linux approach:

cut -d - -f1 old.fa > new.fa

It cuts your file into fields using '-' as the delimiter and then returns the first field.
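One caveat worth checking: cut is applied to every line, not only to headers, so a sequence line that contains a gap character '-' is truncated as well. A minimal sketch on made-up records (the gapped second sequence is an invented example, not taken from the question, and demo_fasta is just an illustrative helper):

```shell
# Hypothetical two-record FASTA; the second sequence contains gap characters.
demo_fasta() {
  printf '%s\n' \
    '>LR24F01_Bbc10_2_15-53841001-53841229' \
    'atgcccgcc' \
    '>MT24F01_Bbc10_2_15-53841001-53841229' \
    'atg--cgcc'
}

# Same cut command as above, reading from stdin instead of old.fa.
# Lines without a '-' pass through whole; the gapped line is cut at its first '-'.
demo_fasta | cut -d - -f1
```

If the file has no gaps, as in the question, this side effect does not matter.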

12.9 years ago
boczniak767 ▴ 870

You can replace the first hyphen with a space (I assume that the sequence ID and the sequence are separated by a space character). You'll get a file with three space-delimited fields, from which you can cut the first and third.

sed 's/-/ /' your_file | cut -f1,3 -d" " > result

This answer is incorrect, but it's not your fault. The sequences were not displayed properly in the original question - they are supposed to be fasta format.

12.9 years ago
Neilfws 49k

Provided that all of your header lines are of the form given in your question, the following sed should work:

sed 's/-[0-9]*-[0-9]*//g' myfile.fa > mynewfile.fa

That says: "replace hyphen, followed by digits, followed by hyphen, followed by digits, with nothing."
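If sequence lines might contain hyphens (for example alignment gaps), the substitution can be limited to header lines and anchored at the end of the line. A sketch, assuming every header ends in exactly -digits-digits as in the question; the function name strip_coords is only illustrative:

```shell
# Illustrative helper: strip a trailing "-<digits>-<digits>" from FASTA headers only.
strip_coords() {
  # /^>/ restricts the substitution to header lines;
  # [0-9][0-9]* requires at least one digit; $ anchors the match at end of line.
  sed '/^>/ s/-[0-9][0-9]*-[0-9][0-9]*$//' "$@"
}

# Example on two lines fed through stdin; the gapped sequence line is left alone.
printf '%s\n' '>LR24F01_Bbc10_2_15-53841001-53841229' 'atg--cgcc' | strip_coords
```

On a file it could be called as strip_coords myfile.fa > mynewfile.fa.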


A more generic sed -e 's/-.*//' myfile.fa > mynewfile.fa also works (use sed -i to directly modify the source file).

12.9 years ago
Gjain 5.8k

If you are using Perl:

use strict;
use warnings;

# The input file name is an example; change it to your own file.
# This assumes each sequence occupies a single line after its header.
open(INFILE, '<', 'myfile.fa') or die "Cannot open myfile.fa: $!";

while (my $line = <INFILE>) {
    chomp($line);
    next if ($line =~ m/^\#/ or $line =~ m/^\"/ or $line =~ /^\s*$/); # skip comments and empty lines

    # line = >LR24F01_Bbc10_2_15-53841001-53841229

    # Split the line on "-"
    my ($id, $start_coord, $end_coord) = split(/-/, $line);

    # the result will be
    # id = >LR24F01_Bbc10_2_15
    # start_coord = 53841001
    # end_coord = 53841229

    # the sequence is on the next line, so read that line
    my $sequence = <INFILE>;
    chomp $sequence;
    # sequence = atgcccgccccccgcgccgcccccccctctctcgct

    # now print them together
    print "$id\n$sequence\n";
}
close(INFILE);

I wrote this out in detail so that you can follow each step. As Maciej's answer shows, it can be done with a one-line command, but it's good to understand how to break the problem down and solve it step by step.

I hope this helps.


See my comment for Maciej, above.


Well, this can be adapted for that.


I have modified my code.

12.9 years ago

To complete the two sed-based examples, here is a pure Bash solution (based on pattern substitution):

while IFS= read -r l ; do printf '%s\n' "${l%%-*}" ; done < myfile.fa

Edit: note that this does not work for sequences containing gaps, because a '-' on a sequence line is treated as a cut point too.
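One way around the gap limitation is to apply the pattern substitution to header lines only. A rough sketch in plain shell; the function name trim_headers is hypothetical, and it assumes the part to keep never contains a hyphen:

```shell
# Illustrative helper: trim everything from the first "-" on header lines only.
trim_headers() {
  # "|| [ -n "$l" ]" also handles a final line that lacks a trailing newline.
  while IFS= read -r l || [ -n "$l" ]; do
    case $l in
      '>'*) printf '%s\n' "${l%%-*}" ;;  # header: drop the longest suffix starting at "-"
      *)    printf '%s\n' "$l" ;;        # sequence: print unchanged, gaps included
    esac
  done
}

# Example on two lines fed through stdin.
printf '%s\n' '>LP39F01_Bbc10_2_15-53841001-53841229' 'atg--cgct' | trim_headers
```

On a file it could be called as trim_headers < myfile.fa > mynewfile.fa.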

12.9 years ago
Woa ★ 2.9k

If you're very sure that the hyphen appears only in the unwanted part of the ID for all of the sequences, you can use the following Perl split:

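# The file name is an example; change it to your own file.
open(INPUTFILEHANDLE, '<', 'myfile.fa') or die "Cannot open myfile.fa: $!";
while(<INPUTFILEHANDLE>){
      chomp;   # remove the newline so that each print adds exactly one
      if(/^>/){
             # split the header at the first hyphen only
             my ($id, $unwanted)=split(/\-/,$_,2);
             print $id,"\n";
       }
      else{
             print $_,"\n";
       }
}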
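
Before relying on the split, it may be worth checking the assumption about hyphens. A rough sketch in shell, assuming well-formed headers contain exactly two hyphens as in the question; the function name odd_headers is hypothetical:

```shell
# Hypothetical check: print headers whose hyphen count is not exactly 2.
odd_headers() {
  # With "-" as the field separator, a line with two hyphens has NF == 3.
  awk -F '-' '/^>/ && NF != 3' "$@"
}

# Example: the second header (invented for illustration) has a hyphen inside the ID
# and is reported; the sequence line is ignored even though it contains gaps.
printf '%s\n' '>LR24F01_Bbc10_2_15-53841001-53841229' '>LR-24F01-1-2' 'atg--c' | odd_headers
```

On a file it could be called as odd_headers myfile.fa; no output would mean every header fits the expected pattern.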