Script for replacing selenocysteine in a phylip file
6.4 years ago
ahmedmagds • 0

Hi, I need help. I do not have much coding experience. I am trying to replace selenocysteine (U) in a phylip file with X. How can I do that? I tried BioPerl without success. My attempt is below.

#!/usr/bin/perl

use Getopt::Std;
use Bio::Seq;
use Bio::SeqIO;
use Bio::AlignIO;
use Bio::SimpleAlign;

my %opts = ();
getopts ('f:', \%opts);
my $file  = $opts{'f'};

my $alnin = Bio::AlignIO->new (-format=>'phylip', -file=>"$file");
my $alnout = Bio::AlignIO->new (-format=>'phylip', -file=>"$file");
while (my $aln = $alnin->next_aln()){
    my $id       = $aln->id_linebreak();
    my $seq      = $aln->seq();
    $seq =~s/U/X/g;
    $seq =~s/O/X/g;
    print "$id\n$seq\n";
}

If it's a simple replacement, you can try sed 's/[UO]/X/g' test.txt


But this will also replace any U or O in a sequence header, e.g. in a sequence named AOUxyz. It would be great if you could modify it to avoid that.


I agree with that, @pb. However, I am waiting for example input from the OP. Most newcomers' posts do not include data, expected output, or the error message. If I can get hold of an example file with selenocysteine in phylip format, I can work it out.


Hello ahmedmagds,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!


Hello ahmedmagds,

could you please post an example of your input and desired output? What do you mean by "I tried to use Bioperl but still not successful"? And is perl mandatory or are other solutions fine as well?

fin swimmer


Thanks everyone for the help! Sorry for the bad formatting and for not posting an example file. The sed command is quick and easy to use, but the problem is that the names sometimes contain U/O. That happens once in the current file, but other files may have more cases. I will put a small part of the file here so we can discuss further:

NUM4039                                                                                                                                                              MIRFANVSKAYLGGKSALQGLSFHLPAGSMTYLVGHSGAGKSTLLKLIMGMERANGGQIWFNGHDITRLSRHEVPFLRRQIGMVHQDYRLLPDRSVLDNVALPLIIAGQHPKDANSRALAALDRVGLRDRANHLPAHLSGGEQQRIDIVRAVVHKPQLLLADEPTGNLDGALSLEIFNLFEEFNRLGMTVLIATHDIGIVQQKPKPCLVLEQGYLRMTISVENLNFFYGAUQALFDINLTADDGDVLVLLGPSGAGKSTLIRTLNLLEVPQSGKLSIANNQFDLSSGNDPKQLRQLRRDVGMVFQQYNLWPHMTVLQNLIEAPMKILGVSETEAKKQALELLQRLRLDEFADRFPLHLSGGQQQRVAIARALMMKPQVLLFDEPTAALDPEITAQIVSIIEELQQTGITQVIVTHEVNVAKKVTTKVVYMEQGHIIEIGDKSCFEQPHTEQFKQYLSHNI???MLDVRNIHKTFNGNQVLKGIDFQIQKGEUVAILGPSGSGKTTFLRCLNLLERAEQGTLHFND-GSLALDFAKKISKADELKLRRRSSMVFQQYNLFPHRTALENVMEGMLVVQKKSKAEAEQRAVELLTKVGLKDKMHLYPSQLSGGQQQRVGIARALAVQPDIILLDEPTSALDPELVGEVLQTLKLLAQEGWTMIIVTHELQFARDVADRVILMADGNVVEQNGAREFFENPQQERTKQFLLQAKI-PVCVEYEI??MIKLKNVSKIFDVSGKKLTALDNVSLDIPKGYICGVIGASGAGKSTLIRCVNLLEKPTMGAVIIDGNDLTQLSDAELVLERRNIGMIFQHFNLLSSRTVFDNVALPLELENTPKENIESKVNELLSLVGLSDKRNVYPSNLSGGQKQRVAIARALASNPKVLLCDEATSALDPATTQSILKLLKEINRTLGITILLITHEMDVVKRICDSVAIIDQGKLVEQGSVSDIFSNPKTELAQQFIRSTFNVNLPDEYLDNLLQTPKHAKSYPIIKFEFTGRSVDAPLLSQTSKKFGVELSILTSQIEYAGGVKFGFTVAEVEGDEDAITQAKIYLMENNVRVEVLGYVEMNEQVERKLLLEVNHLGVNFKIKNDKSLFFAKPQTLKAVKDVSFKLYAGETLGVVGESGCGKSTLARAIIGLVEASEGQILWLGKDLRKQSAKQWRNTRKDIQMIFQDPLASLNPRMNIGEIIAEPLKIYQPHLSKAQVKEKVQAMMLKVGLLPNLINRYPHEFSGGQCQRIGIARALIIEPKMIICDEPVSALDVSIQAQVVNLLKSLQKEMDLSLIFIAHDLAVVKHISDRVLVMYLVNAMELGTDDEVYKHTKHPYTKALMSAVPIPDPKLERNKSIQLLEGDLPSPINPPSGCVFRTRCLLADDSCAQQKPVFNSDNNSHFVACLKVSMPLLQVEDLTKSFKDSFGLFSSRHFHAVEQISFSLETGKTLAIIGRNGSGKSTLAKMIVGITKPTSGNILFKDNPLVFGDYHYRAKHIRMIFQDPNTAFNPRLNVGQVLDAPLLLTTKFDEQQRNQKIFDILKLVGMHPDHTNIKINTLSVSQKQRIALARALILNPEVVIIDDALGSLDATVKTQLTNLVLELQEKLKLAYIYVGQNLGIIKHIADTILVMEDGKMIEYGDTHSLFTAPKTDVTKRLVESHFGKILDDSSWKNEDLANKVRMTNSTHLLDVQNLHVGFKTPDGIVTAVNDLNFTLDAGHTLGIVGESGSGKSQTAFALMGLLAPNGEVSGSALFDEVQLVNLPIEKLNKIRAEQISMIFQDPMTSLNPYMKIGEQLMEVLILHKGYDKQTAFNESVKMLDAVKMPEAKKRMGMYPHEFSGGMRQR
VMIAMALLCRPKLLIADEPTTALDVTVQAQIMTLLNELKHEFNTAIIMITHDLGVVAGICDHVLVMYAGRTMEYGNAEQIFYHPTHPYSIGLMDAIPRLDMDEEHLVTIPGNPPNLLHLPQGCPFSPRCRFASEQCQKTPPKLTALHDGRLRNCWLPAEEFALMSENIILELKNISKRFFGVTVLDDIHLDIRQGEVLCLIGENGAGKSTLCKIIAGIYHCDEGEMFYSDRKYSPDTVKQAQEAGIGFIHQELMLVPKLTVLENIFLGSEKTSSLGCMNWTVMREKTQHIINELELDIKPDDLISDLSIAQQQMVEIAKAVFSEYKIIIFDEPTSSISRKNTEVLFNIIHQLKAKGVAMIYISHRLEEFKYIADRVTVLRDGRITGTMRYQDTSPEEIVRLMVGRKIDFTRYLRTGSFNQEKLRVENLHNKYIKPISFSVNKGEILGFAGLVGAGRTEVLRAIYGADQTSGKIYIDGKEIKINSPEDAVKHKIGLITEDRKSQGLVLGMSIRENITLPILKRFWRKFYLDKKQERRVAEKNRTKLHIVSHDQEQQTKTLSGGNQQKVILARWLESGVDILFFDEPTRGIDIGAKSEIYDLMRQFTESGGTIVMVSSDLPELITISDHIVVMRNGEKIKEITDRTEITEENLMHLMIGV

That can be handled with awk and sed, @OP.

6.4 years ago
pbpanigrahi ▴ 430

Since you have not given an input file but mentioned that the type is phylip, I took an example phylip format file from here, with U and O placed at random, and tried to substitute U->X and O->X using the code below.

 cat phylip.txt | perl -ne 'my @x=split /( +)/,$_; print $x[0]; map{ my $y=$_; $y=~s/U/X/g;$y=~s/O/X/g; print $y;}@x[1..$#x];'

What the script does

It loops over each line, skips the first column, and then looks for U/O in the rest of the line and substitutes X.

It is a quick solution, not well tested. You can check with your file if this works.
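
For readers who prefer Python, the same idea can be sketched as follows. This is only an illustration, not a tested replacement for the one-liner: the function name `mask_residues` and its parameters are made up for this example, and it assumes that the ID is the text before the first space on each line.

```python
import re


def mask_residues(line, residues="UO", mask="X"):
    """Replace the given residues with `mask` everywhere except in the first column."""
    # Group 1 is everything up to the first space (the ID),
    # group 2 is the rest of the line, including the separating spaces.
    match = re.match(r"([^ ]*)(.*)", line, flags=re.DOTALL)
    name, rest = match.group(1), match.group(2)
    # Only the part after the ID is modified, so U/O in the ID survive.
    return name + re.sub(f"[{re.escape(residues)}]", mask, rest)


print(mask_residues("AOUxyz   MUKOL"))
print(mask_residues("MUKOL"))
```

One caveat of skipping the first column: a line that contains no space at all (for example a wrapped continuation line without an ID) is treated entirely as the first column, so it is left unchanged.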

Thanks

Priyabrata


pb.panigrahi86 +1 for that. I love one liners!

6.4 years ago

Now with the example, here is a solution with sed and awk, as cpad0112 suggested.

$ sed 's/ /\t./g' phylip.txt|awk -v OFS="\t" '{id=$1; gsub("[OU]", "X", $0); $1=id; print}'|sed 's/\t./ /g'

fin swimmer


EDIT:

Pure awk solution (the three-argument match() needs GNU awk):

$ awk '{match($0, "(^[^ ]*)(.*)", s); gsub("[OU]", "X", s[2]); print s[1]s[2]}' phylip.txt

For the example provided by the OP, this may be enough (assuming that IDs are in the first column and sequences are in the second column):

$ awk -v OFS="\t" '{gsub("[OU]", "X",$2)}1' test.txt

Hello cpad,

If I understood this crazy phylip file format correctly, the amount of whitespace between the ID and the sequence matters. There seem to be two possibilities: either the ID column is exactly 10 characters wide, or all sequences in the file start at the same position.

This is why I chose the long way round with sed.
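
To see which of the two layouts a file uses, one could check where the sequence starts on each line. The following Python sketch is an assumption-laden illustration: the helper names `sequence_start` and `starts_aligned` are invented here, and it assumes that IDs contain no spaces and are followed by at least one space.

```python
import re


def sequence_start(line):
    """Return the index where the sequence begins, or None if the line has no ID column."""
    # An ID is taken to be a run of non-space characters followed by spaces.
    match = re.match(r"[^ ]+ +", line)
    return match.end() if match else None


def starts_aligned(lines):
    """True if every line that has an ID starts its sequence at the same column."""
    starts = {sequence_start(line) for line in lines}
    starts.discard(None)  # ignore lines without an ID column
    return len(starts) <= 1


print(starts_aligned(["NUM4039   MIRF", "NUM12     MUKO"]))
print(starts_aligned(["ab  MU", "abc  MU"]))
```

If the common start position is 10, the file is compatible with the strict 10-character ID column; any other single value suggests the "same start position" layout.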

fin swimmer


Then let us make the change after the first 10 characters and skip the first line (or whatever number of lines).

Maybe this works when there is one ID and one sequence per line. It skips the first line, inserts a tab after the first 10 characters of every other line, replaces the characters in the second field, and removes the tab again. Even if the header line contains tabs, it works.

sed  '1! s/^.\{10\}/&\t/g' sed.text | awk -v OFS="" 'NR!=1 {gsub("[OU]", "X",$2)}1'

output:

abcdef  OUjiOU
abcdefOUjiXX

input:

$ cat sed.text 
abcdef  OUjiOU
abcdefOUjiOU
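
The steps described above (leave the first line alone, protect the first 10 characters, replace in the rest) can also be sketched in Python. This is a rough sketch under the same assumptions as the command: one ID and one sequence per line and a fixed ID width. The function name `mask_phylip_text` and the `name_width` parameter are hypothetical.

```python
def mask_phylip_text(text, name_width=10, residues="UO", mask="X"):
    """Mask residues after the first `name_width` characters of every line but the first."""
    # Translation table mapping each residue to the mask character.
    table = str.maketrans({residue: mask for residue in residues})
    out = []
    for number, line in enumerate(text.splitlines(keepends=True)):
        if number == 0:
            out.append(line)  # first line is copied unchanged
        else:
            # Keep the ID column as-is and translate only the sequence part.
            out.append(line[:name_width] + line[name_width:].translate(table))
    return "".join(out)


example = "abcdef  OUjiOU\nabcdefOUjiOU\n"
print(mask_phylip_text(example), end="")
```

Because the line is cut by position rather than split into whitespace-separated fields, spaces inside the ID column or in the sequence are preserved.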