Orthomcl clusters to fasta
2
0
Entering edit mode
6.3 years ago
wocana ▴ 20

Any idea how to extract orthomcl Clusters to fasta?

I got 540 clusters i just want some of them on fasta.

The final output results like this:

cluster000: Ratn|NC_000067.6:7605211..7607094 Ratn|NC_000076.6:46984187..46986072 Ratn|NC_000072.6:126135324..126137210 Ratn|NC_000071.6:43059052..43060937 Ratn|NC_000069.6:92913971..92915859 Ratn|NC_000071.6:137999400..138001284 Ratn|NC_000084.6:3880095..3881981 Ratn|NC_000073.6:72629984..72631867 Ratn|NC_000073.6:14635959..14637845 Ratn|NC_000076.6:121566110..121567992 Ratn|NC_000086.7:51987005..51988889 Ratn|NC_000083.6:9587574..9589459 Ratn|NC_000073.6:82995991..82997877 Ratn|NC_000074.6:122158254..122160140 Ratn|NC_000074.6:4455815..4457701 Ratn|NC_000072.6:32035055..32036940
ORTHO MCL FASTA alignment CLUSTER • 2.2k views
ADD COMMENT
1
Entering edit mode

There is no need to SHOUT. I have removed the uppercase characters from your title.

ADD REPLY
0
Entering edit mode

Is that representation for one cluster cluster000? So you basically want to get the intervals represented there in one fasta sequence?

ADD REPLY
0
Entering edit mode

The cluster im interest of has 260 fasta IDs and i wanna extract them from the fasta file, first i organaized them 1 ID per line as you can see below "list.txt". Im using that script i modified it but its not working at all

ADD REPLY
0
Entering edit mode

im using this script but im just getting 136/260 i dont know why.. my list.txt look like this

CoGr|NW_001584579.1:68053..71823
CoGr|NC_008804.1:48922615..48926373
CoGr|NC_008801.1:545795465..545799243 
CoGr|NC_008804.1:161966386..161970130 
CoGr|NC_008807.1:226381677..226385438 
CoGr|NC_008805.1:191827747..191831534
CoGr|NC_008802.1:270867733..270871532 
CoGr|NC_008801.1:240232298..240236072
CoGr|NC_008803.1:202573007..202576775

SCRIPT

#!/usr/bin/env perl
my $list_file = $ARGV[0];
my $fasta_in = $ARGV[1];
my $fasta_out = $ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n";
open(FASTA_IN, "<", $fasta_in) or die "could not open '$fasta_in' : $! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open $fasta_out : $! \n";
my @headers = ();
while(<LIST_FILE>) {
    chomp;
    next if ( /^\s*$/ );
    push(@headers,_);
}
my $pat = join '|', map quotemeta, @headers;
$/ = ">";
while(<FASTA_IN>) {
    chomp;
    if ( /$pat/ ) { print FASTA_OUT ">$_"; }
}
close(LIST_FILE);
close(FASTA_IN);
close(FASTA_OUT);
ADD REPLY
0
Entering edit mode

Try using Bioperl SeqIO and Bio::DB::Fasta, so you don't have to re-invent the wheels handling FASTA files...

ADD REPLY
1
Entering edit mode
6.3 years ago

Since this is an orthomcl output, you should have a blast database of the proteins you clustered. So select the orhto fam you want, parse out the proteinsIDs and get those from the blastDB.

Get the list ids you need:

grep 'cluster000:' <mcloutput file> | sed 's/cluster000://' | sed 's/ /\n/g' > ids2get.lst

get those ids from the blastdb and write to file

blastdbcmd -db <your protein blastDB> -entry_batch ids2get.lst > proteins.fasta
ADD COMMENT
0
Entering edit mode

Didnt work

Error: [blastdbcmd] CoGr|NW_001583259.1:530924..534690: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:39070112..39072248: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:104399108..104402881: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:127199820..127202817: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:130223934..130226919: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:120801884..120804770: OID not found
Error: [blastdbcmd] CoGr|NC_008805.1:235446572..235450134: OID not found
Error: [blastdbcmd] CoGr|NC_008802.1:315125658..315129411: OID not found
Error: [blastdbcmd] CoGr|NC_008804.1:399210142..399213935: OID not found
Error: [blastdbcmd] CoGr|NC_008802.1:352763833..352767598: OID not found
Error: [blastdbcmd] CoGr|NC_008803.1:129563892..129567698: OID not found
ADD REPLY
1
Entering edit mode

are those the exact same IDs as you have in your input protein file? Can you provide the output of head on your input protein file?

You will at least have to remove the CoGR| part as that is added by orthomcl if I remember correctly

grep 'cluster000:' <mcloutput file> | cut -d ' ' -f2- | sed 's/CoGr|//g' |sed 's/ /\n/g' > ids2get.lst

I also replaced on of the sed cmds with cut (a bit neater)

ADD REPLY
1
Entering edit mode
6.3 years ago
wocana ▴ 20

Thx you all at the end first i modified and then used lieven.sterck command

grep 'cluster000:' <mcloutput file> | sed 's/cluster000://' | sed 's/ /\n/g' > ids2get.lst

and then the script runned correctly

#!/usr/bin/env perl
my $list_file = $ARGV[0];
my $fasta_in = $ARGV[1];
my $fasta_out = $ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n";
open(FASTA_IN, "<", $fasta_in) or die "could not open '$fasta_in' : $! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open $fasta_out : $! \n";
my @headers = ();
while(<LIST_FILE>) {
    chomp;
    next if ( /^\s*$/ );
    push(@headers,_);
}
my $pat = join '|', map quotemeta, @headers;
$/ = ">";
while(<FASTA_IN>) {
    chomp;
    if ( /$pat/ ) { print FASTA_OUT ">$_"; }
}
close(LIST_FILE);
close(FASTA_IN);
close(FASTA_OUT);
ADD COMMENT
0
Entering edit mode

Good to hear you managed, though the retrieve via the blastdbcmd is gonna be miles faster then this approach. but if it works and suits your needs then stick with it.

ADD REPLY

Login before adding your answer.

Traffic: 2370 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6