OK, based on the specs you gave in the comments, a simple Perl script should do the job. Ergo:
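use strict;
use warnings;
use Getopt::Long;

my ($optHelp, $optFile1, $optFile2, $optList);
GetOptions('h' => \$optHelp, 'a=s' => \$optFile1, 'b=s' => \$optFile2, 'l=s' => \$optList);

if ($optHelp || !$optFile1 || !$optFile2 || !$optList) {
    print "Usage:\n\n";
    print "./program -a file1 -b file2 -l list<.tsv> \n\n";
    print "Note: sequences from file2 are replaced with those in file1 according to the list -l\n\n";
    exit(0);
}

my %hash1 = ();   # file1 header => sequence (all lines joined)
my %list  = ();   # file2 header => header of the replacement in file1

open(my $one,     "<", $optFile1) or die "Cannot open $optFile1: $!";
open(my $two,     "<", $optFile2) or die "Cannot open $optFile2: $!";
open(my $list_fh, "<", $optList)  or die "Cannot open $optList: $!";

# Read the list: column 1 is the ID in file2, column 2 the ID in file1 (tab-separated).
while (<$list_fh>) {
    chomp;
    next unless /^(.*?)\t(.*)/;   # skip lines without a tab rather than reuse a stale $1/$2
    $list{$1} = $2;
}
close $list_fh;

# Load every sequence of file1 into memory, keyed by its header.
my $head;
while (<$one>) {
    chomp;
    if (/^>(.*)/) { $head = $1; next; }
    next unless defined $head;    # ignore anything before the first header
    $hash1{$head} .= $_;
}
close $one;

# Stream file2: listed records get the file1 sequence, all others pass through unchanged.
my $skip = 0;   # true while inside a file2 record that has been replaced
while (<$two>) {
    chomp;
    if (/^>(.*)/) {
        my $new = $list{$1};
        if (defined $new && exists $hash1{$new}) {
            print "$_\n$hash1{$new}\n";   # keep the file2 header, swap the sequence
            $skip = 1;
        }
        else {
            $skip = 0;                    # not listed, or replacement missing from file1
        }
    }
    print "$_\n" unless $skip;
}
close $two;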
Copy the code into a plain-text file and run it with:
perl program_name -a first_file -b second_file -l list_of_fasta_ids > output.file
Let me know if there are any problems with the program, since it is untested.
cheers
UPDATE:
Corrected bugs are marked with #e. I did say I hadn't tested it :)
So, given:
file1
>Simulated_Sequence1
ATGGACGGGATTAATCCTGAATACTCTAACAGAAAGAGCTCCAATTATCATCTATACGGCCGGGAGAGTATCGCATGGGC
ATTAATATCCTATTCACTTC
>Simulated_Sequence2
ATGTTACTACGGGGTGGTCCGTTTAGCTCATATCCATCAACGTGGGACGACCTTGATCGAAGCACCCTCGCACGGTTTTA
TGGTGCTCGGATATACCGCC
>Simulated_Sequence3
ATGTTTTGCGTACAGGGCGTCACCCCCCGCGATTTATTTGCTGGCGAATCAGACGGCCTGGACGGTAGTGCCCGCATATG
TGCTGGTAGCGGCCTTTATG
>Simulated_Sequence4
ATGGACTCATATTTCGCCGCGTTTTACTTTGTTATCTATTGTCTGCTTGACGGCCACGTTGTGTACGGGTACACATCAAA
GTACTGCCGGGTTACAATGT
file2
>hSimulated_Sequence1
ATGTCTATTCTCACCGTTAGACAGGAATCCCGAGTCGAAGGAGGCTTTATGTCTATATGTCGGCATTATCCGTACGGCCT
GTTCTACGTAACATTTTCAT
>hSimulated_Sequence2
ATGCCTCAACAAGTCTTATGGAGGAATTTACGTGCTCCGCGGCCATTATACCCAGGATCACTGGGGATGTACTCACGAAT
AGTGGGTCGTGACAGGCAGG
>hSimulated_Sequence3
ATGGGTGCTGGCGAACAGATCGACCTGCTTGACTCTACCTCCTATCCCGTCCAGGACACTCTAGCTTTTTATAGCCGGGC
CAACCAGGGGAGAGAAATAA
>hSimulated_Sequence4
ATGATCCAATCGAAACCGGAAATGAGATCCATTCGGTGTCCGGGTGAATCTCGCGAGGGATGGTATACATCGCATATTGG
TACTAAGCTCTGGGTATCGG
list (two tab-separated columns: ID in file2, then ID of the replacement in file1)
hSimulated_Sequence2	Simulated_Sequence2
hSimulated_Sequence4	Simulated_Sequence3
output:
perl program.pl -a file1 -b file2 -l list
>hSimulated_Sequence1
ATGTCTATTCTCACCGTTAGACAGGAATCCCGAGTCGAAGGAGGCTTTATGTCTATATGTCGGCATTATCCGTACGGCCT
GTTCTACGTAACATTTTCAT
>hSimulated_Sequence2
ATGTTACTACGGGGTGGTCCGTTTAGCTCATATCCATCAACGTGGGACGACCTTGATCGAAGCACCCTCGCACGGTTTTATGGTGCTCGGATATACCGCC
>hSimulated_Sequence3
ATGGGTGCTGGCGAACAGATCGACCTGCTTGACTCTACCTCCTATCCCGTCCAGGACACTCTAGCTTTTTATAGCCGGGC
CAACCAGGGGAGAGAAATAA
>hSimulated_Sequence4
ATGTTTTGCGTACAGGGCGTCACCCCCCGCGATTTATTTGCTGGCGAATCAGACGGCCTGGACGGTAGTGCCCGCATATGTGCTGGTAGCGGCCTTTATG
I leave proper sequence formatting to you as an exercise :)
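For the formatting exercise, one possible approach is a small helper that breaks a sequence string into fixed-width lines. This is only a sketch: `wrap_seq` is a hypothetical name that does not appear in the script, and the default width of 80 is an assumption, roughly matching the line length used in the example files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): split a sequence
# string into lines of at most $width characters, joined by newlines.
sub wrap_seq {
    my ($seq, $width) = @_;
    $width = 80 unless defined $width;      # assumed default line width
    my @chunks = $seq =~ /(.{1,$width})/g;  # greedy chunks of up to $width characters
    return join("\n", @chunks);
}

# A 100-character sequence is printed as one 80-character line and one 20-character line.
print wrap_seq("ATGC" x 25), "\n";
```

In the script, the replacement sequence would be passed through `wrap_seq` just before it is printed, so that replaced records are wrapped like the untouched ones.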
mxs
What happens to the sequences in file 2 that you do not wish to replace? Should they stay there or should they be removed?
The other sequences should stay there.
One more question: are the files big? How many entries do they contain, and how large is an individual FASTA record on average? Also, do you need this to run fast, or can you spare some time?
About size: files 1 and 2 contain about 47,000 and 150,000 sequences, respectively. Like everyone, I would like it to run fast.
You might just write a program using Biopython or BioPerl to do that. That will likely prove to be the simplest method.
I guess, but it's not easy for me as I'm basically a biologist. Could you please tell me whether this process is right?
First, I remove the sequences of interest from file 2 (I mean those that must be replaced with sequences from file 1), then I merge the file of desired sequences from file 1 with file 2 using the cat command. Is that right?
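The two-step process described here (remove, then concatenate) can be sketched in Perl as follows. This is a minimal illustration working on in-memory strings, not a tested tool: `parse_fasta` and `remove_then_append` are hypothetical names, and the toy IDs and sequences are made up for the example. Note that with this approach the replacement records end up at the end of the output under their file 1 headers, whereas the script above keeps the file 2 header and the original position.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA text into a list of [header, sequence] pairs.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        if ($line =~ /^>(.*)/) { push @records, [$1, '']; }
        elsif (@records)       { $records[-1][1] .= $line; }
    }
    return @records;
}

# Step 1: drop from file2 every record whose ID is a key of %$list.
# Step 2: append the file1 records named as values of %$list (the "cat" step).
sub remove_then_append {
    my ($file1_text, $file2_text, $list) = @_;
    my %wanted = map { $_ => 1 } values %$list;
    my @kept   = grep { !exists $list->{ $_->[0] } } parse_fasta($file2_text);
    my @added  = grep { $wanted{ $_->[0] } } parse_fasta($file1_text);
    my $out = '';
    $out .= ">$_->[0]\n$_->[1]\n" for @kept, @added;
    return $out;
}

# Toy data (made up): hA2 in file2 is to be replaced by A2 from file1.
print remove_then_append(
    ">A1\nAAAA\n>A2\nCCCC\n",
    ">hA1\nGGGG\n>hA2\nTTTT\n",
    { hA2 => 'A2' },
);
```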