Merge two multi-FASTA files and eliminate records with duplicated headers from the first
8.4 years ago

Hi, I have two multi-FASTA files. I want to merge them, deleting from the first file every FASTA record whose header also appears in the second file. The comparison has to be done by header, because the sequences under the same header differ between the two files.

Alternatively, could somebody give me a hint how to generate all the contigs (even those unchanged) through bcftools consensus?

Thanks, Pawel

Assembly genome sequence bcftools fasta • 2.2k views
8.4 years ago

You can use the BBMap package like this:

filterbyname.sh in=file1.fasta names=file2.fasta exclude out=file1_filtered.fasta
cat file1_filtered.fasta file2.fasta > combined.fasta

Hi Brian, first of all, you have my deep admiration for the tools you have produced. I have been using them for a year! When is a paper coming? I would cite it with pleasure! Your method worked best: unlike the two other methods I tried, RepeatMasker accepted the converted first multi-FASTA file without any additional checks!

Hi Pawel,

A paper on one of the tools should be submitted by the end of next week... I'll probably do some kind of short write-up of the suite overall soon, too, just to make it easier to cite.

-Brian

8.4 years ago

If the command line is an option, here's a Perl alternative:

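cat file1.fasta file2.fasta | perl -ne '
if (/^>/) {
    # header line: forget any record already stored under this header
    $header = $_;
    delete $seqs{$header};
} else { $seqs{$header} .= $_ }   # sequence line: append to the current record
END {
    # hash order is arbitrary, so records are not printed in input order
    foreach $header (keys %seqs) {
        print $header.$seqs{$header};
    }
}'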

Whichever file comes later in the initial cat overwrites earlier sequences that have the same header, as requested.
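
The same overwrite-by-header idea can be sketched with awk in a way that keeps the records in input order. This is only an illustrative sketch, not part of the answer above: the function name `merge_fasta_by_header` is hypothetical, headers are compared as whole lines (as in the Perl version), and the second file is assumed to be non-empty, since the `NR == FNR` trick relies on that.

```shell
merge_fasta_by_header() {
    # $1 = first FASTA, $2 = second FASTA (its records win on header clashes)
    awk '
        NR == FNR { if (/^>/) seen[$0] = 1; next }   # pass 1: collect headers of file 2
        /^>/      { keep = !($0 in seen) }           # pass 2: decide per record of file 1
        keep                                         # print lines of records that are kept
    ' "$2" "$1"
    # append the second file unchanged
    cat "$2"
}
```

Usage would look like `merge_fasta_by_header file1.fasta file2.fasta > combined.fasta`. Only the headers of the second file are held in memory, so the sequences themselves are streamed.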

Great, thanks Jorge, a very neat solution. Do you think there might be problems with the ends of the last lines in the files? RepeatMasker says some headers are too long, but they shouldn't be. There are over 6k sequences and I'm slow at scripting.

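It shouldn't have any problem with line endings, since it doesn't remove them when reading the input. You can force a newline after each line with the alternative below, which also keeps only the first whitespace-delimited token of every line (so headers are cut at the first space):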

cat file1.fasta file2.fasta | perl -ne '
# keep only the first whitespace-delimited token; skip blank lines
next unless /^(\S+)/;
$line = $1;
if ($line =~ /^>/) {
    $header = "$line\n";
    delete $seqs{$header};
} else { $seqs{$header} .= "$line\n" }
END {
    foreach $header (keys %seqs) {
        print $header.$seqs{$header};
    }
}'
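
To see which headers a downstream tool might reject as too long, a quick check along these lines may help. It is a rough sketch under assumptions: the function name `report_long_headers` is hypothetical, and the default limit of 50 characters is an assumed value that should be checked against the tool and version in use. A trailing carriage return (Windows line ending) is stripped before measuring, since it can make an ID look longer than it is.

```shell
report_long_headers() {
    # $1 = FASTA file, $2 = maximum allowed ID length (assumed default: 50)
    awk -v max="${2:-50}" '
        /^>/ {
            sub(/\r$/, "")              # drop a Windows carriage return, if any
            id = substr($1, 2)          # first whitespace-delimited token, without ">"
            if (length(id) > max) print FNR ": " id " (" length(id) ")"
        }
    ' "$1"
}
```

Each reported line gives the line number in the file, the ID, and its length, for example when called as `report_long_headers combined.fasta 50`.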