Matching multiple fasta alignments by identifier
2
0
Entering edit mode
9.5 years ago

This might be more of an infrastructure software problem than a biology problem, but I think it definitely would be of use to many.

My problem is that I have two different alignments with hundreds of sequences each. I need to be able to match alignments from the same organisms.

Say I have in file 1:

>BEAR
ADLKASHKQWEM
>CAT
AD-KASHK-W-M

And in file 2:

>BEAR
AP-HEL-ESE
>CAT
APTHELLESE

I want to produce the following file 3 (appended stuff in italics):

>BEAR_merged
ADLKASHKQWEM.AP-HEL-ESE
>CAT_merged
AD-KASHK-W-M.APTHELLESE

Has anyone ever met the same problem?

MSA alignment fasta merge sequence • 2.4k views
ADD COMMENT
3
Entering edit mode
9.5 years ago
5heikki 11k
cat file1
>BEAR
ADLKASHKQWEM
>CAT
AD-KASHK-W-M

cat file2
>BEAR
AP-HEL-ESE
>CAT
APTHELLESE

join -t "  " -1 1 -2 1 -o 1.1,1.2,2.2 <(cat file1 | tr "\n" "\t" | tr "\>" "\n" | sort -k1,1) <(cat file2 | tr "\n" "\t" | tr "\>" "\n" | sort -k1,1) | awk '{print ">"$1"_merged\n"$2"."$3}'
>_merged
.
>BEAR_merged
ADLKASHKQWEM.AP-HEL-ESE
>CAT_merged
AD-KASHK-W-M.APTHELLESE

You just need to remove the first two lines from output. The tab in join (-t " ") is a literal one (ctrl+v+tab). Also, there can be no linebreaks in sequences in file1/2

ADD COMMENT
0
Entering edit mode

Thanks for this reply! A bit quirky in case it the sequences are not in order, but I could accept this as an answer.

ADD REPLY
0
Entering edit mode

Sequence order doesn't matter because of sort -k1,1. However, as I wrote, linebreaks in sequences are absolutely not allowed.

ADD REPLY
1
Entering edit mode

Yep, the sorting step is a good idea. Does this work if the sequences names are not exactly the same (say BEAR is not in file 2)?

ADD REPLY
0
Entering edit mode

Default behavior would then be to omit BEAR. In this case, adding -a 1 to the join command would include unpairable seqs from file1

ADD REPLY
0
Entering edit mode

Oh, you're actually right yes. Linebreaks are a problem though as my sequences are longer than this, and because the aligned fasta format I'm using is using linebreaks. I could easily make a custom fix for this if i needed though

ADD REPLY
0
Entering edit mode

You can replace cat file1/2 with

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' file1/2

to deal with linebreaks..

ADD REPLY
1
Entering edit mode
9.5 years ago
cyril-cros ▴ 950

You could try to use Biopython. Bio.Seq.IO is able to read alignment files and leaves you a list of SeqRecord objects. You can use SeqIO.to_dict to access quickly a sequence using its name (dictionary of SeqRecords). Then, just iter on your sequence names and concatenate the sequences from your two files, accessing them with the two dictionaries you get.

If your sequences are in the exact same order, you could try awk maybe. Please tell use what alignment file format you use.

ADD COMMENT
0
Entering edit mode

This actually sounds like a better idea. I'm using the aligned fasta format, which I assume is quite standard. It would be really great to have an example of how I could do this literally. It's no guarantee that the sequences are in the same order.

ADD REPLY
1
Entering edit mode

The other solution should work if you have the same sequences. I am not using SeqIO to write the alignment because the seq1.seq2 format is unusual. Here is my code, which requires BioPython to be installed:

from Bio.SeqIO import index
dict1,dict2=index("file1.fasta", "fasta"),index("file2.fasta", "fasta")
                                                # Indexing works with large files
elmt1,elmt2=set(dict1),set(dict2)               # Testing you have the same elements
if elmt1-elmt2: print("Some elements in file 1 are not in 2:"," ".join(elmt1-elmt2))
if elmt2-elmt1: print("Some elements in file 2 are not in 1:"," ".join(elmt2-elmt1))
with open('mergedFile.fasta', 'w') as fileOut:
    for seqName in elmt1 & elmt2:               # I assume you only want common elements
        fastaFormatString=">%s\n%s\n" % (seqName,dict1[seqName].seq+"."+dict2[seqName].seq)
        fileOut.write(fastaFormatString)
ADD REPLY
0
Entering edit mode

Thanks a bunch! If I only had two thumbs! (:

ADD REPLY

Login before adding your answer.

Traffic: 1535 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6