You just need to remove the first two lines from the output. The tab in join (-t " ") is a literal tab (Ctrl+V, then Tab). Also, sequences in file1/2 must not contain linebreaks.
Oh, you're actually right, yes. Linebreaks are a problem, though, since my sequences are longer than this and the aligned fasta format I'm using wraps lines. I could easily write a custom fix for this if I needed to, though.
You could try Biopython. Bio.SeqIO can read alignment files and gives you a list of SeqRecord objects. You can use SeqIO.to_dict to get a dictionary of SeqRecords, so a sequence can be looked up quickly by its name. Then just iterate over your sequence names and concatenate the sequences from your two files, looking them up in the two dictionaries you get.
If your sequences are in the exact same order, you could maybe try awk. Please tell us which alignment file format you use.
This actually sounds like a better idea. I'm using the aligned fasta format, which I assume is quite standard. It would be really great to have a literal example of how I could do this. There is no guarantee that the sequences are in the same order.
The other solution should work if you have the same sequences. I am not using SeqIO to write the alignment because the seq1.seq2 format is unusual. Here is my code, which requires BioPython to be installed:
from Bio.SeqIO import index

# Indexing works with large files
dict1, dict2 = index("file1.fasta", "fasta"), index("file2.fasta", "fasta")
elmt1, elmt2 = set(dict1), set(dict2)  # Testing you have the same elements
if elmt1 - elmt2:
    print("Some elements in file 1 are not in 2:", " ".join(elmt1 - elmt2))
if elmt2 - elmt1:
    print("Some elements in file 2 are not in 1:", " ".join(elmt2 - elmt1))
with open('mergedFile.fasta', 'w') as fileOut:
    for seqName in elmt1 & elmt2:  # I assume you only want common elements
        fastaFormatString = ">%s\n%s\n" % (seqName, dict1[seqName].seq + "." + dict2[seqName].seq)
        fileOut.write(fastaFormatString)
Thanks for this reply! A bit quirky in case the sequences are not in order, but I could accept this as an answer.
Sequence order doesn't matter because of sort -k1,1. However, as I wrote, linebreaks in sequences are absolutely not allowed.
Yep, the sorting step is a good idea. Does this work if the sequence names are not exactly the same (say BEAR is not in file 2)?
Default behavior would then be to omit BEAR. In this case, adding -a 1 to the join command would include unpairable seqs from file1.
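To make the difference between the default join and join -a 1 concrete, here is a minimal Python sketch of the same pairing logic on two small name-to-sequence dictionaries. The function name join_by_name, the keep_unpaired_from_1 flag, the "." separator and the toy sequences are all made up for illustration; this is not the shell pipeline itself, only a model of which names end up in the output.

```python
def join_by_name(seqs1, seqs2, keep_unpaired_from_1=False, sep="."):
    """Pair sequences by name, roughly like `join` on sorted input.

    With keep_unpaired_from_1=True, names found only in seqs1 are kept
    with their own sequence alone (the analogue of `join -a 1`).
    """
    merged = {}
    for name in sorted(seqs1):  # sorting plays the role of sort -k1,1
        if name in seqs2:
            merged[name] = seqs1[name] + sep + seqs2[name]
        elif keep_unpaired_from_1:
            merged[name] = seqs1[name]  # nothing to append from file 2
    return merged


# Hypothetical toy data: BEAR is missing from the second file.
file1 = {"CAT": "AC-T", "BEAR": "ACGT"}
file2 = {"CAT": "GG-A"}

default_result = join_by_name(file1, file2)
keep_result = join_by_name(file1, file2, keep_unpaired_from_1=True)
print(default_result)
print(keep_result)
```

Note that an unpaired BEAR comes out shorter than the other merged sequences, so the result would no longer be a valid alignment unless you pad it with gaps yourself.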
You can replace cat file1/2 with … to deal with linebreaks.
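If you would rather handle the linebreaks in Python, a minimal sketch could look like the one below. It is not the shell command referred to above, just one possible way to put every sequence on a single line before pairing; the helper name unwrap_fasta and the toy records are assumptions made for the example.

```python
def unwrap_fasta(lines):
    """Yield (name, sequence) pairs with each wrapped sequence joined onto one line."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # finish the previous record
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # one more piece of the current sequence
    if name is not None:
        yield name, "".join(chunks)  # last record in the input


# Hypothetical wrapped aligned-fasta input
wrapped = """>BEAR
ACGT--AC
GTTA
>CAT
AC-TTTAC
GT-A
"""

records = list(unwrap_fasta(wrapped.splitlines()))
for name, seq in records:
    print(f">{name}\n{seq}")
```

For real files you could pass an open file handle instead of wrapped.splitlines(), since the function only iterates over lines.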