You mention needing/wanting to use bash, but if you tried something, you should show it along with any error messages.
Linearize your sequences:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < infile.fa > infile_single.fasta
Python:
#!/usr/bin/env python
import sys, itertools
def comp_seqs(a, b, i):
mismatch = []
mismatch.append(i)
one = list(a)
two = list(b)
for n in range(len(one)):
if one[n] != 'X' and two[n] != 'X':
if one[n] != '*' and two[n] != '*':
if one[n] != '-' and two[n] != '-':
if one[n] != two[n]:
mismatch.append(two[n] + str(n+1) + one[n])
return mismatch
with open(sys.argv[1], 'r') as f:
myseqs = []
for line in f:
if line.startswith(">"):
myseqs.append((line.strip().split('>')[1].split(' ')[0], next(f).strip()))
target = myseqs[-1]
myseqs.remove(target)
with open(sys.argv[1].split('.fasta')[0] + '_mm.txt', 'w') as out:
results = []
for i in myseqs:
vs = comp_seqs(i[1], target[1], i[0])
results.append(vs)
for x in itertools.izip_longest(*sorted(results), fillvalue=''):
print >> out, '\t'.join(str(i) for i in x)
Save as comp_seqs.py
, run as python comp_seqs.py in_file.fasta
. If you have lots of FASTA to compare, use a for loop in your terminal:
for file in *.fasta; do python comp_seqs.py $file; done
Thanks for the python solution. I'm glad we share warm feelings for that language. But I have some feedback/comments on your code, feel free to let me know if you disagree.
I'm pretty sure the code below can be improved:
What about something like:
I would expect this is quicker, too.
And this part would benefit from a) a list comprehension and b) biopython (although additional dependency)
You could do something like:
This synthax for writing to an output file is discouraged:
Instead, you should use:
@WouterDeCoster, thanks for your feedback. I broke the
if
statements for checking the characters in the sequences down so hopefully the OP might have an easier learning curve in understanding the code. Yournot
andand
statements are surely more pythonic. For this reason, I also didn't use biopython, so the OP can see how one can open a file, and read line by line. As far as writing to a file, I am using 2.7, is the syntax I used only discouraged in later versions, or across all of them? I understand to useout.write
for versions 3+.+1 for writing educational code, but you could consider supplying both "versions" to show "easy" and "good" code.
Hmm. Possible. I can't find a reference for this.
But in my not so humble opinion,
>>
looks damn ugly and unpythonic.But that's personal :-p
I would recommend writing python3 whenever possible, in the future python2 will get discontinued completely and all code containing old syntax will break. So since your code will be here on biostars forever, better write future-proof ;-)
Thank you so much for this, this seems like it would do what I need. However rather than using the A, B, C example names I should probably have clarified that each FASTA contains unique sequence names, e.g. the first two are always in the format
ZM[10-9999]M
ZM[10-9999]F
while the third is always Consensus_MF
Here is a link to a sample file.
How would I need to change the script to account for this?
I updated it to handle different ids, based on the header in the fasta file. This solution works if there is always only 3 sequences per file, and you are always comparing 1 vs 3, 2 vs 3. Your example file also in nucleotides, and contains '-'.
Apologies, seems I attached the wrong file. Here is the protein alignment. It still contains gaps (-) and stops (*) so I suppose these could be excluded along with X.
When running the script I got it to output TXT files but I encountered two issues:
the amino acid position numbers are all off by -1
the comparison seems to terminate after comparing only the first ~50 or so amino acids
e.g. for the attached protein alignment FASTA the result is
Updated to exclude '-', '*'. Python starts at 0, updated to show mismatch position starting from count 1. Your example file is multi-line fasta, use the above
awk
command to linearize each sequence.Thanks a bunch, works like a charm!
Welcome to bioinformatics.
I was wondering if you could help me with a modification to the script, if it's not too much trouble.
Initially I was using separate input files in the below format:
However rather than making individual files I would rather use one super-file, e.g.:
Example file here. This will hold many more sequences but the last sequence (XX) will always be the comparator.
Ideally it would then successively print the results of every comparison into one output file, e.g. with the results of:
Alternatively, is it possible to print this directly into an Excel spreadsheet where the result of each comparison occupies one column?
so you want to compare all the items in a fasta to the last item in the fasta? Do you want each comparison, one per row, as in your result example? Or a single row, with each comparison in a column?
First question, yes. Second question, all the results are ultimately to go into Excel as shown below so I'm looking for whatever format will most easily facilitate that.
Ok. I updated the above answer for this. Although, a moderator may tell you you should've added a new question, and referenced this post. I tested on your example fasta.
Thank you, I'll keep that in mind. I tested the script and it seems to output first all the values for the -M headers in a series of columns, followed by those for all the -F headers. Would it be possible for the matching headers to be paired next to each other as shown in the image? e.g. M/F, M/F
Ok, updated. The order the script ran was dependent on the order of the sequences in the input file. I added a sort, so the headers will be F/M..F/M for each one. F/M b/c alphabetical.
Thanks, runs as expected. Would it also be possible to make a modification where, if it detects X, - or * in a position then it will not print that position in the corresponding site for the other cognate header?
e.g. an input of
should print only
ZM0001F[C3V] being ignored because ZM0001M[3] = *
Since you stated this is for a bachelor's project, I think you should try to understand what each line of code is doing. Try writing pseudocode to help. At least you will learn something and know how it works, if you intend to use this analysis in your project.
For now, comment out out the
'if'
statements (placing a octothorp (#
) at the beginning of each line) in thecomp_seqs
function, so that all mismatches are outputted. You can then manually filter in Excel.I removed the
if
statements and as expected all mismatches are printed, but I am not sure how the output can be filtered in Excel in order to accomplish the aforementioned. Or at least, it would require some inventive syntax. I had a go at a pseudo-function below:From the output of the program, you know the position of each mismatch, and each mismatch is printed in order that it was seen. Knowing the position and comparing two columns shouldn't be too difficult.