I'm sorry to tell you that I don't know Python, Perl or Awk well enough to provide an answer. But, I have a working one written in R. You can find it with my comments, so you can understand how it works and, maybe, translate it into Python. Consider that I have three files called fasta_one, fasta_two and refList.
I'm assuming that the fasta sequences are all the same length and that the first row of each fasta file is a reference, and the second row, i.e. [2, 1] means [second row, first column], being a valid fasta sequence which I'm using to compute the length.
refList is tab separated, i.e. it has two columns for those lines that contains two correct references.
Also, notice that I didn't add a condition that tells you if no references are found. Hence, if the files are not compiled accurately, this code may results in errors!
This is the code:
#read files
#stringsAsFactors is set to False, so that R consider the strings as actual strings, the field separator is Tab, and no header is provided in the files!
fastaOne <- read.csv("fasta_one", header=F, sep="\t", stringsAsFactors=F)
fastaTwo <- read.csv("fasta_two", header=F, sep="\t", stringsAsFactors=F)
refList <- read.csv("refList", header=F, sep="\t", stringsAsFactors=F)
#get length of first available fasta sequence, assuming the fasta files are consistent (see comment above about the fasta files)
seqLength <- nchar(fastaOne[2, 1])
#now loop through the refList
for (line in 1:nrow(refList)) {
#test wether the first line contains actual references
if (refList[line, 1] != "" && refList[line, 2] != "") {
#if yes, locate it in fastaOne
fOne <- which(fastaOne==paste(">", refList[line, 1], sep=""))
fTwo <- which(fastaTwo==paste(">", refList[line, 2], sep=""))
print(paste(refList[line, 1], refList[line, 2], sep=" "))
print(paste(fastaOne[fOne+1, 1], fastaTwo[fTwo+1, 1], sep=""))
}
#otherwise the first contains a meaningful reference and the second is empty
else if (refList[line, 1] != "" && refList[line, 2] == ""){
#if yes, locate them in fastaOne and fastaTwo
fOne <- which(fastaOne==paste(">", refList[line, 1], sep=""))
#then print the refList
print(paste(refList[line, 1], refList[line, 2], sep=" "))
#and paste it with dashed-line
print(paste(fastaOne[fOne+1, 1], paste(rep("-", seqLength), collapse=""), sep=""))
}
#or the first is empty and the second contains a meaningful reference
else if (refList[line, 1] == "" && refList[line, 2] != ""){
#if yes, locate it in fastaTwo
fTwo <- which(fastaTwo==paste(">", refList[line, 2], sep=""))
#then print the refList
print(paste(refList[line, 1], refList[line, 2], sep=" "))
#and paste it with dashed-line
print(paste(paste(rep("-", seqLength), collapse=""), fastaTwo[fTwo+1, 1], sep=""))
}
}
Example output with a wrong reference:
[1] "Apple_1-1.1 Fruit_2.2-2"
[1] "AAAAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGGGTTTTTTTTTTT"
[1] "Tiger-A_a.a Animal-Xya.Za-1"
[1] "ATATATATATATATATATATATATGCGCGCGCGCGCGCGCGCGCGCGC"
[1] "Lemon "
[1] "AAAAAAAAAAAAAAAAAAAAAAAA------------------------"
[1] " Tree"
[1] "------------------------TTTTTTTTTTTTTTTTTTTTTTTT"
[1] "wrongRef "
[1] "------------------------"
how do you know how many spaces are required after the Tree sequence? Or before the Lemon sequence?
Also, you did not mention which tool, i.e. R, python, you would like to use!
The input files are trimmed alignments, so the length is defined. As for the tool, python, perl or awk would be perfect.
What components of the headers differentiate/group the sequences?
e.g. will the Tiger sequence always begin "
Tiger-A
"? Or is itTiger-A
different fromTiger-something else
?Yes, the names differ a lot, so it needs to work with the reference list.
Yes, but what 'connects' the headers? Being on the same line in the key file?
No, not the same line. The headers are "connected" by identical names in the reference file.