I haven't tested this extensively, but I think it works correctly. I've made a few assumptions about the data, and assumed there won't be edge cases, based on the answers above. YMMV, so I would test it thoroughly to make sure it matches expectations before relying on it.
import argparse


def parsecls(filename):
    """Parse a .cls file into a dict of {header: [members]}."""
    with open(filename, "r") as fh:
        content = fh.read()
    content = [item.rstrip("\n") for item in content.split(">")]
    # content[1:] because splitting on ">" leaves an empty string before the first header
    d = {key: value.split(" ") for (key, value) in [item.split("\n") for item in content[1:]]}
    return d


def intersect(a, b):
    # Keep the entries of a whose header also appears in b (the values come from a)
    return {k: a[k] for k in a.keys() & b.keys()}


def main():
    parser = argparse.ArgumentParser(
        description="Comparing clusters between files."
    )
    parser.add_argument(
        "--reference",
        "-r",
        action="store",
        required=True,
        help="Comparison file 1.",
    )
    parser.add_argument(
        "--query",
        "-q",
        action="store",
        required=True,
        help="Comparison file 2.",
    )
    # argparse prints a usage message and exits on bad options, so no try/except is needed
    args = parser.parse_args()

    # Read in files, create dicts of clusters
    refclusters = parsecls(args.reference)
    print("Ref Clusters")
    for k, v in refclusters.items():
        print(k, v)

    queryclusters = parsecls(args.query)
    print("Query Clusters")
    for k, v in queryclusters.items():
        print(k, v)

    # Perform comparisons (a dict intersection)
    print("Common Clusters")
    print(intersect(refclusters, queryclusters))


if __name__ == "__main__":
    main()
I've taken an 'all-vs-reference' approach here. The script performs a 1-to-1 comparison of the reference file against one query file and returns the clusters whose headers appear in both files, along with their entries from the reference file. As written it matches on the header line only and does not compare the members, so it won't yet cope with members that differ or move around within a cluster. I believe that can be solved by adding some extra filtering and sorting steps, but it works on the test data as far as I can tell.
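As a rough sketch of that extra filtering step, the intersection could also require that the members agree regardless of order. The function name intersect_by_members and the toy dicts below are made up for illustration; they assume the same {header: [members]} shape that parsecls returns.

```python
def intersect_by_members(ref, query):
    """Keep reference clusters whose header is in both dicts AND whose members match as sets."""
    common = {}
    for header in ref.keys() & query.keys():
        # Comparing sets ignores the order of members within a cluster
        if set(ref[header]) == set(query[header]):
            common[header] = sorted(ref[header])
    return common


# Hypothetical data: CL1 has the same members in a different order, CL2 differs
ref = {"CL1 2": ["ler_1", "lcu_2", "ler_3"], "CL2 3": ["lcu_6", "ler_4", "lcu_8"]}
query = {"CL1 2": ["ler_3", "ler_1", "lcu_2"], "CL2 3": ["ler_6", "ler_4", "ler_8"]}
print(intersect_by_members(ref, query))
```

Using sets assumes each member appears only once per cluster; if duplicates mattered, comparing sorted lists would be the safer choice.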
It shouldn't be too horrendous from a memory perspective, but it does hold the whole file in memory while it's parsed. At 40 MB or so that's probably OK, but if it becomes a problem, some tweaks can be made to use more iteration instead.
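If memory did become a problem, one possible tweak is a generator that walks the file line by line instead of reading it all at once. This is only a sketch: iter_clusters is a hypothetical name, and it assumes every header line is followed by exactly one line of members.

```python
def iter_clusters(lines):
    """Yield (header, members) pairs one cluster at a time from an iterable of lines."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
        elif header is not None:
            yield header, line.split()
            header = None


# A list of lines stands in for an open file handle here
sample = [">CL1 2\n", "ler_1 lcu_2 ler_3\n", ">CL2 3\n", "lcu_6 ler_4 lcu_8\n"]
for header, members in iter_clusters(sample):
    print(header, members)
```

With a real file, an open file handle can be passed in directly, since iterating over it yields one line at a time. Collecting the results into a dict still keeps every cluster in memory, so the saving is mainly in not holding the raw text alongside the parsed data.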
Use: python scriptname.py --reference MainFile.cls --query FileN.cls
(short args are also supported: -r and -q)
To do the full comparison, wrap in a bash loop:
for file in /path/to/files/File*.cls ; do
    python scriptname.py -r MainFile.cls -q "$file"
done
Example:
MainFile.cls:
>CL1 2
ler_1 lcu_2 ler_3
>CL2 3
lcu_6 ler_4 lcu_8
>CL3 3
lcu_6 lcu_4 lcu_8
FileB.cls:
>CL1 2
ler_1 ler_2 ler_3
>CL2 3
ler_6 ler_4 ler_8
>CL4 8
ler_10 ler_0 ler_2
Run:
$ python cluster_comparison.py -r TestData/MainFile.cls -q TestData/FileB.cls
Ref Clusters
CL1 2 ['ler_1', 'lcu_2', 'ler_3']
CL2 3 ['lcu_6', 'ler_4', 'lcu_8']
CL3 3 ['lcu_6', 'lcu_4', 'lcu_8']
Query Clusters
CL1 2 ['ler_1', 'ler_2', 'ler_3']
CL2 3 ['ler_6', 'ler_4', 'ler_8']
CL4 8 ['ler_10', 'ler_0', 'ler_2']
Common Clusters
{'CL1 2': ['ler_1', 'lcu_2', 'ler_3'], 'CL2 3': ['lcu_6', 'ler_4', 'lcu_8']}
What does "friendly" output look like in this case? Are you purely looking for clusters that are in common between the "Main file" and each of the 5(?) other files? Or do you need a pairwise 6-way comparison?
I would need to check the output manually; the files are huge, so I can't even open the graphical results.
I need to check, for example, whether CL2 from File A is the same as CL3 from the main file, and likewise for the other six files.
So to make sure I understand the problem, there is no correspondence between cluster names in each of the files?
As such, you need to identify which lines are in common and then find the headers which belong to those clusters in each file?
A few more questions:
Are the lcu_# strings in each cluster always in the same order in each file, or can they occur in any order within the cluster?
Are the clusters always on a single line or can they wrap like a normal fasta?
They have the same cluster name but no correspondence for those names.
As such, you need to identify which lines are in common and then find the headers which belong to those clusters in each file? Yes, I need to find the lines that are in common and also know which name or cluster they belong to in all the files against the Main one.
Are the lcu_# strings in each cluster always in the same order in each file, or can they occur in any order within the cluster? They can occur in any order.
Are the clusters always on a single line or can they wrap like a normal fasta? Always each cluster in one line.
Thanks a lot for trying to help!!
A few more Qs:
Will each lcu_# only appear once per cluster? If not, should a cluster be considered different if it contains only the same 'members' as another, but in different numbers?
To clarify, you are only interested in comparing files 1-5 against the "Main" file, not against one another?
Yes, only one time per cluster.
Actually, it doesn't matter. Some clusters should be the same in all the files, but in the end, I will work only with the main file. I could do an all-against-all comparison or 5 against the main one. The easier way is better.
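Given the clarifications above (names don't correspond between files, members can be in any order, each member appears once per cluster, one line per cluster), a possible way to pair clusters up by content rather than by name is to index the main file by member set. This is an untested-on-real-data sketch; index_by_members, match_clusters and the toy dicts are hypothetical, and if two clusters in the same file had identical members only the last one would be kept in the index.

```python
def index_by_members(clusters):
    """Map each cluster's member set to its header."""
    return {frozenset(members): header for header, members in clusters.items()}


def match_clusters(main, other):
    """Return (main_header, other_header) pairs for clusters with identical member sets."""
    main_index = index_by_members(main)
    pairs = []
    for header, members in other.items():
        key = frozenset(members)  # order-independent, hashable key
        if key in main_index:
            pairs.append((main_index[key], header))
    return pairs


# Hypothetical data: CL2 in file A has the same members as CL3 in the main file
main = {"CL3 3": ["lcu_6", "lcu_4", "lcu_8"], "CL1 3": ["ler_1", "lcu_2", "ler_3"]}
file_a = {"CL2 3": ["lcu_8", "lcu_6", "lcu_4"], "CL9 3": ["ler_7", "ler_8", "ler_9"]}
print(match_clusters(main, file_a))
```

Each pair gives the header in the main file and the matching header in the other file, so running this once per query file would cover the 5-against-main comparison.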