Question

Script or tool to compare more than two .cls files in R or Python

2.4 years ago

Alex S ▴ 20

Hello,

I have six .cls files structured as a .fasta file as follows:

File A

>CL1 2
lcu_1 lcu_2 lcu_3 ...
>CL2 3
lcu_6 lcu_4 lcu_8..

File B

>CL1 2
ler_1 ler_2 ler_3 ...
>CL2 3
ler_6 ler_4 ler_8..

Main File

>CL1 2
ler_1 lcu_2 ler_3 ...
>CL2 3
lcu_6 ler_4 lcu_8...
>CL3 3
lcu_6 lcu_4 lcu_8..

I want to compare the Main file, line by line, with the other six files to highlight their similarities. Does anyone know a tool or script to do that? The files are huge, around 43 MB.

I've tried

> diff

from R, but the output is not very friendly given the size of the files I'm handling.

Thanks

R Python • 1.4k views

ADD COMMENT • link updated 2.4 years ago by Joe 21k • written 2.4 years ago by Alex S ▴ 20

What does "friendly" output look like in this case? Are you purely looking for clusters that are in common between the "Main file" and each of the 5(?) other files? Or do you need a pairwise 6-way comparison?

ADD REPLY • link 2.4 years ago by Joe 21k

I would need to check the output manually; since the files are huge, I can't even open the graphical results.

Ideally, I need to check, for example, whether CL2 from File A is the same as CL3 from the Main file, and likewise for the other files.

ADD REPLY • link 2.4 years ago by Alex S ▴ 20

So to make sure I understand the problem, there is no correspondence between cluster names in each of the files?

As such, you need to identify which lines are in common and then find the headers which belong to those clusters in each file?

A few more questions:

Are the lcu_# strings in each cluster always in the same order in each file, or can they occur in any order within the cluster?

Are the clusters always on a single line or can they wrap like a normal fasta?

ADD REPLY • link 2.4 years ago by Joe 21k

The files use the same cluster names, but there is no correspondence between those names across files.

"As such, you need to identify which lines are in common and then find the headers which belong to those clusters in each file?" Yes, I need to find the lines that are in common and also know which cluster name they belong to in each file, compared against the Main one.

"Are the lcu_# strings in each cluster always in the same order in each file, or can they occur in any order within the cluster?" They can occur in any order.

"Are the clusters always on a single line or can they wrap like a normal fasta?" Each cluster is always on one line.

Thanks a lot for trying to help!!

ADD REPLY • link 2.4 years ago by Alex S ▴ 20

Few more Qs:

Will each lcu_# only appear once per cluster? If not, should a cluster be considered different if it contains only the same 'members' as another, but in different numbers?

To clarify, you are only interested in comparing files 1-5 against the "Main" file, not against one another?

ADD REPLY • link 2.4 years ago by Joe 21k

Yes, only one time per cluster.

Actually, it doesn't matter. Some clusters should be the same in all the files, but in the end, I will work only with the main file. I could do an all against all, or the five against the main one. Whichever is easier is better.

ADD REPLY • link 2.4 years ago by Alex S ▴ 20

score 1 · Answer 1 · 2022-07-12

I haven't tested this extensively, but I think it is working correctly. Based on the answers above, I've made a few assumptions about the data and assumed there won't be edge cases, but YMMV; I would test it thoroughly to make sure it's behaving as expected before relying on it.

import argparse


def parsecls(filename):
    """Parse a .cls file into a dict of {header: [members]}."""
    with open(filename, "r") as fh:
        content = fh.read()

    clusters = {}
    # [1:] because splitting on ">" leaves an empty string before the first header
    for record in content.split(">")[1:]:
        lines = record.strip().split("\n")
        # Each cluster sits on one line; split() tolerates repeated or trailing spaces
        members = lines[1].split() if len(lines) > 1 else []
        clusters[lines[0].strip()] = members

    return clusters


def intersect(a, b):
    # Keep entries of a whose key (header line) is also a key of b, in a's order
    return {k: a[k] for k in a if k in b}


def main():
    parser = argparse.ArgumentParser(
        description="Comparing clusters between files."
    )
    parser.add_argument(
        "--reference",
        "-r",
        action="store",
        required=True,
        help="Comparison file 1.",
    )
    parser.add_argument(
        "--query",
        "-q",
        action="store",
        required=True,
        help="Comparison file 2.",
    )

    # argparse prints a usage message and exits by itself if the options are invalid
    args = parser.parse_args()

    # Read in files, create dicts of clusters keyed by header line
    refclusters = parsecls(args.reference)
    print("Ref Clusters")
    for k, v in refclusters.items():
        print(k, v)
    queryclusters = parsecls(args.query)
    print("Query Clusters")
    for k, v in queryclusters.items():
        print(k, v)

    # Perform comparisons (a dict intersection on the header keys)
    print("Common Clusters")
    print(intersect(refclusters, queryclusters))


if __name__ == "__main__":
    main()

I've taken the strategy of doing an 'all-vs-reference' approach here. The script performs a 1-to-1 comparison and returns the clusters of the reference file whose header also appears in the query file, together with their entries. Note that, as written, the dict intersection matches on the header line (cluster name plus count), not on the values in the cluster, so it does not yet detect identical clusters that carry different names, or members that move around within a cluster. I believe that can be solved by adding some extra sorting and filtering steps (for example, comparing the members as sets), but the script runs on the test data as far as I can tell.
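Since the comments above establish that members can occur in any order, appear once per cluster, and that cluster names do not correspond between files, one way to match on content is to key each cluster by a frozenset of its members. This is only a minimal sketch under those assumptions: the function names `parse_cls_text`, `index_by_members` and `match_by_content` are hypothetical and not part of the script above, and the File A text is made-up illustrative data.

```python
def parse_cls_text(text):
    """Parse .cls-formatted text into {header: frozenset(members)}."""
    clusters = {}
    for record in text.split(">")[1:]:  # the chunk before the first ">" is empty
        lines = record.strip().split("\n")
        members = lines[1].split() if len(lines) > 1 else []
        clusters[lines[0].strip()] = frozenset(members)
    return clusters


def index_by_members(clusters):
    """Invert {header: members} into {members: [headers]}."""
    index = {}
    for header, members in clusters.items():
        index.setdefault(members, []).append(header)
    return index


def match_by_content(reference, query):
    """Return (reference header, query header) pairs with identical member sets."""
    query_index = index_by_members(query)
    pairs = []
    for ref_header, members in reference.items():
        for query_header in query_index.get(members, []):
            pairs.append((ref_header, query_header))
    return pairs


# Illustrative data only: the Main file from the example, and a made-up File A
main_text = ">CL1 2\nler_1 lcu_2 ler_3\n>CL2 3\nlcu_6 ler_4 lcu_8\n>CL3 3\nlcu_6 lcu_4 lcu_8\n"
file_a_text = ">CL1 2\nlcu_1 lcu_2 lcu_3\n>CL2 3\nlcu_8 lcu_6 lcu_4\n"

pairs = match_by_content(parse_cls_text(main_text), parse_cls_text(file_a_text))
print(pairs)  # [('CL3 3', 'CL2 3')]
```

Because a frozenset ignores order and is hashable, the lookup per reference cluster is a single dict access, so CL3 in the Main text is paired with CL2 in the File A text even though the names and member order differ.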

It shouldn't be too horrendous from a memory perspective, but it does hold each whole file in memory while parsing it. At 40 MB or so that's probably OK, but if it becomes a problem, some tweaks can be made to iterate over the file line by line instead.
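If memory does become an issue, the iteration idea might look something like the generator below, which reads one line at a time instead of the whole file. It is a sketch, assuming each cluster is a header line followed by at most one member line; `iter_clusters` is a hypothetical name, and `io.StringIO` merely stands in for an opened .cls file so the snippet runs on its own.

```python
import io


def iter_clusters(handle):
    """Yield (header, members) pairs one cluster at a time from an open file."""
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, []  # the previous header had no member line
            header = line[1:]
        elif header is not None:
            yield header, line.split()
            header = None
    if header is not None:
        yield header, []  # the file ended right after a header


# io.StringIO stands in for open("MainFile.cls") here
sample = io.StringIO(">CL1 2\nler_1 lcu_2 ler_3\n>CL2 3\nlcu_6 ler_4 lcu_8\n")
records = list(iter_clusters(sample))
print(records)
# [('CL1 2', ['ler_1', 'lcu_2', 'ler_3']), ('CL2 3', ['lcu_6', 'ler_4', 'lcu_8'])]
```

Only the clusters that are actually stored (for instance in a dict for the reference file) stay in memory; the query file can be streamed and checked cluster by cluster.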

Use: python scriptname.py --reference MainFile.cls --query FileN.cls (short args also supported -r|-q)

To do the full comparison, wrap in a bash loop:

for file in /path/to/files/File*.cls ; do 
    python scriptname.py -r MainFile.cls -q "$file"
done
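As an alternative to the shell loop, the same all-vs-reference pass could be done inside one Python process so that the reference file is read only once. The sketch below is an assumption-laden illustration rather than part of the script: `read_headers` and `compare_all` are hypothetical names, it matches on header lines only (as the script above does), and the temporary files exist purely to make the demo self-contained.

```python
import os
import tempfile


def read_headers(path):
    """Collect the header lines (without the leading '>') of a .cls file."""
    with open(path) as fh:
        return {line[1:].strip() for line in fh if line.startswith(">")}


def compare_all(reference_path, query_paths):
    """Return {query path: sorted headers shared with the reference}."""
    ref_headers = read_headers(reference_path)  # read once, reused for every query
    return {q: sorted(ref_headers & read_headers(q)) for q in query_paths}


# Demo on throwaway files; in practice the query list could come from
# glob.glob("/path/to/files/File*.cls")
with tempfile.TemporaryDirectory() as tmp:
    main_path = os.path.join(tmp, "MainFile.cls")
    b_path = os.path.join(tmp, "FileB.cls")
    with open(main_path, "w") as fh:
        fh.write(">CL1 2\nler_1 lcu_2 ler_3\n>CL2 3\nlcu_6 ler_4 lcu_8\n>CL3 3\nlcu_6 lcu_4 lcu_8\n")
    with open(b_path, "w") as fh:
        fh.write(">CL1 2\nler_1 ler_2 ler_3\n>CL2 3\nler_6 ler_4 ler_8\n>CL4 8\nler_10 ler_0 ler_2\n")
    shared = compare_all(main_path, [b_path])
    result = list(shared.values())

print(result)  # [['CL1 2', 'CL2 3']]
```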

Example:

MainFile.cls:

>CL1 2
ler_1 lcu_2 ler_3
>CL2 3
lcu_6 ler_4 lcu_8
>CL3 3
lcu_6 lcu_4 lcu_8

FileB.cls:

>CL1 2
ler_1 ler_2 ler_3
>CL2 3
ler_6 ler_4 ler_8
>CL4 8
ler_10 ler_0 ler_2

Run:

$ python cluster_comparison.py -r TestData/MainFile.cls -q TestData/FileB.cls
Ref Clusters
CL1 2 ['ler_1', 'lcu_2', 'ler_3']
CL2 3 ['lcu_6', 'ler_4', 'lcu_8']
CL3 3 ['lcu_6', 'lcu_4', 'lcu_8']
Query Clusters
CL1 2 ['ler_1', 'ler_2', 'ler_3']
CL2 3 ['ler_6', 'ler_4', 'ler_8']
CL4 8 ['ler_10', 'ler_0', 'ler_2']
Common Clusters
{'CL1 2': ['ler_1', 'lcu_2', 'ler_3'], 'CL2 3': ['lcu_6', 'ler_4', 'lcu_8']}