Hello,
I have two tables that I obtained from an hmmsearch analysis. Table 1 lists homologues of protein 1 along with several fields of information, including the Start and Stop positions of the encoding gene. Table 2 lists homologues of protein 2 with the same fields.
The tables look something like this:
ID Source Nucleotide Accession Protein Protein Name Start Stop Strand Organism Strain Assembly
74488271 RefSeq NC_009933.1 WP_041661553.1 hypothetical protein 4265907 4267559 + Acaryochloris marina MBIC11017 MBIC11017 GCF_000018105.1
13866598 RefSeq NC_009927.1 WP_012167081.1 hypothetical protein 156877 157254 - Acaryochloris marina MBIC11017 MBIC11017 GCF_000018105.1
13867103 RefSeq NC_009928.1 WP_012167419.1 hypothetical protein 121712 122089 - Acaryochloris marina MBIC11017 MBIC11017 GCF_000018105.1
13865815 RefSeq NC_009925.1 WP_012166309.1 hypothetical protein 6255930 6256316 + Acaryochloris marina MBIC11017 MBIC11017 GCF_000018105.1
13867540 RefSeq NC_009930.1 WP_012167945.1 hypothetical protein 106295 106678 - Acaryochloris marina MBIC11017 MBIC11017 GCF_000018105.1
What I would like to do is compare row 1 of Table 2 with every row of Table 1. If the Nucleotide ID matches, compare the Stop position in Table 1 with the Start position in Table 2, and if the difference Start2 - Stop1 is < 50, write the whole row to a new table. In other words, I only want proteins in Table 2 that are directly downstream of proteins in Table 1, within the same genome. The same process should then be repeated for each remaining row.
I looked at different ways to do this in both Python (with pandas) and R (with GenomicRanges and data.table), but could not come up with a solution. Is this feasible at all?
Thanks
I don't understand where Table 1 finishes and Table 2 starts.
Please add a clear example input and a representative output.
You might be able to modify this code or modify your input and get the result that you need:
A: Best tool for finding Boundary Pairs
This might not be efficient if your tables are large, but you can store both of your tables in one table and use the first column as an indicator of which table each row comes from.
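Another option, if the two tables stay separate, is a join on the Nucleotide column followed by a filter on the gap. Below is a minimal pandas sketch of that idea, not a tested solution for the real data. The function name downstream_hits, the max_gap parameter and the toy tables (contigA, p1a, q1, ...) are made up for illustration. It assumes the gap must be non-negative (Start2 >= Stop1), and it ignores the Strand column, so genes on the minus strand would need the comparison reversed.

```python
import pandas as pd


def downstream_hits(table1, table2, max_gap=50):
    """Return rows of table2 whose Start lies less than max_gap bases
    after the Stop of at least one table1 row on the same Nucleotide."""
    # Only Nucleotide and Stop are needed from table 1.
    left = table1[["Nucleotide", "Stop"]].rename(columns={"Stop": "Stop1"})
    # Tag each table 2 row with its position so matches can be mapped back.
    right = table2.assign(_row=range(len(table2)))
    # Pair every table 2 row with every table 1 row on the same Nucleotide.
    pairs = right.merge(left, on="Nucleotide", how="inner")
    gap = pairs["Start"] - pairs["Stop1"]
    # Assumption: "directly downstream" means 0 <= Start2 - Stop1 < max_gap.
    hit_rows = sorted(pairs.loc[(gap >= 0) & (gap < max_gap), "_row"].unique())
    return table2.iloc[hit_rows]


# Toy data, invented for illustration (not taken from the tables above).
table1 = pd.DataFrame({
    "Nucleotide": ["contigA", "contigA", "contigB"],
    "Protein": ["p1a", "p1b", "p1c"],
    "Start": [100, 5000, 200],
    "Stop": [400, 5600, 800],
})
table2 = pd.DataFrame({
    "Nucleotide": ["contigA", "contigA", "contigB", "contigC"],
    "Protein": ["q1", "q2", "q3", "q4"],
    "Start": [420, 6000, 810, 420],
    "Stop": [900, 6500, 1200, 900],
})

result = downstream_hits(table1, table2)
print(result)
```

With the toy data, q1 (gap of 20 on contigA) and q3 (gap of 10 on contigB) are kept, while q2 (too far from either contigA gene) and q4 (no table 1 gene on contigC) are dropped. The join builds every same-Nucleotide pair, so it can get large if one sequence carries many hits in both tables.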