Question

Matching IDs between 3+ files and specifying output using dictionaries in Python

23 months ago

jen ▴ 10

Hello all,

I have a script that is supposed to read a file, 'filecontig', take all the sequence IDs in that file, match them to the IDs in the files 'filetaxa' and 'fileTPM', and output the taxonomic classification and the transcripts per million (TPM) for each ID. I have almost achieved this. I get the correct taxonomic output for all 6621 IDs in the filecontig file, but I have not been able to match the TPM values. Every TPM in my output is 5.06, which is certainly incorrect, since the TPM values differ significantly from ID to ID. I would greatly appreciate any help with this code, as I've tried everything I can think of.

What the fileTPM file being read looks like: [screenshot of the fileTPM file]

The output of the Python script below (it correctly matches the TRINITY IDs and taxonomic classifications, but not the TPM values): [screenshot of the output from running the code below]

dictionary = {} #defining the dictionary
dictionary2 = {}
outfile=open('con_taxa_fun_tpm','w')
filetaxa=open('FLRNA.taxonomy','r')
fileTPM=open('/nas2/blue_crab_chunglab/mingli_data/RNA/RSEMFloridaRNA/RSEM.isoforms.results','r')
filecontig=open('summedIDsfinal', 'r')

for line in filetaxa:
        line_lyst = line.split('\t') #splitting on the tab
        taxa = line_lyst[1] #use the second part
        referenceA = line_lyst[0] #because line_lyst[0] of the file matches a region of the other files
        dictionary[referenceA] = taxa

for line in fileTPM:
        if line.startswith('T'):
                line_list = line.split('\t')
                TPM = line_list[5]
                referenceC = line_list[0]
                dictionary2[referenceC] = TPM
                if referenceC in dictionary and dictionary2:
                        TPM = dictionary2[referenceC]

for line in filecontig:
        if line.startswith('T'):
                name_list = line.split('\n')
                query = name_list[0]
                referenceB = name_list[0]
                if referenceB in dictionary:
                        new = dictionary[referenceB]

                        outfile.write(query+'   '+new+' '+TPM+'\n')

filetaxa.close() #close the taxonomy file bc we are done using it
filecontig.close()
fileTPM.close()

So ultimately, I just want code that outputs the IDs (TRINITY_*) from the filecontig file, the matching taxonomic classification from the filetaxa file, and the matching TPM from the fileTPM file, all in one file that is as well organized as possible. I would also like to get rid of the odd spacing after the taxonomic classification shown in the second screenshot (but this isn't my priority right now).

Also, for now I am working with 1:1:1 matches, but I want to add a column for an additional measurement in the future. I know that this extra file won't have a match for every contig ID in the filecontig file. Are there any tricks for telling Python to move on to the next contig ID, or to insert 'null' (or something along those lines), when there is no match between the filecontig file and another file that I read in? I would still want the ID, classification, and TPM to be output in that case.

I am new to Python, and any help is appreciated. Thanks so much.


Answer

Basically, your goal is to join two tables (taxonomy and TPM) on a shared key, the sequence ID, and keep only the rows whose IDs appear in a third table (the contig list).

Of course, the pandas library can handle these tasks very easily.
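As a rough illustration of the pandas route, here is a minimal sketch. It runs on tiny made-up data held in memory rather than your real files, and it assumes the TPM file has a header row with `transcript_id` and `TPM` columns and that the taxonomy file has no header. The function name `join_tables` and the sample values are hypothetical.

```python
import io
import pandas as pd

# Made-up stand-ins for the three files (illustration only, not real data).
taxa_text = "TRINITY_A\tBacteria\nTRINITY_B\tEukaryota\n"
tpm_text = (
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "TRINITY_A\tgA\t300\t150.0\t10.0\t5.06\t4.10\t100.00\n"
    "TRINITY_C\tgC\t500\t350.0\t20.0\t12.30\t9.90\t100.00\n"
)
contig_text = "TRINITY_A\nTRINITY_B\nTRINITY_C\n"


def join_tables(contig_src, taxa_src, tpm_src):
    """Left-join taxonomy and TPM onto the contig IDs (hypothetical helper)."""
    # dtype=str keeps every value exactly as written in the file.
    contigs = pd.read_csv(contig_src, header=None, names=["id"], dtype=str)
    taxa = pd.read_csv(taxa_src, sep="\t", header=None,
                       usecols=[0, 1], names=["id", "taxa"], dtype=str)
    tpm = pd.read_csv(tpm_src, sep="\t", usecols=["transcript_id", "TPM"],
                      dtype=str).rename(columns={"transcript_id": "id"})
    # how="left" keeps every contig ID, even when a table has no match for it.
    merged = contigs.merge(taxa, on="id", how="left").merge(tpm, on="id", how="left")
    # Unmatched cells come back as missing values; show them as 'null'.
    return merged.fillna("null")


result = join_tables(io.StringIO(contig_text), io.StringIO(taxa_text),
                     io.StringIO(tpm_text))
print(result.to_csv(sep="\t", index=False))
```

With real data you would pass the file paths instead of the `io.StringIO` objects. The left joins are what keep contig IDs that have no match in one of the other files.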

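Sticking with your dictionary approach, though, something like the following may work (I haven't tested it). The key change is that the TPM is looked up per ID in dictionary2 at the moment each line is written. In your version, the variable TPM still holds whatever was assigned last in the fileTPM loop, so every output row gets the same number.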

dictionary = {}   # ID -> taxonomic classification
dictionary2 = {}  # ID -> TPM
filetaxa = open('FLRNA.taxonomy', 'r')
fileTPM = open('/nas2/blue_crab_chunglab/mingli_data/RNA/RSEMFloridaRNA/RSEM.isoforms.results', 'r')
filecontig = open('summedIDsfinal', 'r')
outfile = open('con_taxa_fun_tpm', 'w')

for line in filetaxa:
    line_lyst = line.split('\t')  # splitting on the tab
    if len(line_lyst) < 2:
        continue  # skip blank or malformed lines
    referenceA = line_lyst[0].strip()  # the ID shared with the other files
    taxa = line_lyst[1].strip()  # second column; strip() drops stray whitespace/newlines
    dictionary[referenceA] = taxa

for line in fileTPM:
    if line.startswith('T'):  # keep only the TRINITY_* rows
        line_list = line.split('\t')
        referenceC = line_list[0].strip()
        TPM = line_list[5].strip()  # sixth column holds the TPM
        dictionary2[referenceC] = TPM

for line in filecontig:
    if line.startswith('T'):
        query = line.strip()
        # .get() returns the default ("" here) when an ID has no match,
        # so every contig ID is still written; use "null" instead if you prefer
        outfile.write(query + '\t' + dictionary.get(query, "") + '\t' + dictionary2.get(query, "") + '\n')

filetaxa.close()
filecontig.close()
fileTPM.close()
outfile.close()  # make sure the output is flushed to disk
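On the question of missing matches and adding another column later: one possible way to generalize the idea is to keep a list of lookup dictionaries and fall back to 'null' whenever an ID is absent from one of them. The sketch below uses made-up in-memory lines instead of your files, and the helper names `load_column` and `build_rows` are hypothetical, not part of any library.

```python
def load_column(lines, value_index, key_index=0, sep="\t"):
    """Build an ID -> value dict from delimited lines (hypothetical helper)."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip lines too short to contain both the key and the value column.
        if len(fields) > max(key_index, value_index):
            table[fields[key_index].strip()] = fields[value_index].strip()
    return table


def build_rows(contig_ids, tables, missing="null"):
    """Return one tab-separated row per contig ID, filling gaps with `missing`."""
    rows = []
    for raw in contig_ids:
        query = raw.strip()
        if not query:
            continue  # ignore empty lines
        # dict.get() supplies the placeholder when the ID has no match.
        values = [table.get(query, missing) for table in tables]
        rows.append("\t".join([query] + values))
    return rows


# Made-up example data; with real files you would pass open file objects.
taxa = load_column(["TRINITY_A\tBacteria\n", "TRINITY_B\tEukaryota\n"], value_index=1)
tpm = load_column(["TRINITY_A\tgA\t300\t150\t10\t5.06\n"], value_index=5)
extra = load_column(["TRINITY_B\t0.8\n"], value_index=1)  # future extra measurement

rows = build_rows(["TRINITY_A\n", "TRINITY_B\n", "TRINITY_C\n"], [taxa, tpm, extra])
for row in rows:
    print(row)
```

Adding another measurement then only means loading one more dictionary and appending it to the list. A header line in an input file ends up as one extra dictionary key, which is harmless because it is never looked up.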