Concatenating text files based on common indices
19 months ago

I have two text files containing the abundances of genes in samples. One file covers a greater variety of genes than the other, so the two cannot simply be concatenated side by side. Instead, I'm trying to join only those lines of the larger file whose gene index also appears in the smaller file, such that:

  >df1
             Sample1  Sample2  Sample3
   Gene1     0.001    0.002    0.003
   Gene2     0.001    0.002    0.003
   Gene3     0.001    0.002    0.003

  >df2
             Sample4  Sample5  Sample6
   Gene1     0.001    0.002    0.003
   Gene1.1   0.001    0.002    0.003
   Gene2     0.001    0.002    0.003
   Gene2.1   0.001    0.002    0.003
   Gene3     0.001    0.002    0.003

  >df1and2
             Sample1  Sample2  Sample3  Sample4  Sample5  Sample6
   Gene1     0.001    0.002    0.003    0.001    0.002    0.003
   Gene2     0.001    0.002    0.003    0.001    0.002    0.003
   Gene3     0.001    0.002    0.003    0.001    0.002    0.003

Suggestions in Python or Bash are both welcome. Thank you!

Bash Python • 867 views

I've removed tags such as genetics, genes and bioinformatics. The last tag makes no sense - EVERY QUESTION here is related to bioinformatics.

19 months ago

In general this is called an inner join: only rows whose index appears in both tables are kept. It is easy to do with the pandas library in Python.

import pandas as pd

# df1 and df2 must already be DataFrames indexed by gene name
df1and2 = df1.merge(df2, how='inner', left_index=True, right_index=True)
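
To make the effect of the inner join concrete, here is a minimal sketch that rebuilds the two tables from the question in memory (the values are copied from the example, not real data) and merges them on the gene index:

```python
import pandas as pd

# Toy tables mirroring the example in the question; gene names form the index.
df1 = pd.DataFrame(
    {"Sample1": [0.001] * 3, "Sample2": [0.002] * 3, "Sample3": [0.003] * 3},
    index=["Gene1", "Gene2", "Gene3"],
)
df2 = pd.DataFrame(
    {"Sample4": [0.001] * 5, "Sample5": [0.002] * 5, "Sample6": [0.003] * 5},
    index=["Gene1", "Gene1.1", "Gene2", "Gene2.1", "Gene3"],
)

# Inner join on the index: only genes present in both tables are kept,
# so Gene1.1 and Gene2.1 are dropped.
df1and2 = df1.merge(df2, how="inner", left_index=True, right_index=True)
print(df1and2)
```

The result should have one row per shared gene and all six sample columns.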

Thanks for your response! However, when I tried it with my text files, I got this error: AttributeError: 'str' object has no attribute 'merge'. My files are in a very similar format to my example. What could be causing this problem?


'str' object has no attribute 'merge'

It appears you are trying to merge the file names (strings) rather than dataframes. You have to read those files into dataframes first, named df1 and df2, and then the merge should work.
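
A minimal sketch of reading the files first, assuming they are whitespace-delimited and that the header row lists only the sample names (one field fewer than the data rows), in which case pandas uses the first column as the index. The function names here are hypothetical helpers, not part of pandas:

```python
import pandas as pd


def read_abundance_table(path):
    """Read a whitespace-delimited abundance table into a DataFrame.

    Assumption: the header row has one field fewer than the data rows,
    so pandas takes the first column (gene names) as the index.
    """
    return pd.read_csv(path, sep=r"\s+")


def merge_on_shared_genes(path1, path2):
    """Inner-join two abundance tables on their gene index."""
    df1 = read_abundance_table(path1)
    df2 = read_abundance_table(path2)
    return df1.merge(df2, how="inner", left_index=True, right_index=True)
```

With placeholder file names, usage would look like merge_on_shared_genes("file1.txt", "file2.txt").to_csv("merged.txt", sep="\t"). If your header row does include a label for the gene column, passing index_col=0 to pd.read_csv is the usual adjustment.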


Thanks for your advice!
