PYTHON: Differences between gene files: Result unexpected when genes have duplicates
2.1 years ago
ciaki • 0

Hi everyone, I am a biologist (not a programmer) and am trying to write my own code to find differences between my data files.

File1.txt: Orange, orange, apple, pear

File2.txt: pear, Pear, Kiwi

Output.txt: -Orange -Orange -apple -pear +pear +Pear +Kiwi

In this case, lowercase "pear" is the only fruit common to both files, so the output shows both +pear and -pear. This is not very helpful, because I want to use this code for really long gene lists. Is there some way to filter out the common fruit and display them, for example, without a "+" or "-" in output.txt? Having to work through what has + and - in a very big list full of duplicates is impractical.

this is my code:


>     import difflib
>     
>     # Read both files as lists of lines.
>     with open('/Users/.../file1.txt') as file_1:
>         file_1_text = file_1.readlines()
>     with open('/Users/.../file2.txt') as file_2:
>         file_2_text = file_2.readlines()
>     
>     # Write the unified diff to output.txt and echo it to the screen.
>     # The with-block closes output.txt once the loop finishes.
>     with open('output.txt', 'w') as mfile:
>         for line in difflib.unified_diff(file_1_text, file_2_text,
>                                          fromfile='file1.txt',
>                                          tofile='/Users/.../file2.txt',
>                                          lineterm=''):
>             mfile.write("%s\n" % line)
>             print(line)
python • 1.2k views

What you are trying to do is commonly known as "set operations" in programming. That keyword should help you search for what you need; at first glance, this tutorial seems quite appropriate.
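As a minimal sketch of what set operations look like in Python, here are the toy fruit lists from the question hard-coded as Python lists (an assumption for illustration; real gene lists would be read from the files). The variable names are hypothetical. Items found in both lists are printed without a prefix, and items unique to one list keep the "-" or "+" prefix:

```python
# Toy lists copied from the example files in the question.
file1_items = ["Orange", "orange", "apple", "pear"]
file2_items = ["pear", "Pear", "Kiwi"]

# Converting to sets removes duplicates and enables set operations.
set1 = set(file1_items)
set2 = set(file2_items)

common = set1 & set2       # items present in both files
only_in_1 = set1 - set2    # items present only in file 1
only_in_2 = set2 - set1    # items present only in file 2

# Sets are unordered, so sort before printing for a stable result.
for item in sorted(common):
    print(item)
for item in sorted(only_in_1):
    print("-" + item)
for item in sorted(only_in_2):
    print("+" + item)
```

Note that the comparison is case-sensitive, so "Pear" and "pear" count as different items here.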


If using Python is not a requirement, an easy way to find the common genes between two files is to first convert your files by replacing the commas with newlines:

sed 's/, /\n/g' file1 > file1_out.txt 
sed 's/, /\n/g' file2 > file2_out.txt

And then find the common elements between these two new files using grep:

grep -wFf file1_out.txt file2_out.txt > common.txt
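
For readers who prefer to stay in Python, the same two steps can be sketched roughly as follows. This is an approximation rather than an exact equivalent of the commands above: it compares whole items for equality, whereas grep -w matches whole words within each line. The helper names parse_items and keep_common are hypothetical, and the file contents are hard-coded from the question so that the sketch runs on its own:

```python
def parse_items(text):
    """Split comma- or newline-separated text into stripped, non-empty items."""
    pieces = text.replace("\n", ",").split(",")
    return [piece.strip() for piece in pieces if piece.strip()]


def keep_common(items1, items2):
    """Return the items of items2 that also occur in items1.

    Order and duplicates from items2 are preserved.
    """
    wanted = set(items1)
    return [item for item in items2 if item in wanted]


# Contents of the example files from the question.
file1_text = "Orange, orange, apple, pear"
file2_text = "pear, Pear, Kiwi"

common = keep_common(parse_items(file1_text), parse_items(file2_text))
print("\n".join(common))
```

In real use, the two text strings would come from reading the files, and the result could be written to common.txt instead of printed.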

It is not that I have to use Python, but it is preferable, because once I fix it people in my lab will use it too!


I guess people in your lab could use bash just as they would use Python? I personally find bash and awk faster and simpler for straightforward file parsing problems, such as the current issue of finding common elements between two files.


Well, if your goal is not to learn Python, but to provide your lab with an easy way to intersect gene lists, then I would recommend a browser-based GUI approach.

Galaxy has a rudimentary intersection feature, but much nicer is Intervene (documentation), which can also create beautiful figures.

2.1 years ago
Joe 21k

You are basically looking for this:

https://stackoverflow.com/questions/9585218/python-find-common-text-in-two-files


The OP may want to look into the string methods .lower() and .upper() in the Python documentation. Combining those into the set building highlighted in that StackOverflow post can make the comparison case-insensitive. It may not matter with actual genes; however, with the included toy example it does. And sometimes it is best to build it in, to be sure you have eliminated the possibility of that issue arising.
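
A minimal sketch of that idea, assuming it is acceptable to fold every name to lowercase before comparing (worth checking for real gene symbols, where case can carry meaning). The helper name normalized_set is hypothetical, and the lists are the toy example from the question:

```python
def normalized_set(items):
    """Build a set of items with surrounding whitespace removed and case folded to lowercase."""
    return {item.strip().lower() for item in items}


file1_items = ["Orange", "orange", "apple", "pear"]
file2_items = ["pear", "Pear", "Kiwi"]

set1 = normalized_set(file1_items)
set2 = normalized_set(file2_items)

common = set1 & set2       # shared items, ignoring case
only_in_1 = set1 - set2    # only in file 1
only_in_2 = set2 - set1    # only in file 2

print("common:", sorted(common))
print("only in file 1:", sorted(only_in_1))
print("only in file 2:", sorted(only_in_2))
```

With the toy data, "Orange" and "orange" collapse into one entry, and "Pear" no longer appears as a separate item from "pear".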
