Technical question about python "to find the strings"
7.0 years ago
horsedog ▴ 60

Hi there, I have two files. File 1 looks like this:

NP_208181.1
NP_220259.1
NP_224629.1
WP_232131
WP_3432434
WP_2441241221

File 2 looks like this:

NP_208181.1,GCF_000008525.1
NP_212206.1,GCF_000008685.2
NP_213866.1,GCF_000008625.1
NP_219784.1,GCF_000008725.1
NP_220151.1,GCF_000008725.1
NP_220259.1,GCF_000008725.1
NP_224628.1,GCF_000008745.1
NP_224629.1,GCF_000008745.1
NP_224939.1,GCF_000008745.1

My goal is to find which IDs in file 1 are also in file 2. Here we can see that NP_208181.1, NP_220259.1 and NP_224629.1 can be found in file 2, each followed by a GCF accession. I wrote a small script like this:

import re
with open("file1") as ID, open("file2") as data:
  for line1, line2 in zip(ID,data):
    if line1 in line2:
      print(line1)

However, the result was blank, which does not make sense. Does anyone know why? How should I modify this script?

python • 1.9k views
ADD COMMENT

Without testing it, I think you're zipping the lines of the two files together, so it's only comparing line 1 of file 1 with line 1 of file 2, then line 2 with line 2, etc. You'll need 2 loops for this to work as you've got it, e.g.:

for line1 in ID:
    for line2 in data:
        if line1 in line2:

and so on..

I'd look into using Python's built-in any and all functions though; they may help here.

If you're not bothered about using Python specifically, you could do this in a single line (sort of) with grep:

while read -r ID; do grep -F "$ID" Data_file.txt; done < ID_file.txt
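
To make the any suggestion concrete, here is a minimal sketch. It assumes both files are small enough to hold in memory; the sample lists stand in for the two files, and the helper name ids_present is made up for illustration.

```python
# Sample lines standing in for file 1 and file 2 (newlines included,
# exactly as they would come from iterating over an open file).
id_lines = ["NP_208181.1\n", "NP_220259.1\n", "WP_232131\n"]
data_lines = [
    "NP_208181.1,GCF_000008525.1\n",
    "NP_212206.1,GCF_000008685.2\n",
    "NP_220259.1,GCF_000008725.1\n",
]


def ids_present(id_lines, data_lines):
    """Return the IDs that occur as a substring of at least one data line."""
    # Materialise the data lines once: a file object can only be iterated
    # a single time, so a nested loop over it sees nothing on the second pass.
    data = [line.strip() for line in data_lines]
    found = []
    for raw in id_lines:
        wanted = raw.strip()  # drop the trailing newline before comparing
        if wanted and any(wanted in line for line in data):
            found.append(wanted)
    return found


print(ids_present(id_lines, data_lines))
```

With the sample data this prints ['NP_208181.1', 'NP_220259.1']. Open file objects can be passed in place of the two lists.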
ADD REPLY

Hi, thanks for the correction, but I tried it and the output is still blank. Here is my new code:

with open("file") as ID, open("file2") as data:
    for line1 in ID:
        for line2 in data:
            if line1 in line2:
                print(line1)
ADD REPLY

I believe this answer is important: A: Technical question about python "to find the strings"

ADD REPLY

What about comm?

Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.

With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.

  -1              suppress column 1 (lines unique to FILE1)
  -2              suppress column 2 (lines unique to FILE2)
  -3              suppress column 3 (lines that appear in both files)

  --check-order     check that the input is correctly sorted, even
                      if all input lines are pairable
  --nocheck-order   do not check that the input is correctly sorted
  --output-delimiter=STR  separate columns with STR
      --help     display this help and exit
      --version  output version information and exit

Note, comparisons honor the rules specified by `LC_COLLATE'.

Examples:
  comm -12 file1 file2  Print only lines present in both file1 and file2.
  comm -3  file1 file2  Print lines in file1 not in file2, and vice versa.
ADD REPLY

Fixed some duff logic in my answer, it should work for your case now.

ADD REPLY
7.0 years ago
jomo018 ▴ 730

First, you need to strip the end-of-line character from the lines, for example with line1.strip(). Second, with zip you are testing each line only against the corresponding line of the other file. This should catch the first line but none of the others.
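
Both points can be checked in a few lines. This is only an illustrative sketch: io.StringIO objects stand in for the two files, and the variable names are made up for the example.

```python
import io

# In-memory stand-ins for the two files from the question.
file1 = io.StringIO("NP_208181.1\nNP_220259.1\n")
file2 = io.StringIO("NP_208181.1,GCF_000008525.1\nNP_220259.1,GCF_000008725.1\n")

line1 = file1.readline()  # 'NP_208181.1\n'
line2 = file2.readline()  # 'NP_208181.1,GCF_000008525.1\n'

# The newline at the end of line1 sits where line2 has a comma,
# so the raw substring test fails even though the ID is present.
raw_match = line1 in line2
stripped_match = line1.strip() in line2
print(raw_match, stripped_match)

# zip pairs row 1 with row 1, row 2 with row 2, and so on, so an ID is
# only found if it happens to sit on the same row number in both files.
ids = ["NP_208181.1", "NP_220259.1"]
rows = [
    "NP_208181.1,GCF_000008525.1",
    "NP_212206.1,GCF_000008685.2",
    "NP_220259.1,GCF_000008725.1",
]
zip_hits = [i for i, r in zip(ids, rows) if i in r]
print(zip_hits)
```

The first print shows False True, and the second shows only ['NP_208181.1'], because the second ID is paired with a row that does not contain it.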

ADD COMMENT

Hello, do you mean like this?

with open("file") as ID, open("file2") as data:
    for line1.strip() in ID:
        for line2.strip() in data:
            if line1.strip() in line2.strip():
                print(line1.strip())

?

ADD REPLY
7.0 years ago
Joe 21k

Combining my comment and jomo018's point about the line-ending character (stripping is only necessary for file 1, since the strings are contained within the lines of file 2, but I've done both here anyway):

#!/usr/bin/env python

# assume the script is named comparelines.py
# invoke with the ID file as the first commandline arg, 
# and data file as commandline arg 2

import sys

with open(sys.argv[1], 'r') as ID_file, open(sys.argv[2], 'r') as data_file:
    IDs = [ID.strip() for ID in ID_file]
    data = [line.strip() for line in data_file]

    result = [j for i in IDs for j in data if i in j]

    for each in result:
        print(each)

So

$ python comparelines.py IDs.txt data.txt

gives:

NP_208181.1,GCF_000008525.1
NP_220259.1,GCF_000008725.1
NP_224629.1,GCF_000008745.1
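
One caveat with the substring test (i in j): a short ID such as WP_232131 would also match any longer accession that merely contains it, and every ID is compared against every data line. A possible variant looks the IDs up exactly instead, assuming file 2 is always ID,assembly with the ID in the first column. The function name match_ids and the sample data are illustrative only.

```python
import csv
import io


def match_ids(id_file, data_file):
    """Yield (ID, assembly) pairs for IDs in id_file found in column 1 of data_file."""
    # Build a set of wanted IDs for constant-time membership tests.
    wanted = {line.strip() for line in id_file if line.strip()}
    for row in csv.reader(data_file):
        # Skip blank or short rows; compare the whole first field.
        if len(row) >= 2 and row[0] in wanted:
            yield row[0], row[1]


ids = io.StringIO("NP_208181.1\nNP_220259.1\nWP_232131\n")
data = io.StringIO(
    "NP_208181.1,GCF_000008525.1\n"
    "NP_212206.1,GCF_000008685.2\n"
    "NP_220259.1,GCF_000008725.1\n"
)

for protein_id, assembly in match_ids(ids, data):
    print(protein_id, assembly, sep=",")
```

With real files, the two io.StringIO objects would be replaced by the file objects returned by open(). Matches come out in the order they appear in file 2.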

EDIT

Fixed it.

ADD COMMENT
6.9 years ago
shoujun.gu ▴ 380

I believe all your input files are actually CSV files.

Thus, the most efficient way is: 1) read these files into dataframes, 2) inner join on the column you want.

ADD COMMENT
