"big" Data csv search
2
1
Entering edit mode
9.8 years ago
hrbrt.sch ▴ 10

Hey,

I have an computational problem.

I'm using python to iterate over 2 csv files.

csv file1= contains (6-7) columns .. and the important column is an "rs ID" column from dbSNP.

csv file2= 3 columns, 2 of them are important, also the rs ID and a GENE symbol column

my problem:

now I want to search: is an rs ID from csv file 1 IN csv 2 ? if yes, take the gene symbol from csv file 2 and put it into csv file 1.

csv file 1= 1,3 gb, csv file 2 = 8.8 mb

I'm generation a dictionary in python from the csv file 2 and I use it to search in csv file 1.

Problem: for every row(rs ID) in the csv file 1, he iterate through the whole dictionary (8.8mb file)

That takes way to much time.... do you know an another approach to get this search faster? I thought a dictionary/hashtable would be good... but it is way to slow.

Maybe creating a Suffix Array from csv file 2 instead of using a dictionary?

Or are there some packages, other data structures in python (vectorization methods)?

I would be very grateful for your help!

python csv • 7.5k views
ADD COMMENT
5
Entering edit mode

Don't use python: sort both files using unix sort and then use unix join

ADD REPLY
0
Entering edit mode

But I don't want to join them directly. I want to do a pattern search.

csv file 2 has maybe 300000 entries, csv file 1 = millions

And and I have to search every "rs ID" from csv file 2 in csv file 1, if it is occuring/matching, take the gene symbole from csv file 2 to csv file 1 on position "x"

ADD REPLY
0
Entering edit mode

If you sort at least the smaller file, then searching through it becomes rapidly faster. If you sort both of them then you don't even need to store either of them in memory.

Also, if you really need it to be fast then python is probably not the best route. You'll end up getting vastly better performance in C/C++/etc.

ADD REPLY
2
Entering edit mode

That takes way to much time.... do you know an another approach to get this search faster? I thought a dictionary/hashtable would be good... but it is way to slow.

A search of a dictionary or hash table completes in O(1) or constant time. A search of a suffix tree completes in O(n) or linear time, and will generally be slower to construct than a dictionary, as well.

A key-value pairing on the two columns in your second CSV file should give you a fast lookup time for each key match with the first CSV file.

You might post your code and we can take a look. Using a third-party library should be unnecessary for this task.

ADD REPLY
0
Entering edit mode

I'm not sure if 'fast enough' is actually a problem here, if you only need to generate this dataset once, just let it cook overnight or while you're busy with something else. No sense in spending a few days writing a faster version of a script when you already have the data.

If you have to do this sort of thing frequently, I would use a database to solve this issue.

ADD REPLY
1
Entering edit mode

I think @Pierre Lindenbaum is right. Another option would be do it with "awk" in unix. If give an example of your table I could try to suggest you something.

ADD REPLY
0
Entering edit mode

Example:

csv file 1:

RS ID
rs4846029
rs12135483

csv file 2:

rs id     gene symbol
rs123      CASP3

And now I want to search the rs id "rs123" in the csv file 1, furthermore, if there is a match, write on position "x"(match) the gene symbol

ADD REPLY
0
Entering edit mode

sort both files on column rs/RS and use linux join.

ADD REPLY
0
Entering edit mode

But I can't use join.

csv file 2 has e.g. 10 million rows, so there are a lot of rs ID's which are occurring multiple times(the other columns has unique Id's on csv file 1, but not in csv file 2)!

if I just join them, there will be no right match, due to rs ID "rsxyz" occur for example 10 times

ADD REPLY
0
Entering edit mode

see option -v of 'join'

ADD REPLY
0
Entering edit mode

So in your csv file 2 there are only partial ID compared to csv file 1?

ADD REPLY
1
Entering edit mode

Shouldn't a Python dictionary (=a hash table) have O(1) lookup time? It should not have to iterate through 8.8 Mb as you write. Are you sure you implemented it correctly, or did I misunderstand something? I think your approach would work, by initially constructing a dictionary with all the rsids from file 2 as keys (with some arbitrary value like 1) and then just going through file 1 and checking for each line whether the dictionary has the rsid in question as a key.

ADD REPLY
0
Entering edit mode

I exactly did what you wrote. my dict has rs IDs as a key and the values are the gene symbols.

Here is an example code:

# row = row from csv file 1, genes= dict from csv file 2

for row in reader:
    for key,value in genes.iteritems():    
        if row['rs_id']==key:
            row['geneSymbol']=value
ADD REPLY
1
Entering edit mode

Try this and see if it helps:

for row in reader:
      if genes.has_key(row['rs_id']):
             (... do stuff)
ADD REPLY
0
Entering edit mode

I removed the "big data" tag, since a 1 gig file is too small to qualify.

ADD REPLY
0
Entering edit mode

use pandas to read csv files and query them.

ADD REPLY
1
Entering edit mode
9.8 years ago

With python/pandas:

import pandas
file1 = pandas.read_csv('file1')
file2 = pandas.read_csv('file1')
print(file1)
rs_list = file1['rs_id'].tolist()

# Get rows of file2 corresponding to rs ids in file1:
file2[file2['rs_id'].isin(rs_list)]

Alternatively:

file2.query('rs_id in rs_list')

In general, avoid parsing tabular files manually and access them through dictionaries. It is much more versatile to use pandas dataframes in these cases :-).

ADD COMMENT
1
Entering edit mode
9.8 years ago
pld 5.1k

Here's my take on it. It seems what you're doing now is identifying all of the annotations and storing them in memory and then writing to the disk, if all you're doing is a simple join you can skip this and write the symbol as you make your output file:

import csv
import sys

def __main__():
    example_csv_1 = sys.argv[1]
    example_csv_2 = sys.argv[2]
    csv_1_id_col  = int(sys.argv[3])
    csv_2_id_col  = int(sys.argv[4])
    csv_2_sym_col = int(sys.argv[5])
    outfile       = sys.argv[6]

    with open(example_csv_2, 'rb') as fi:
        data = list(csv.reader(fi, delimiter=','))

    symbols = dict()

    for val in data:
        try:
            symbols[val[csv_2_id_col]] == None
            print "Error: Duplicate symbols found for ", val[csv_2_id_col]
            sys.exit(1)

        except KeyError:
            symbols[val[csv_2_id_col]] = val[csv_2_sym_col]

    with open(example_csv_1, 'rb') as fi:
        data = list(csv.reader(fi, delimiter=','))

    with open(outfile, 'w') as fi:
        for d in data:
            id = d[csv_1_id_col]
            try:
                sym = symbols[id]
            except KeyError:
                print "Error no gene symbol found for ", id
                sys.exit(1)

            fi.write(','.join(d + [sym]))
            fi.write('\n')

    return 0

__main__()

csv1:

a,stuff
b,stuff
c,stuff
d,stuff

csv2:

a,symA
b,symB
c,symC
d,symD

output:

a,stuff,symA
b,stuff,symB
c,stuff,symC
d,stuff,symD

Again, if all you want in the end is a csv file with annotated with gene symbols, there's no need to make it any more complex than it has to. Just grab the symbols as you write your output file. This will save time since you don't have to iterate through your large CSV file twice. It will also save memory since you don't have any overhead associated with storing the gene symbol for each row (big CSV file) in memory.

This script assumes that each id is associated with a single symbol. If there are multiple symbols per ID it can be easily updated. If you know there aren't any genes with missing symbols (big CSV) or genes with multiple symbols (small CSV), you can omit the checks I have in there and gain a bit of performance.

ADD COMMENT

Login before adding your answer.

Traffic: 1713 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6