Hey,
I have an computational problem.
I'm using python to iterate over 2 csv files.
csv file1= contains (6-7) columns .. and the important column is an "rs ID" column from dbSNP.
csv file2= 3 columns, 2 of them are important, also the rs ID and a GENE symbol column
my problem:
now I want to search: is an rs ID from csv file 1 IN csv 2 ? if yes, take the gene symbol from csv file 2 and put it into csv file 1.
csv file 1= 1,3 gb, csv file 2 = 8.8 mb
I'm generation a dictionary in python from the csv file 2 and I use it to search in csv file 1.
Problem: for every row(rs ID) in the csv file 1, he iterate through the whole dictionary (8.8mb file)
That takes way to much time.... do you know an another approach to get this search faster? I thought a dictionary/hashtable would be good... but it is way to slow.
Maybe creating a Suffix Array from csv file 2 instead of using a dictionary?
Or are there some packages, other data structures in python (vectorization methods)?
I would be very grateful for your help!
Don't use python: sort both files using unix
sort
and then use unixjoin
But I don't want to join them directly. I want to do a pattern search.
csv file 2 has maybe 300000 entries, csv file 1 = millions
And and I have to search every "rs ID" from csv file 2 in csv file 1, if it is occuring/matching, take the gene symbole from csv file 2 to csv file 1 on position "x"
If you sort at least the smaller file, then searching through it becomes rapidly faster. If you sort both of them then you don't even need to store either of them in memory.
Also, if you really need it to be fast then python is probably not the best route. You'll end up getting vastly better performance in C/C++/etc.
A search of a dictionary or hash table completes in O(1) or constant time. A search of a suffix tree completes in O(n) or linear time, and will generally be slower to construct than a dictionary, as well.
A key-value pairing on the two columns in your second CSV file should give you a fast lookup time for each key match with the first CSV file.
You might post your code and we can take a look. Using a third-party library should be unnecessary for this task.
I'm not sure if 'fast enough' is actually a problem here, if you only need to generate this dataset once, just let it cook overnight or while you're busy with something else. No sense in spending a few days writing a faster version of a script when you already have the data.
If you have to do this sort of thing frequently, I would use a database to solve this issue.
I think @Pierre Lindenbaum is right. Another option would be do it with "awk" in unix. If give an example of your table I could try to suggest you something.
Example:
csv file 1:
csv file 2:
And now I want to search the rs id "rs123" in the csv file 1, furthermore, if there is a match, write on position "x"(match) the gene symbol
sort both files on column rs/RS and use linux join.
But I can't use join.
csv file 2 has e.g. 10 million rows, so there are a lot of rs ID's which are occurring multiple times(the other columns has unique Id's on csv file 1, but not in csv file 2)!
if I just join them, there will be no right match, due to rs ID "rsxyz" occur for example 10 times
see option -v of 'join'
So in your csv file 2 there are only partial ID compared to csv file 1?
Shouldn't a Python dictionary (=a hash table) have O(1) lookup time? It should not have to iterate through 8.8 Mb as you write. Are you sure you implemented it correctly, or did I misunderstand something? I think your approach would work, by initially constructing a dictionary with all the rsids from file 2 as keys (with some arbitrary value like 1) and then just going through file 1 and checking for each line whether the dictionary has the rsid in question as a key.
I exactly did what you wrote. my dict has rs IDs as a key and the values are the gene symbols.
Here is an example code:
Try this and see if it helps:
I removed the "big data" tag, since a 1 gig file is too small to qualify.
use pandas to read csv files and query them.