Hello,
I am analyzing the results from two different statistical tests of genome-wide SNP data. Each test outputs a list of rows, each with a score value and a genomic position. I have already filtered the files so that they contain only the highest scores for each test. I am interested in the concordance between the two tests, so I would like to use a Python script to go through the two files and identify any genomic positions that are picked up by both tests. However, I do not want to be too stringent, so the list should include any positions that are within 1 Mb of each other. I am a beginner programmer and am not sure how to write a loop that identifies matches conditional on their being within 1 Mb of each other. Any regions that meet this criterion should then be written to a new file.
Here are examples of the two files (no headers):
Rn34_2155934833 155934833 1.30383e+06 1.50241e+06 -0.141762
Rn34_2167031291 167031291 1.96651e+06 3.47144e+06 -0.568305
Rn34_2178882599 178882599 1.89353e+06 2.00596e+06 -0.0576771
and
2 90152 439 180348277.000000 40.978182 10.406474 0.150000
2 97679 311 195402277.000000 44.399545 19.557207 0.150000
2 102957 486 205958277.000000 46.798636 13.764937 0.150000
In the first file the 2nd column is physical distance (e.g. 155934833), and in the second file the 4th column is physical distance (e.g. 180348277.000000). So I suppose a loop that reads the 2nd column of each row in file 1, compares it to all the 4th-column values in file 2, and outputs both rows whenever a match is found within 1 megabase would be perfect, though I am sure more experienced people know tidier ways. The other columns contain score information, chromosome, etc., but I think they are irrelevant for the required function.
Any help is really appreciated, especially in Python, as I am learning this language.
Many thanks,
Rubal7
By physical distance, do you mean location on a chromosome?
Do you want to do a run-time comparison? My guess, from fastest to slowest: 1. BEDTools, 2. R/IRanges, 3. Python looping, 4. Galaxy.
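For the plain Python looping option, here is a minimal sketch of the loop described in the question. It is only one possible way to do it and makes a few assumptions that are not confirmed in the thread: both files are whitespace-delimited with no header, every row is on the same chromosome (as in the examples), and "within 1 Mb" means an absolute difference of at most 1,000,000 bp. The function names and the `WINDOW` constant are made up for illustration.

```python
WINDOW = 1_000_000  # 1 Mb; assumed to mean an absolute difference of <= 1,000,000 bp


def read_rows(path, pos_col):
    """Read a whitespace-delimited file with no header.

    Returns a list of (position, original_line) pairs, where the position
    is taken from the zero-based column index pos_col.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            fields = line.split()
            # float() handles both "155934833" and "180348277.000000"
            rows.append((float(fields[pos_col]), line))
    return rows


def find_matches(rows1, rows2, window=WINDOW):
    """Compare every row of file 1 with every row of file 2.

    Returns a list of (line1, line2) pairs whose positions differ by at
    most `window`. Chromosome is NOT checked: this assumes all rows are
    on the same chromosome.
    """
    matches = []
    for pos1, line1 in rows1:
        for pos2, line2 in rows2:
            if abs(pos1 - pos2) <= window:
                matches.append((line1, line2))
    return matches


def main(file1, file2, out_path):
    rows1 = read_rows(file1, 1)  # file 1: position is the 2nd column
    rows2 = read_rows(file2, 3)  # file 2: position is the 4th column
    with open(out_path, "w") as out:
        for line1, line2 in find_matches(rows1, rows2):
            # write both original rows on one line, separated by a tab
            out.write(line1 + "\t" + line2 + "\n")
```

To run it, call something like `main("test1.txt", "test2.txt", "matches.txt")`, where the three file names are placeholders. The nested loop compares every row against every row, which should be fine for files that have already been filtered down to the top scores. If the files held more than one chromosome, a chromosome check would have to be added to the `if` condition.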