Detect Regions In A List That Are Within 1 Megabase Of Each Other
4
1
Entering edit mode
13.1 years ago
Rubal ▴ 350

Hello,

I am analyzing the results from two different statistical tests of genome wide SNP data. Each test outputs a list of rows with score value and also the genomic position. I have already filtered the files so they contain only the highest scores for each test. I am interested in observing the concordance between the two tests. So I would like to use a python script to go through the two files and identify any genomic positions that are identified by both tests. However I do not want to be too stringent so I want the list to include any regions that are within 1mb of each other. I am a beginner programmer and am not sure how to make a loop that can identify matches conditional on being within 1mb of each other. Any regions that match this criteria should then be written into a new file.

Here are examples of the two files (no headers):

Rn34_2155934833 155934833 1.30383e+06 1.50241e+06 -0.141762
Rn34_2167031291 167031291 1.96651e+06 3.47144e+06 -0.568305
Rn34_2178882599 178882599 1.89353e+06 2.00596e+06 -0.0576771

and

2 90152 439 180348277.000000 40.978182 10.406474 0.150000
2 97679 311 195402277.000000 44.399545 19.557207 0.150000
2 102957 486 205958277.000000 46.798636 13.764937 0.150000

In the first file the 2nd column is physical distance (eg 155934833) and in the second file the 4th column is physical distance (eg 180348277.000000). So I suppose a loop that reads in and compares the 2nd column of each row in file 1 to all the 4th column values in the rows in file 2 and then outputs both rows if a match is found within 1 megabase would be perfect. Though I am sure more experienced people know tidier ways. The other columns contain score information, chromosome etc but I think are irrelevant for the required function.

Any help is really appreciated, especially in python as I am learning this language.

Many thanks,

Rubal7

python snp genomics overlap • 4.7k views
ADD COMMENT
0
Entering edit mode

by physical distance, you mean location on a chromosome?

ADD REPLY
0
Entering edit mode

Do you want to do a run time comparison? My guess is 1. (fastest) BEDTools, 2. R/IRanges, 3. python looping, 4. Galaxy

ADD REPLY
3
Entering edit mode
13.1 years ago
brentp 24k

Convert your data into BED format (chrom, start, stop) and this becomes an easy problem thanks to BedTools.

You can use simple unix commands to get to BED format and PAD by your distance, then intersectBed will find the intersection of those padded regions.

#!/bin/bash

DIST=1000000
PATH=$PATH:/usr/local/src/bedtools/bin

#convert to bed. pad by 1/2 the distance.
awk -vd=$(($DIST/2)) 'BEGIN {OFS="\t"}{ print "chrFake",$2-d-1,$2+d,$0 }' file1.txt > file1.bed
awk -vd=$(($DIST/2)) 'BEGIN {OFS="\t"}{ print "chrFake",$4-d-1,$4+d,$0 }' file2.txt > file2.bed

# cut -f 4,8 works because the oringal data is space-seperated. you'll have 
# to adjust # if it's actually tab-delimited.
intersectBed -a file1.bed -b file2.bed -wo | cut -f 4,8 > theanswer.txt
ADD COMMENT
2
Entering edit mode
13.1 years ago

What you propose seems reasonable. However, it might be interesting to convert your files to gff format (using python, of course). Then, tools such as BEDtools, Galaxy, and others could be applied on the files directly to find the overlaps of interest.

For a more general solution to range queries in python, consider a library that includes one of several variations on interval trees such as bx-python.

ADD COMMENT
0
Entering edit mode

thanks, I'll look into this library.

ADD REPLY
2
Entering edit mode
13.1 years ago
Niek De Klein ★ 2.6k

easy but probably not so fast python solution:

#opeing file1 and the file to write to
file1 = open("file.txt", "r")
outfile = open("result.txt","w")
fileString = "" #empty string

#looping trough every line of file1
for line in file1:
    print line
    #opening second file
     file2 = open("file2.txt","r")
    #parsing the distance out of file 1
    physicDistance1 = int(line.split(" ")[1])   #looks like it's space delimited, if it's tab delimited use line.split(" \t")
    #for every line in file1, loop trough every line in file2
    for line2 in file2:
        #parse distance out of file 2
        physicDistance2 = int(line2.split(" ")[3].split(".")[0]) #or line2.split("\t")
        #if distance1 - distance 2 or distance2 - distance1 equal or smaller than 1000 
        if physicDistance1 - physicDistance2 <= 1000 or physicDistance2 - physicDistance1 <= 1000:
             #dont know what exactly you want in the new file, but I put the lines that are equal behind eachother
            fileString += str(line) +"\t" + str(line2)

outfile.write(fileString)
outfile.close()
ADD COMMENT
2
Entering edit mode

Niek, you need to open 'file2' within the file1 loop. Otherwise file2 will be ready completely on the first 'line' in file1 and will be empty for each subsequent line.

ADD REPLY
1
Entering edit mode

there are still quite a few syntax errors in there. =< should be <= and you need an extra ")" on the line where you define physicDistance2. You can also write directly to outfile rather than saving up a long string.

ADD REPLY
0
Entering edit mode

Thanks very much, I'll play around with this script and let you know if it does the job!

ADD REPLY
0
Entering edit mode

You're right, wrote this on the fly. I will edit it. Thanks!

ADD REPLY
0
Entering edit mode

I get a syntax error on the first =< on line 16. Any idea why?

ADD REPLY
0
Entering edit mode

@Rubal see brentp's comment. Got those syntaxes out.

ADD REPLY
0
Entering edit mode

I've tried with the new syntax and there is still a problem with this line, now the syntax error is at the = . Any ideas? Rest of it seems to work fine thanks

ADD REPLY
0
Entering edit mode

Hmm, a problem occurs that is maybe due to the many .0000s after the physical position values in one file. Anyone know what the root problem is likely to be here -see error message below:

File "Detectoverlappingregions.py", line 14, in <module> physicDistance2 = int(line2.split(" ")[3]) #or line2.split("\t") ValueError: invalid literal for int() with base 10: '4709301.000000

ADD REPLY
0
Entering edit mode

When there is a . in a value python can't make an int out of it. Are those zeros important? Then you could use float(line2.split(" ")[3]) instead of using the int() or if what is after the . is not important, you could use int(line2.split(" ")[3].split(".")[0])

ADD REPLY
0
Entering edit mode

Traceback (most recent call last): @Niek de Klein, thanks, now I think hopefully the final issue is this:

File "Detectoverlappingregions.py", line 20, in <module> outfile.write(fileString) AttributeError: 'tuple' object has no attribute 'write'

ADD REPLY
0
Entering edit mode

@Niek de Klein, thanks, now hopefully this is the last issue:

Traceback (most recent call last): File "Detectoverlappingregions.py", line 20, in <module> outfile.write(fileString) AttributeError: 'tuple' object has no attribute 'write'

ADD REPLY
0
Entering edit mode

Little mistake, forgot to put open infront of ("result.txt","w"), changed it now

ADD REPLY
0
Entering edit mode

Script runs (after changing '=' to '-' on the 'if PhysicDistance1' line. But the output only contains 1 line from the first file, which is repeated every other line (see below) along with what I believe are potentially the correct lines from the 2nd file. Any ideas how to fix this greatly appreciated:

Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 85419 532 171085301.000000 30.506786 30.562953 0.150000 Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826

ADD REPLY
0
Entering edit mode

Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 85419 532 171085301.000000 30.506786 30.562953 0.150000 Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 77944 323 156135301.000000 27.837143 19.561030 0.150000 Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 83239 296 166725301.000000 29.728214 30.324539 0.150000 Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 85419 532 171085301.000000 30.506786 30.562953 0.150000

ADD REPLY
0
Entering edit mode

So the script seems to only print the first line of the first input file, but several different lines from the 2nd input file. Presumably to output the several different lines from the 2nd input file it is looking at more than the first line of the first file, but not sure how. Thanks for any help.

ADD REPLY
0
Entering edit mode

can you try it with adding +"n" at the end of fileString += str(line) +"t" + str(line2)+"n" (see the example above)

ADD REPLY
0
Entering edit mode

Thank Niek, but same output, only with an empty line spacing between each duplicate pair (still with 1st line being always the first line from file1 and the 2nd line being a different line from file2)

ADD REPLY
0
Entering edit mode

if you do print line in the first loop, what does it print?

ADD REPLY
0
Entering edit mode

sorry python newby, could you add in the example above how i implement your suggestion. thank you

ADD REPLY
0
Entering edit mode

It prints out all the lines in order from file1, with a space between each line

ADD REPLY
0
Entering edit mode

I would discourage such a 'brute-force' approach, without a supporting data structure, the performance of such a naive implementation is terrible.

ADD REPLY
0
Entering edit mode
13.1 years ago
Michael 55k

Oh, too tempting to give a quick R answer. Given your data is in file1.txt and file2.txt and the files have exactly the layout described in your example.

library(IRanges) # need to install this library before

## read in the input files efficiently using scan
int1 = scan(file="file1.txt", what=list(NULL, numeric(),NULL, NULL, NULL))
int2 = scan( file="file2.txt", what=list(NULL,NULL, NULL, numeric(), NULL, NULL, NULL))

## create the objects holding the genomic positions
ir1 = IRanges(start=as.integer(int1[[2]]), width=1) # use column 2 SNP is an interval of length 1
ir2 = IRanges(start=as.integer(int2[[4]]), width=1) # use column 4, make an integer of the float

## output the positions from file1 which overlap file2 with maximum gap of 1e6 bp, one per line
cat(start(ir1[countOverlaps(ir1, ir2, maxgap=1e6)>0]), file="", sep="\n")
## output postions matching from file2 in file1
cat(start(ir2[countOverlaps(ir2, ir1, maxgap=1e6)>0]), file="", sep="\n")

## quick'n'dirty solution to get the full output back:

ind1 = countOverlaps(ir1, ir2, maxgap=1e6)>0
ind2 = countOverlaps(ir2, ir1, maxgap=1e6)>0
write.table( read.table(file="file1.txt")[ind1,], col.names=F, quote=F, row.names=F, file=stdout() ) # replace stdout() with a filename, read ?read.table, ?write.table for more
write.table( read.table(file="file2.txt")[ind2,], col.names=F, quote=F, row.names=F, file=stdout() )

Note: add a filename to file="" to print to a file instead of to stdout. Your example didn't contain overlapping positions, then output is empty.

ADD COMMENT
0
Entering edit mode

great, this also seems to work, but can you make it output the entire row with the matching column, not just the column value?

ADD REPLY
0
Entering edit mode

sure xD, will take a while though

ADD REPLY
0
Entering edit mode

If you have time to figure it out, please do let me know! Cheers

ADD REPLY
0
Entering edit mode

here you are, you might have to adapt a little to fit your file format

ADD REPLY
0
Entering edit mode

@Michael Dondrup, thanks very much for all your help, this is working great

ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6