I have two tab delimited files. Each has two columns, one for chromosome and one for variant co-ordinate. I want to identify the number of positions in file2 that are within a specified window range of the positions in file1, and then check in the next window, and the next and so forth.
So if this is the first row from file 1:
1 10500
And this is from file 2:
1 10177
1 10352
1 10616
1 11008
1 11012
1 13110
1 13116
1 13118
1 13273
1 13550
2 10700
2 12321
2 34234
If the window is of size 1000, the following co-ordinates from file 2 fall within this window:
1 10177
1 10352
1 10616
The second window should then be 2000 in size and therefore these co-ordinates fall within this window:
1 10177
1 10352
1 10616
1 11008
1 11012
What I want to do is find the mean number of variants in each window. So for each variant in file 1, calculate the number of variants in a 1kb window, and then calculate the mean. I then want to do this for the larger 2kb window and so on and so forth to a specified window size (e.g. 100kb).
I have tried writing a python function to do this and then looping through the windows. The issue is that file1 has 11609 variants, and file2 has 13758644 so this is proving extremely time consuming.
I was wondering if anyone knows of any better way of doing this?
Below are my python functions. It is a bit long:
#Function counts number of variants in a window of specified size around a specified
#genomic position. nsPos is the variant position from file1, df is the dataframe of file2.
def vCount(nsPos, df, windowSize):
#If the window minimum is below 0, set to 0
if(nsPos - (windowSize/2) < 0):
low = 0
else:
low = nsPos - (windowSize/2)
high = high = nsPos + (windowSize/2)
#calculate the length (i.e. the number) of variants in the subsetted data.
diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])
return(diversity)
#Function to apply vCount function across the entire positional column of df.
#NSdf is the dataframe of file1, variantsDF is dataframe from file2.
def windowAvg(NSdf, variantsDF, window):
x=NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
return(x)
I then run this second function through a nested loop to produce a plot:
def loopNplot(NSdf, variantsDF, window):
#store window number
windows = list(range(1,101))
#store diversity (the mean number of variants)
diversity = []
#Loop through windows appending to the list.
for i in range(window, 101000, window):
#Create a list to hold counts for each chromosome (which we then take the mean of)
tempList = []
#Loop through chromosomes
for j in NSdf['CHROM']:
#Subset file1 and file2 dfs for only the relevant chromosome.
subDF = vtable.loc[(vtable['CHROM'] == j)].copy()
#Subset all variants
subVar = all_variants.loc[(all_variants['CHROM'] == j)].copy()
#run function
x = windowAvg(subDF, subVar, i)
#convert series to list
x = x.tolist()
#extend templist to include x
tempList.extend(x)
#Append mean of tempList - counts of diversity - to list.
diversity.append(sum(tempList)/len(tempList))
#Copy diversity
divCopy = list(diversity)
#Add a new first index of 0
diversity = [0] + diversity
#Remove last index
diversity = diversity[:-1]
#x - diversity to give number of variants in each window
div = list(map(operator.sub, divCopy, diversity))
plt.scatter(windows, div)
plt.show()
@Pierre's VarintsInWindow tool.
thanks, but it only works with vcf :-)
your final aim is to produce a plot isn't it ? what is this plot ? (x and y axis ?)
Yes, sorry I didn't make this very clear.I want to plot window number on the x axis and div (the mean number of variants) on the y axis.
spiral01 : It is important that you validate/acknowledge (by up-voting/accepting) answers below. That is the only way we know that your original problem has been solved. You can (and should) test all answers suggested and accept all those that provide the expected result.