Creating bed files for consecutive genomic windows out from polymorphisms
1
0
Entering edit mode
6.9 years ago
spiral01 ▴ 110

I have a set of polymorphisms and their position on each chromosome in a text file. I also have vcf files of variants (1000 genomes). I want to identify the number of variants within 1kb windows of each polymorphism.

To do this I think I need to create bed files for each appropriate window.

So for example, if I have polymorphisms at the following positions:

2000
3000
4000
5000

The first bed file would show start and end positions:

1500 2499
2500 3499
3500 4499
4500 5499

Then the second bed file would show:

1000 2999
2000 3999
3000 4999
4000 5999

Because I'd like to know the number in same size windows I'd then subtract the number of variants found from the first bed file from the number in the second.

And so on and so forth for the length of the chromosome..

I know how to count the variants, I just don't know how to automate the creation of such a vast number of bed files. Does anyone have any suggestions?

SNP • 1.6k views
ADD COMMENT
0
Entering edit mode
6.9 years ago
spiral01 ▴ 110

I have ended up using a simple python pandas function. df is the table with the polymorphism position. window is the size of the window to use around each polymorphism. chromSize is the length of the chromosome in question, used to ensure the end co-ordinates do not exceed the length.

def makeBedFiles(df, window, chromSize, fileSuffix):
    #Add 'START' and 'END' columns that are windows of bed file
    df['START'] = np.where(df['POS'] - window >= 0, df['POS'] - window, 0)
    df['END'] = np.where(df['POS'] + window <= chromSize, df['POS'] + window, chromSize)
    #Keep only chromosome, start and end columns
    df = df[['CHROM', 'START', 'END']]
    fileName = fileSuffix + str(window) + '.txt'
    #Write df to file
    df.to_csv(fileName, header=None, index=None, sep='\t', mode='a')

Having done this I then loop through a range from 0 to the chromosome size (249250621) divided by the window size (500) - chromosome is Chr1 from hg19 1000genomes:

for i in range(0, 498502):
    makeBedFiles(table, window = (500*i), chromSize = 249250621, 
                 fileSuffix = ('str(i)))

I am sure there are better, cleaner and quicker ways to do this but so far this is all I have come up with.

ADD COMMENT

Login before adding your answer.

Traffic: 2586 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6