Assuming your reference genome is hg38 (replace it with the key for your assembly of choice), you can do this pretty easily with bedops --chop and awk:
$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 100000 - | awk -v FS="\t" -v OFS="\t" '{ print $2+1"-"$3 }' > 100k.1based.txt
The last window for each nuclear chromosome will almost certainly be less than 100k nt in size.
If you don't want that straggler in your text file, you can exclude it by adding -x to bedops --chop, i.e.:
$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 100000 -x - | awk -v FS="\t" -v OFS="\t" '{ print $2+1"-"$3 }' > 100k.1based.noStragglingBin.txt
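To make the straggler behaviour concrete, here is a minimal awk-only sketch of the same chopping arithmetic on a toy sequence. It does not call bedops; chop_windows is a hypothetical helper name, and the 250,000-nt length is a made-up value chosen so that the last window comes up short.

```shell
# chop_windows SIZE WINDOW [DROP_LAST]
# Hypothetical helper (not part of bedops): prints 1-based, inclusive
# "start-end" windows for a single sequence of length SIZE.
# If DROP_LAST is 1, a final window shorter than WINDOW is skipped.
chop_windows() {
  awk -v size="$1" -v win="$2" -v drop="${3:-0}" 'BEGIN {
    for (start = 0; start < size; start += win) {
      end = start + win
      if (end > size) {
        if (drop) break      # skip the short final window
        end = size           # otherwise clip it to the sequence end
      }
      print (start + 1) "-" end
    }
  }'
}

chop_windows 250000 100000
# 1-100000
# 100001-200000
# 200001-250000   <- the 50,000-nt straggler

chop_windows 250000 100000 1
# 1-100000
# 100001-200000
```

The third argument plays the role that -x plays in the pipeline above: it only decides whether the short final window is printed.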
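Further, you will have bins for each chromosome, and because the output drops the chromosome name, the same start-end ranges (1-100000, 100001-200000, and so on) will show up once per chromosome as duplicates. It is not clear from your question how you intend to handle that. If you just want bins for the largest chromosome, use awk to filter on that chromosome: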
$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '($1 == "chr1"){ print $1, "0", $2 }' | bedops --chop 100000 - | awk -v FS="\t" -v OFS="\t" '{ print $2+1"-"$3 }' > 100k.chr1.1based.txt
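If you would rather keep every chromosome, one alternative (my assumption about what might suit you, since the question does not say) is to keep the chromosome name in each line so the ranges no longer collide. A minimal sketch, where label_windows is a hypothetical helper that would replace the final awk step, fed here with two made-up BED lines:

```shell
# label_windows: hypothetical replacement for the final awk step.
# Reads 0-based BED (chrom, start, end) on standard input and prints
# 1-based "chrom:start-end", so identical ranges on different
# chromosomes stay distinguishable.
label_windows() {
  awk -v FS="\t" '{ print $1 ":" ($2 + 1) "-" $3 }'
}

printf 'chr1\t0\t100000\nchr2\t0\t100000\n' | label_windows
# chr1:1-100000
# chr2:1-100000
```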
Learn to do things with standard input and output. It will make your life easier and your work faster!
Also, not to be pedantic, but these are not typically called sliding windows. Sliding windows step over the genome at increments smaller than the window size, which leads to overlapping bins. These are disjoint windows. I'm mentioning the terminology because it can help you better communicate your experiment to others.
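For contrast, here is a rough awk-only sketch of what true sliding windows look like on a toy sequence. sliding_windows is a hypothetical helper, and the 250,000-nt length and 50,000-nt step are made-up values. The bedops documentation also describes a --stagger option for --chop that produces overlapping bins; this sketch does not use it and only illustrates the arithmetic.

```shell
# sliding_windows SIZE WINDOW STEP
# Hypothetical helper: prints 1-based, inclusive "start-end" windows of
# exactly WINDOW nt, advancing by STEP nt each time. With STEP < WINDOW
# the windows overlap; with STEP == WINDOW they are disjoint.
sliding_windows() {
  awk -v size="$1" -v win="$2" -v step="$3" 'BEGIN {
    for (start = 0; start + win <= size; start += step)
      print (start + 1) "-" (start + win)
  }'
}

sliding_windows 250000 100000 50000
# 1-100000
# 50001-150000
# 100001-200000
# 150001-250000
```

Each window here shares 50,000 nt with its neighbour, which is what distinguishes sliding windows from the disjoint bins produced earlier.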
Hope this helps!
Why is this not what you want, or in what other way would you want it to work?
I would like to have a file with the last sliding window being the size of the whole genome.